Search text in PDF files using Java (Apache Lucene and Apache PDFBox)
https://www.programming-free.com/2012/11/simple-word-search-in-pdf-files-using.html?m=0
I recently came across a requirement to find out whether a specific word is present in a PDF file. At first I thought this was very simple, so I created a small Java application that extracted the text from the PDF file and then did a linear character match, such as mystring.contains(mysearchterm). It gave me the expected output, but linear character matching is suitable only when the content you are searching is very small. Otherwise it is very expensive: in complexity terms it is O(np), where n = number of words to search and p = number of search terms.
The best solution is to use a simple search engine, which first parses all your data into tokens to create an index and then lets us query that index to retrieve matching results. In other words, the whole content is first broken down into terms, and each term points to the content that contains it. For example, consider this raw data,
1,hello world
2,god is good all the time
3,all is well
4,the big bang theory
The search engine will create an index like this (only some of the terms are shown),
all-> 2,3
hello-> 1
is->2,3
good->2
world->1
the->2,4
god->2
big->4
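As a rough illustration of the index shown above, here is a minimal sketch in plain Java (it does not use Lucene). The class name InvertedIndexSketch and its methods are made up for this example, and it only lower-cases the text and splits on whitespace, whereas a real analyzer does a lot more.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Lucene.
class InvertedIndexSketch {

    // Maps each lower-cased term to the sorted set of document ids that contain it.
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    // The four sample rows from the post.
    static Map<Integer, String> sampleDocs() {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "hello world");
        docs.put(2, "god is good all the time");
        docs.put(3, "all is well");
        docs.put(4, "the big bang theory");
        return docs;
    }

    public static void main(String[] args) {
        // Prints one line per term, for example: all -> [2, 3]
        build(sampleDocs()).forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

Looking up a word is then a single map lookup instead of a scan over every document, which is the point of building the index first.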
Full-text search engines are what I am referring to here; they search large volumes of unstructured text quickly and effectively. There are many other things you can do with a search engine, but I am not going to deal with any of them in this post. The aim is to show you how to create a simple Java application that can search for a particular keyword in PDF documents and tell you whether a document contains that keyword or not. That being said, the open source full-text search engine that I am going to use for this purpose is Apache Lucene, a high-performance, full-featured text search engine written entirely in Java. Apache Lucene does not have the ability to extract text from PDF files. All it does is create an index from text and then let us query that index to retrieve the matching results. To extract text from PDF documents, let us use Apache PDFBox, an open source Java library that extracts the content of PDF documents so that it can be fed to Lucene for indexing.
Let's get started by downloading the required libraries. Please stick to the versions of the software that I am using, since later versions may require a different implementation.
1. Download Apache Lucene 3.6.1 from here. Unzip the content and find lucene-core-3.6.1.jar.
2. Download Apache PDFBox 0.7.3 from here. Unzip it and find pdfbox-0.7.3.jar.
3. Download fontbox-0.1.0.jar from here. The project will fail with a class-not-found error if this library is not present.
The next step is to create a Java project in Eclipse. Right-click the project in the Project Explorer, go to Configure Build Path -> Add External JARs, add lucene-core-3.6.1.jar, pdfbox-0.7.3.jar and fontbox-0.1.0.jar, and click OK.
4. Create a class named SimplePDFSearch.java. This is the main class that performs each action one by one. Paste the code below into this class, and change the package name to the package in which you are creating the class.
package com.programmingfree.simplepdfsearch;

import org.apache.lucene.queryParser.ParseException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class SimplePDFSearch {
    // location where the index will be stored.
    private static final String INDEX_DIR = "src/main/resources/index";
    private static final int DEFAULT_RESULT_SIZE = 100;

    public static void main(String[] args) throws IOException, ParseException {
        File pdfFile = new File("src/resources/SamplePDF.pdf");
        IndexItem pdfIndexItem = index(pdfFile);

        // creating an instance of the indexer class and indexing the items
        Indexer indexer = new Indexer(INDEX_DIR);
        indexer.index(pdfIndexItem);
        indexer.close();

        // creating an instance of the Searcher class to query the index
        Searcher searcher = new Searcher(INDEX_DIR);
        int result = searcher.findByContent("Hello", DEFAULT_RESULT_SIZE);
        print(result);
        searcher.close();
    }

    // Extract text from PDF document
    public static IndexItem index(File file) throws IOException {
        PDDocument doc = PDDocument.load(file);
        String content = new PDFTextStripper().getText(doc);
        doc.close();
        return new IndexItem((long) file.getName().hashCode(), file.getName(), content);
    }

    // Print the results
    private static void print(int result) {
        if (result == 1)
            System.out.println("The document contains the search keyword");
        else
            System.out.println("The document does not contain the search keyword");
    }
}
5. Next, create a class that holds the items to be indexed from a PDF file. Create a class named IndexItem.java and paste the code below into it. This tells the search engine to store and retrieve the following for each PDF file: a unique ID, the file name and the contents (text) of the file.
package com.programmingfree.simplepdfsearch;

public class IndexItem {
    private Long id;
    private String title;
    private String content;

    public static final String ID = "id";
    public static final String TITLE = "title";
    public static final String CONTENT = "content";

    public IndexItem(Long id, String title, String content) {
        this.id = id;
        this.title = title;
        this.content = content;
    }

    public Long getId() {
        return id;
    }

    public String getTitle() {
        return title;
    }

    public String getContent() {
        return content;
    }

    @Override
    public String toString() {
        return "IndexItem{" + "id=" + id + ", title='" + title + '\'' + ", content='" + content + '\'' + '}';
    }
}
6. The next step is to create a class that indexes the contents of the PDF documents. Create a new class named Indexer.java, as referenced in the main class, and paste the code below into it,
package com.programmingfree.simplepdfsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class Indexer {
    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        // create the index
        if (writer == null) {
            writer = new IndexWriter(FSDirectory.open(new File(indexDir)),
                    new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
        }
    }

    /**
     * This method will add the items into index
     */
    public void index(IndexItem indexItem) throws IOException {
        // deleting the item, if already exists
        writer.deleteDocuments(new Term(IndexItem.ID, indexItem.getId().toString()));

        Document doc = new Document();
        doc.add(new Field(IndexItem.ID, indexItem.getId().toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field(IndexItem.TITLE, indexItem.getTitle(), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field(IndexItem.CONTENT, indexItem.getContent(), Field.Store.YES, Field.Index.ANALYZED));

        // add the document to the index
        writer.addDocument(doc);
    }

    /**
     * Closing the index
     */
    public void close() throws IOException {
        writer.close();
    }
}
7. The last step is to create a class that queries the index created by the Indexer class. Create a class named Searcher.java and paste the code below into it.
package com.programmingfree.simplepdfsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class Searcher {
    private IndexReader reader;
    private IndexSearcher searcher;
    private QueryParser contentQueryParser;

    public Searcher(String indexDir) throws IOException {
        // open the index directory to search
        reader = IndexReader.open(FSDirectory.open(new File(indexDir)));
        searcher = new IndexSearcher(reader);
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // defining the query parser to search items by content field.
        contentQueryParser = new QueryParser(Version.LUCENE_36, IndexItem.CONTENT, analyzer);
    }

    /**
     * This method is used to find the indexed items by the content.
     * @param queryString - the query string to search for
     * @param numOfResults - the maximum number of hits to collect
     * @return 1 if at least one document matches, 0 otherwise
     */
    public int findByContent(String queryString, int numOfResults) throws ParseException, IOException {
        // create query from the incoming query string.
        Query query = contentQueryParser.parse(queryString);

        // execute the query and get the results
        ScoreDoc[] queryResults = searcher.search(query, numOfResults).scoreDocs;
        if (queryResults.length > 0)
            return 1;
        else
            return 0;
    }

    public void close() throws IOException {
        searcher.close();
        // the searcher does not close a reader that was passed to it, so close it here
        reader.close();
    }
}
That is all we have to do before running this program to find, quickly and efficiently, whether a word is present in a PDF file. Note that the main class (SimplePDFSearch.java) has a field named INDEX_DIR, which contains the path where the index will be stored. Every time this program is run, the previously indexed entry with the same ID is deleted and the document is indexed again, so the index does not accumulate duplicates. I have used a sample PDF document that contains the following text,
"Hello World by PDFBox"
I am searching for the word "Hello", which is passed as a parameter to the findByContent method of the Searcher class, and the output is,
The document contains the search keyword
Download the source code (use the download button at the beginning of this article) and practice it yourself to understand this better.
Please leave your comments and queries about this post in the comment section so that I can improve my writing and publish more useful posts. Thanks for reading!!
Hello, I'm trying to use PhraseQuery to search for the exact phrase 'Hello World'. Can you help me?
Hello Luciano,
You should use the PhraseQuery class instead of the Query class.
// search for documents that have "foo bar" in them
String sentence = "foo bar";
IndexSearcher searcher = new IndexSearcher(directory);
PhraseQuery query = new PhraseQuery();
String[] words = sentence.split(" ");
for (String word : words) {
    // use the field name you indexed (IndexItem.CONTENT in this post);
    // terms must match the analyzed form, which StandardAnalyzer lower-cases
    query.add(new Term("contents", word));
}
Check out these links for more working examples,
http://stackoverflow.com/questions/5527868/exact-phrase-search-using-lucene
http://www.avajava.com/tutorials/lessons/how-do-i-query-for-words-near-each-other-with-a-phrase-query.html
http://www.ibm.com/developerworks/java/library/os-apache-lucenesearch/
Hope this helps!
Hello Priya,
I am trying to write a Java program to search for a word in the first page (or paragraph) of a PDF file. Finding the word and the count of its occurrences is enough. Please advise. Thanks in advance.
Hi,
As explained in the post, we convert the content of the whole PDF file to text using PDFBox and then index it. So your first requirement, analyzing the first page or paragraph alone, is not possible with the code in this post. Next, you can very well find the number of times the word occurs in the index if you build your index with content from only the one PDF that is of interest to you.
If you have more than one PDF file, the count will include occurrences of the search term in all the PDF files. The above post is just a sample that shows how to use Lucene to search PDF files. I recommend going through the official documentation to understand which analyzer and QueryParser best suit your requirement.
Thanks,
Priya
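For readers who only need the occurrence count for a single document, a minimal sketch in plain Java might look like the following. It assumes the text has already been extracted from the PDF (for example with PDFTextStripper, as in the post); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; works on already-extracted text.
class WordCountSketch {

    // Counts whole-word, case-insensitive occurrences of `word` in `text`.
    static int countOccurrences(String text, String word) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String extractedText = "Hello World by PDFBox. Hello again, world!";
        System.out.println(countOccurrences(extractedText, "hello"));
    }
}
```

Because of the word boundaries, a search for "box" does not count the "Box" inside "PDFBox".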
I want to search for text across multiple pages of a PDF file. When I try this code, it works for a single page only, so please help me.
Hi
I have multiple PDF files in one folder. The task is that the software will have 2 input boxes
for
Browsing :- this will browse to that folder
name:- name of any person which you want to find (search in pdf)
and then when we will click on search it will check all the pdf available in that folder and then will check the name inside all pdf when it will get it should show the output below...
pdf file name :- first output
pdf file page no.:-second
person name:-which we searched
father's name:- searched person's father's name
sex:-M or F
Age:-
PLEASE HELP...
Yes, I also have the same requirement as discussed above. Please give any solution that fulfills my requirement.
Hi Priya,
My application was returning the error org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EI
PDDocument doc = PDDocument.load(file);
String content = new PDFTextStripper().getText(doc);
Can you help me?
Thanks
Hi luciano,
What is the content you have inside your PDF file? Do you have any text in the PDF files? You might encounter this error when you have only images in your PDF files.
This post extracts text from PDF files, and if it finds no text, you might get this error. Please try with PDF files that have some text content in them.
Thanks,
Priya
Priya, thanks for everything,
I have one more question. I'm trying to ignore accents in the search, that is, to find words regardless of special characters such as accents ("ANDRÉ" equals "ANDRE").
I found the class ICUTokenizer but got the error NoSuchMethodError: com.ibm.icu.text.UnicodeSet.freeze.
http://lucene.apache.org/core/4_2_0/analyzers-icu/index.html
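One way to get accent-insensitive matching without extra libraries is to fold accents out of both the text and the query before indexing and searching. The sketch below uses only java.text.Normalizer from the JDK; the class name is hypothetical. Lucene also ships an ASCIIFoldingFilter that can be plugged into a custom analyzer, which may be a better fit, but that is not shown here.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; uses the JDK, not Lucene or ICU.
class AccentFoldingSketch {

    // Decomposes accented characters (NFD) and then drops the combining marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00C9" is the letter E with an acute accent.
        System.out.println(stripAccents("ANDR\u00C9"));
    }
}
```

The same folding has to be applied to the indexed text and to the query string, otherwise the two sides will not match.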
Hi Priya..
I am into testing. I need to write code to verify a PDF.
I need to search for some words in a PDF that contains data in a tabular format.
Here are my queries:
How do I retrieve the position of a word if it is found in the PDF?
If the word is found, how do I read that entire line?
Please help me.
Thanks in advance..
Ani
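For the "read the entire line" part of this question, a minimal sketch in plain Java could work on the text that PDFBox has already extracted. It assumes the extracted text keeps one table row per line, which depends on the PDF; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; works on already-extracted text.
class LineSearchSketch {

    // Returns every line of `text` that contains `word` as a whole word (case-insensitive).
    static List<String> linesContaining(String text, String word) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (pattern.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String extractedText = "User ID Type\nalice admin\nbob guest";
        System.out.println(linesContaining(extractedText, "bob"));
    }
}
```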
What changes would I have to make for content-based search on the following files:
txt, doc, pdf, xls and csv
If I search for "by", it says the document doesn't contain the keyword.
Hi,
Do you mean to say that searching for the other words (hello/world/pdfbox) works, and only the word 'by' fails, in the sample application I have provided? Or is it not working in your own application, which you implemented by following the above tutorial?
Thanks,
Priya
Sorry for incomplete information...
I downloaded the source file you have given.
I configured the jar files (I got lucene-core-3.6.2, as the link you provided was broken).
When I run the program with hello, world or pdfbox in the queryString,
it gives "The document contains the search keyword",
but when I give by in the queryString,
it gives "The document does not contain the search keyword".
I am using the same pdf you have provided.
PDF when opened by adobe reader shows "Hello World by PDFBox"
Hi Akshit,
I know it's too late for you to find this response useful, but for others who have the same question, here is the reason why searching for the word 'by' returns no result. I have used StandardAnalyzer in this example to index the PDF file's text content. By default, StandardAnalyzer has a set of stop words that are omitted from the index.
You can find a list of all the words that are filtered out by default here, along with a way to turn this behavior off if you wish,
http://stackoverflow.com/questions/4871709/stop-words-in-sitecore
Thanks,
Priya
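To make the stop-word effect concrete, here is a small sketch in plain Java that mimics what happens to the sample text at indexing time. The stop list below is written from memory and is assumed to mirror Lucene's default English stop set; please verify it against your Lucene version. The class name is hypothetical and the tokenizing is far simpler than StandardAnalyzer's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not Lucene's StandardAnalyzer.
class StopWordSketch {

    // Assumption: intended to match Lucene's default English stop words.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
            "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these", "they", "this",
            "to", "was", "will", "with"));

    // Lower-cases the text, splits on non-alphanumerics and drops stop words.
    static List<String> indexableTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // "by" never reaches the index, so a query for it finds nothing.
        System.out.println(indexableTerms("Hello World by PDFBox"));
    }
}
```

If stop words need to be searchable, I believe StandardAnalyzer in Lucene 3.6 also has a constructor that accepts a custom stop-word set; passing an empty set, and using that same analyzer in both Indexer and Searcher, should keep words like "by" in the index.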
Okay thanks.
My project was already complete, but I was still wondering about the answer.
Thanks for the reply.
Hi All, Please find the below requirement and suggest any solution for this.
--> I have a PDF document on my local drive.
--> There is a table in the document and I need to find the exact value under a column name. For example, I have 3 columns: 'User ID', 'Password' and 'Type of User'. I will provide the User ID and I need to get the Type of User for that ID.
Can anyone suggest whether this is possible using VBScript or Java? If yes, please share your thoughts. Thanks in advance.
Does it support Arabic PDF files? I have Arabic PDF files and I want to search for specific words inside them.
Hi,
Yeah, certainly. You have to use ArabicAnalyzer for this. Check this out,
http://lucene.apache.org/core/3_0_3/api/contrib-analyzers/org/apache/lucene/analysis/ar/ArabicAnalyzer.html
http://stackoverflow.com/questions/2938564/lucene-2-2-arabic-analyzer
Thanks,
Priya
Hi Frd,
The above code works fine with a single word like "Hello" or "World". If I try to search for "Hello World", it says it is not available in the document. Please help me resolve this!
Hi,
The above example explains how to search for a single word only. You should use 'PhraseQuery' to do an exact phrase search. There is a lot to learn in Lucene. Go through the official documentation to find out which analyzer and query class suit you best. For a quick solution, refer to this,
http://stackoverflow.com/questions/5527868/exact-phrase-search-using-lucene
Thanks,
Priya
Thanks for your reply.
I went through this link,
'http://stackoverflow.com/questions/9066347/lucene-multi-word-phrases-as-search-terms?rq=1',
and I've made some changes to the source file downloaded from here. It looks like it is working correctly!
I've shared my project with updates Download Link: http://www.mediafire.com/?is7rq3rob400mq4.
Can you please check and tell me whether what I've done is correct?
Hi Priya,
Very nice introductory tutorial! Thanks for putting it up... It really is helpful for someone new to PDFBox and Lucene.
I have the following scenario, which I could not find in the comments before me:
I have a set of PDF documents (say 2000) created using Actuate Reports. There are scattered key-value pairs in every PDF in a format like "customer=1234". On some other page it could be "customer=1456", etc.
I want to parse every PDF and fetch all the customer values from inside my Java program.
Using the code above and modifying it as per my requirement, I think I will be able to get all the occurrences of the "customer=" string and then, through some string processing, take the token after "customer=" and before the next space as the value I want.
My questions are as below:
1) Is this way of getting the value correct? Or is there an option in Lucene that directly fetches the value for a given key, as in my case?
2) My PDF documents will be around 2 to 5 pages long. Will it be OK to parse thousands of PDFs at a time, or will there be a performance issue? Can you suggest a way to improve performance?
3) If the PDF background is white and the string "customer=1234" is also written in a white font (which means the text will be invisible), will PDFBox still be able to fetch the text so that I can search it through Lucene later?
Thanks in advance for your help! Meanwhile, I will try to work with your program to get answers to my questions.
Keep up the good work!
Regards,
Nik
Hi Nik,
First of all, thanks for reading this article.
Lucene is a full text search engine, which provides quick search results when queried against a huge search index. Please post your question at stackoverflow.com after doing proper analysis on your requirements and all possible ways of implementing it.
Thanks,
Priya
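For the string-processing step Nik describes, a regular expression over the extracted text may be enough, without involving Lucene at all. The following is only a sketch under that assumption: it expects the text to have been extracted already (for example with PDFTextStripper), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; works on already-extracted text.
class KeyValueExtractSketch {

    // Collects the token that follows "key=" up to the next whitespace, for every occurrence.
    static List<String> valuesFor(String key, String text) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(key) + "=(\\S+)");
        Matcher matcher = pattern.matcher(text);
        List<String> values = new ArrayList<>();
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String extractedText = "Report customer=1234 page 1\nfooter customer=1456 end";
        System.out.println(valuesFor("customer", extractedText));
    }
}
```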
Thanks PRIYA! I don't have enough words to thank you!
Most welcome..
I indexed files such as pdf, ppt and docs. It displays the files containing a particular word. Now I need to show the line in which that word occurs. Any idea how to do that?
Hi Priya, how do I get the coordinate location of the searched text? How do we use PrintTextLocations or TextPositions or some custom class? Can you please help?
Hi Priya,
I am Karthik. I have been a tester for 7+ years, and now I have been asked to work on elasticsearch (Lucene under the hood). I have spent a few weeks on this, and all that I did was
1) Copied few 100s of XML files into a folder
2) Converted each of them into JSON object and indexed it as Documents
( using http://www.json.org/java/index.html)
3) As part of the elasticsearch mapping (schema of the XML), all the elements of the XML became fields with their corresponding types (String, Long, int, Date, etc.)
Could you please send me an email (karthikbm1809@gmail.com)? I need to ask you about indexing PDF, HTML and XML files the proper way.
Regds
Karthik
Hi Karthik,
I am no elasticsearch expert. All I would suggest is to go through the relevant documentation or get help from the elasticsearch forum to proceed in the right way. Search is always very interesting, and I hope you find it easy once you are done exploring. Good luck!
How do I get the word count in a PDF file?
Thanks a lot..
ReplyDeleteit really helped me.
Keep doing good work like this..
All the best :)
Hi Priya, I am not able to download the file through the given link. It is showing an error. Can you please provide an alternate link to download?
Download link is updated now.
Hi Priya,
Thanks for this very good post.
However, my requirement is a POC on concepts like classifying and indexing documents (PDF, Word, XML, text, etc.) and searching among them. When I use the Lucene library, indexing works with the simple API for PDF and XML files, but when I execute a search the correct result does not come back. Could you please share some thoughts on this?
Is this your website too ? http://geekonjava.blogspot.com/2015/08/search-text-in-pdf-using-java-apache.html I don't see your name as the author here.
This is yet another copycat who has stolen my content. Thanks for letting me know.
Hello Priya,
Thanks for your advice.
I have 8 PDF files and I want to search for a word in all of them, but I want the output to list only the PDF files that contain the word I searched for.
Please advise me how to do it in Java, and if you have any related link for that, please post it here.
Once again thank you.
Hey!
I also want to do a similar thing. Did you get the code for the same?
Nice tutorial to get started with Lucene and PDF box
Hello,
ReplyDeleteIs it possible to find the page number of the string being searched?
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/cmap/CMapParser
ReplyDeleteat org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:534)
How can I fix this error while running the project?
Thanks in advance.
Hello, I don't know Java, but I need to search a PDF file (an electronics topographic diagram) for a list of words (R1, C1, L1, etc.). Can your program be used for this?
This article is very helpful. I followed the steps and got exactly what I wanted! Many many thanks!!