How to parse text from a PDF file using Apache PDFBox in Java

There are many applications in which you need to read a PDF file and extract its text. Maybe you want to process that text to count how many times a certain word appears, or to visualize it as a word cloud, as in the example Visualize a word cloud in R.

This Java code example reads a PDF file and writes a txt file containing the text of the PDF file.

Now you have to prepare the Java client that will parse the PDF file. Make a Java class inside a project package in Eclipse.

The steps you have to go through are:

0- Create a Java class. Download the jar files from the Apache PDFBox site and add them to the build path of your Java project in Eclipse. The jar files you have to download are: pdfbox-app-1.8.9.jar, preflight-app-1.8.9.jar, pdfbox-1.8.9.jar, fontbox-1.8.9.jar, jempbox-1.8.9.jar, preflight-1.8.9.jar and xmpbox-1.8.9.jar

1- Make a method that will read the PDF file and parse it

2- Write the text to a txt file.
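Step 2 needs nothing from PDFBox, so it can be tried on its own first. The following is only a minimal sketch using the Java standard library; the class name TextFileWriter, its two methods and the file name example-output.txt are hypothetical names chosen for illustration, not part of the project described here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, used only to illustrate step 2.
class TextFileWriter {

    // Writes the text to the given file as UTF-8, replacing any existing content.
    static void writeText(String text, String fileName) throws IOException {
        Files.write(Paths.get(fileName), text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the whole file back as one UTF-8 string, to check what was written.
    static String readText(String fileName) throws IOException {
        return new String(Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder text standing in for what the PDF parser would return.
        writeText("Text extracted from a PDF", "example-output.txt");
        System.out.println(readText("example-output.txt"));
    }
}
```

Fixing the encoding to UTF-8 is a choice made for this sketch, so that the output does not depend on the platform default.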

So let's make a Java class with a main method as in the following code. The comments in the code explain each piece. You should also find a PDF file containing text and put it in the Java project folder from which you execute this code. Mine is called camera-ready.pdf

    package main2pdf;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.PrintWriter;

    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentInformation;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class Main2PDFParsing {
        PDFParser p;
        String p2text;
        PDFTextStripper pdfStripper;
        PDDocument pdDoc;
        COSDocument cosDoc;
        PDDocumentInformation pdDocInfo;

        // Read the PDF file and return its text, or null if something fails
        String pdftoText(String fileName) {
            System.out.println("Reading text from PDF file " + fileName + "....");
            File f = new File(fileName);
            try {
                p = new PDFParser(new FileInputStream(f));
            } catch (Exception e) {
                System.out.println("Cannot open PDF Parser.");
                return null;
            }
            try {
                // The parser must run before the document can be requested
                p.parse();
                cosDoc = p.getDocument();
                pdfStripper = new PDFTextStripper();
                pdDoc = new PDDocument(cosDoc);
                p2text = pdfStripper.getText(pdDoc);
            } catch (Exception e) {
                System.out.println("Exception occurred while parsing the PDF Document.");
                return null;
            } finally {
                // Release the document whether or not parsing succeeded
                try {
                    if (cosDoc != null) cosDoc.close();
                    if (pdDoc != null) pdDoc.close();
                } catch (Exception e1) {
                    System.out.println("Could not close the PDF Document.");
                }
            }
            return p2text;
        }

        // Write the parsed text from PDF to a file
        void writeTexttoFile(String pdf2text, String fileName) throws Exception {
            System.out.println("Writing PDF to text file " + fileName + "..");
            try (PrintWriter out = new PrintWriter(fileName)) {
                out.print(pdf2text);
            }
        }

        public static void main(String args[]) throws Exception {
            Main2PDFParsing pdfObj = new Main2PDFParsing();
            String pdfToText = pdfObj.pdftoText("camera-ready.pdf");
            if (pdfToText == null) {
                return; // the reason was already printed by pdftoText
            }
            System.out.println("The text parsed from the PDF file.." + pdfToText);
            pdfObj.writeTexttoFile(pdfToText, "PDFFile2TXT.txt");
        }
    }

If everything goes well, you should see a txt file called PDFFile2TXT.txt in your project folder, generated by the code. It contains all the text that Java found in the PDF file.

As said previously, you can visualize the text as a word cloud in R, for example, and get a synthetic description of that text: Visualize a word cloud in R
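A word cloud is driven by how often each word occurs, and that count can also be computed directly in Java. The following is only a rough sketch using the standard library; the class name WordFrequency and the method countWords are hypothetical. It lowercases everything and splits on any character that is not a letter or digit, which is a simplification and not how the R example necessarily tokenizes text.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, used only to illustrate word counting.
class WordFrequency {

    // Counts how often each word occurs, ignoring case and punctuation.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on every run of characters that are neither letters nor digits
        for (String word : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue; // split yields an empty first piece if the text starts with a separator
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder sentence; in practice pass the text read from PDFFile2TXT.txt
        Map<String, Integer> counts = countWords("The cat saw the other cat.");
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

A TreeMap keeps the words in alphabetical order, which makes the printed list easy to scan; a HashMap would work as well if order does not matter.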

If you missed something in putting the pieces together, you can download the working project from GitHub here in this link.

In the analyzing and visualizing sections you will find interesting ideas on how to analyse this data and visualize it to get insights and useful information.
