PDFBox is an open source project released under the BSD license. It is a pure Java library that supports the following important features:

  • Extract text from PDF documents.
  • Encrypt/decrypt PDF documents.
  • Import/export form data.
  • Overlay one PDF document on top of another.
  • Split a large PDF document into multiple smaller documents.
  • Append to an existing PDF document.
  • Integrate with Jakarta Lucene.

The PDFBox API is built around two packages:

  1. org.pdfbox.cos - represents a PDF document as a collection of basic object types.
  2. org.pdfbox.pdmodel - encapsulates the COS model and provides a high-level API.

The following example extracts the text of a range of pages from a PDF file:

import java.io.IOException;
import java.io.StringWriter;
 
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
 
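public class PDFExtractor {

    /**
     * Extracts the text of pages startPage through endPage
     * (1-based, inclusive) from the given PDF file.
     */
    public String getPdfText(String fileName, int startPage, int endPage) throws IOException {
        StringWriter text = new StringWriter();
        PDDocument doc = null;

        try {
            doc = PDDocument.load(fileName);

            // Restrict extraction to the requested page range.
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(startPage);
            stripper.setEndPage(endPage);

            // Write the extracted text into the in-memory buffer.
            stripper.writeText(doc, text);
        } finally {
            // Always release the document's resources, even on failure.
            if (doc != null) {
                doc.close();
            }
        }

        return text.toString();
    }
}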
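
The page arguments of getPdfText go to the stripper unchecked, so it may be worth validating them first. The sketch below is a hypothetical helper: PageRange and its clamp method are not part of PDFBox. It assumes page numbers are 1-based and inclusive, which is how PDFTextStripper treats its start and end pages, and that the caller already knows the document's page count.

```java
// Hypothetical helper, not part of PDFBox: clamps a requested page range
// to the pages that actually exist in a document.
public class PageRange {
    public final int start; // first page to extract, 1-based
    public final int end;   // last page to extract, inclusive

    private PageRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Clamps a requested 1-based, inclusive range to [1, pageCount].
     * Rejects empty documents and reversed ranges instead of guessing.
     */
    public static PageRange clamp(int requestedStart, int requestedEnd, int pageCount) {
        if (pageCount < 1) {
            throw new IllegalArgumentException("document has no pages");
        }
        if (requestedStart > requestedEnd) {
            throw new IllegalArgumentException("start page is after end page");
        }
        int start = Math.max(1, Math.min(requestedStart, pageCount));
        int end = Math.max(start, Math.min(requestedEnd, pageCount));
        return new PageRange(start, end);
    }
}
```

A caller could compute PageRange.clamp(requestedStart, requestedEnd, pageCount) using the page count reported by the loaded document, then pass the resulting start and end values to the stripper. Clamping is only one possible policy; throwing on any out-of-range request would be equally reasonable.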