May 09 2008
extract text from a pdf
The other day, one of my developer friends asked me for a trivial function. At least, it looked trivial to start with. The task was simply to read the raw text of a PDF file using C#: no formatting, no image considerations, just a plain dump of the PDF text.
My default approach was to drive acrobat32.exe and read the text from it. However, I realized that requiring an Adobe installation, COM server registration, and other unforeseen grumpy issues could hinder the use of the exe. Also, I wasn’t sure whether the acrobat32.exe that ships with the free Adobe Reader could do the task. So I browsed through some other PDF libraries.
PDFBox for .NET, built using IKVM, seemed apt for the simple task at hand.
Here are the steps to get the PDF text:
- Add references to the IKVM.GNU.Classpath.dll and PDFBox-0.7.3.dll libraries in your project.
- Place the IKVM.Runtime.dll and FontBox-0.1.0-dev.dll libraries in your bin folder. It seems IKVM.Runtime.dll must always be present in the runtime folder, but it need not be referenced. Whether the other DLL is needed depends on the PDF you are trying to read. In my case (the WA DOL driving guide PDF), FontBox-0.1.0-dev.dll had to be present.
- Use the following method to get the text:
    private static string GetPdfText(string pdfFilePath)
    {
        string extractedPdfText = null;
        org.pdfbox.pdmodel.PDDocument pdfDocument = org.pdfbox.pdmodel.PDDocument.load(pdfFilePath);
        if (null != pdfDocument)
        {
            try
            {
                org.pdfbox.util.PDFTextStripper pdfTextStripper = new org.pdfbox.util.PDFTextStripper();
                // Actual extraction of the text.
                extractedPdfText = pdfTextStripper.getText(pdfDocument);
            }
            finally
            {
                // Release the file and buffers held by the document.
                pdfDocument.close();
            }
        }
        return extractedPdfText;
    }
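Since a missing DLL tends to surface only as a runtime error, it can help to check the deployment before calling the method. The following is a minimal sketch, not part of PDFBox or IKVM: the class name PdfBoxDeploymentCheck and the method FindMissingDlls are hypothetical, the DLL names are simply the ones listed in the steps above, and it assumes a console or desktop application whose base directory is the bin folder. FontBox may not be needed for every PDF, so treat a report of it as missing as a hint rather than an error.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: reports which of the DLLs named in the steps above
// are absent from a folder. It does not load or validate the assemblies.
internal static class PdfBoxDeploymentCheck
{
    // File names taken from the steps above; adjust them to your versions.
    private static readonly string[] RequiredDlls =
    {
        "IKVM.GNU.Classpath.dll",
        "PDFBox-0.7.3.dll",
        "IKVM.Runtime.dll",
        "FontBox-0.1.0-dev.dll"
    };

    // Returns the names of the listed DLLs that are not present in the folder.
    public static List<string> FindMissingDlls(string folder)
    {
        List<string> missing = new List<string>();
        foreach (string dll in RequiredDlls)
        {
            if (!File.Exists(Path.Combine(folder, dll)))
            {
                missing.Add(dll);
            }
        }
        return missing;
    }

    private static void Main()
    {
        // Assumption: the executable runs from the bin folder.
        string binFolder = AppDomain.CurrentDomain.BaseDirectory;
        List<string> missing = FindMissingDlls(binFolder);
        if (missing.Count == 0)
        {
            Console.WriteLine("All listed PDFBox/IKVM DLLs were found.");
        }
        else
        {
            Console.WriteLine("Missing from " + binFolder + ": "
                + string.Join(", ", missing.ToArray()));
        }
    }
}
```

Running this once from the same folder as your application gives a quick list of files to copy before you try GetPdfText on a real document.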
This open source PDF library provides a host of PDF manipulation capabilities. If you are looking for PDF document merging, page separation, indexes, bookmark reading, etc., this is the library. My friend was happy with the small method, and I was happy that someone had built a .NET version of this good Java PDF library.
there’s a solution to every problem; given enough time and money..