May 09 2008

extract text from a pdf

Published by Raja Nadar at 4:39 am under .net, c#, ikvm, pdfbox

The other day, one of my developer friends asked me for a trivial function. At least it looked trivial to start with. The task was simply to read the raw text of a PDF file using C#, in one small function. No formatting, no image considerations, just a plain dump of the PDF text.

My default approach was to use acrobat32.exe to read the text. However, I realized that requiring an Adobe installation, COM server registration, and other unforeseen issues might hinder the use of the exe. Also, I wasn’t sure whether the acrobat32.exe that ships with the free Adobe Reader could do the task, so I browsed through some other PDF libraries.

PDFBox for .NET, built using IKVM, seemed apt for the simple task at hand.

Here are the steps to get the PDF text:

  • Add references to the IKVM.GNU.Classpath.dll and PDFBox-0.7.3.dll libraries in your project.
  • Place the IKVM.Runtime.dll and FontBox-0.1.0-dev.dll libraries in your bin folder. IKVM.Runtime.dll apparently must always be present in the runtime folder, though it need not be referenced. Whether the other DLL is needed depends on the PDF you are trying to read; in my case (the WA DOL driving guide PDF), FontBox-0.1.0-dev.dll had to be present.
  • Use the following method to get the text: 

 

    private static string GetPdfText(string pdfFilePath)
    {
        string extractedPdfText = null;

        org.pdfbox.pdmodel.PDDocument pdfDocument = org.pdfbox.pdmodel.PDDocument.load(pdfFilePath);

        if (null != pdfDocument)
        {
            try
            {
                org.pdfbox.util.PDFTextStripper pdfTextStripper = new org.pdfbox.util.PDFTextStripper();

                // actual extraction of the text.
                extractedPdfText = pdfTextStripper.getText(pdfDocument);
            }
            finally
            {
                // close the document to release the underlying file.
                pdfDocument.close();
            }
        }

        return extractedPdfText;
    }
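
If only part of a document is needed, the same approach can be narrowed to a page range. The following is a minimal console sketch, not code from the original project: the `PdfTextDemo` class name, the command-line handling, and the choice of pages 1 to 2 are assumptions made for illustration. It relies on the `setStartPage` and `setEndPage` setters of `PDFTextStripper` and on the same DLL references described in the steps above.

```csharp
using System;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

// Hypothetical demo class; the name and argument handling are illustrative only.
internal static class PdfTextDemo
{
    private static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: PdfTextDemo <file.pdf>");
            return;
        }

        PDDocument pdfDocument = PDDocument.load(args[0]);
        try
        {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();

            // Assumption: only the first two pages are of interest (page numbers start at 1).
            pdfTextStripper.setStartPage(1);
            pdfTextStripper.setEndPage(2);

            // Dump the extracted text of the selected pages to the console.
            Console.WriteLine(pdfTextStripper.getText(pdfDocument));
        }
        finally
        {
            // Close the document to release the underlying file.
            pdfDocument.close();
        }
    }
}
```

Leaving out the two setter calls should extract the whole document, which matches what the method above does.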

 

This open source PDF library provides a host of PDF manipulation capabilities. If you are looking for PDF document merging, page separation, indexes, bookmark reading, etc., this is the library. My friend was happy with the small method, and I was happy that someone had built a .NET version of this good Java PDF library.

there’s a solution to every problem; given enough time and money..

2 Responses to “extract text from a pdf”

  1. Abhijeet Maharana on 10 May 2008 at 10:43 am

    Isn’t the bin folder cleared and repopulated every time you build the project? As such, the dlls will be gone on the second build. I think the dlls should be placed in another folder and their “build action” property modified to have them copied to the output folder on each build.

    What do you think?

  2. Raja Nadar on 10 May 2008 at 1:04 pm

    you are partially correct. i believe the bin folder is not cleared completely, at least by VS. it doesn’t delete any external dlls, which are not generated by it.

    in any case, what you mention should be the correct dev time behavior (put the dlls in an external folder and copy them in the build action). the deployment time behavior is to obviously have your script/msi take care of the copying process.
