Extracting Data from PDFs Goes Big

February 12, 2014
FOXITBLOG

Extracting useful information from PDFs can be a challenge when you’re dealing with a gigantic number of PDF documents, which is why the Sunlight PDF Liberation Hackathon took place. Despite what its name might suggest, the hackathon was not about breaking into anyone’s private database of PDF documents. It was dedicated to improving tools for PDF extraction.

Why the need? There are many organizations, including public interest groups, that want to search PDF documents en masse.

Everyday Examples of Extracting Data from PDFs

For example, one of the Sunlight Foundation’s challenges centered on the financial performance of the nation’s major cities. Most large US cities publish Comprehensive Annual Financial Reports (CAFRs) in the form of PDFs. These documents contain a large set of audited financial statements with footnotes. The challenge was to extract a single statement, a ten-year history of revenues and expenditures, from the latest CAFR for four cities (Chicago, New York, San Francisco, and Washington DC) participating in the hackathon. The results enable comparison of revenue sources and spending priorities across cities over a number of years, an obvious benefit to local, state, and national governmental agencies, not to mention taxpayers.

As another example, members of the House of Representatives file a yearly report on their personal finances. Though this report is often submitted electronically, it is only made available in PDF form on the Clerk of the House’s website. The challenge was to find a reliable and sustainable way to extract the information entered on a form whose layout shifts with downloads and content.

Sunlight’s PDF Liberation Hackathon aimed to tackle real-world PDF data extraction problems like these, bringing coders together to add features, extensions, and plugins to existing PDF extraction frameworks and make them more flexible, useful, and sustainable.

More Information on How to Extract Content from PDF

Developers interested in furthering the research may want to take a look at the Foxit Embedded PDF Software Development Kit (SDK).

The industry-leading PDF SDK is aimed at developers, device manufacturers, and telecom carriers who support PDF applications. It provides powerful, standard-compliant PDF technology to securely display, search, and annotate PDF documents and to fill PDF forms. Developers can use the SDK to search for specific text in PDF documents and extract the content, then parse and save the extracted text. Click here for more information on the SDK.
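To make the search, extract, parse, and save workflow concrete, here is a minimal Python sketch. It does not use the Foxit SDK or any real PDF library. It assumes the text of each page has already been pulled out of the PDF by some extraction tool, and the function names (find_matches, save_matches) and the sample page text are made up for illustration.

```python
import csv
import io
import re


def find_matches(pages, term, context=30):
    """Return (page_number, snippet) for each case-insensitive hit of term.

    `pages` is a list of strings, one per page, assumed to come from
    whatever PDF text extractor you use (hypothetical input).
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = []
    for page_number, text in enumerate(pages, start=1):
        for match in pattern.finditer(text):
            # Keep a little surrounding text so each hit is readable alone.
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            # Collapse line breaks and runs of spaces left by extraction.
            snippet = " ".join(text[start:end].split())
            hits.append((page_number, snippet))
    return hits


def save_matches(hits, out):
    """Write the hits as CSV to any writable text stream."""
    writer = csv.writer(out)
    writer.writerow(["page", "snippet"])
    writer.writerows(hits)


# Made-up page text standing in for real extractor output.
sample_pages = [
    "Total revenues increased over the prior year.",
    "Expenditures were flat. Revenues by source are listed below.",
]

hits = find_matches(sample_pages, "revenues")
buffer = io.StringIO()
save_matches(hits, buffer)
```

Writing to an in-memory buffer keeps the sketch self-contained. In practice you would pass an open file instead, and you would likely need extra cleanup for the footnotes and shifting layouts described above.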
