From PDFs to Usable Data

for the International Journalism Festival 2012


Dan Nguyen twitter: @dancow / @propublica
April 26, 2012
Shortlink: http://bit.ly/pdftodata

Note: This guide only covers the better-known, more useful methods. There are dozens of programs and websites you'll find if you do a Google search for "convert PDF to Excel".

But the focus of this guide is on general strategies and insight about the data, so that you can make the best decision for your own situation.

Portable and Printable

Adobe's ubiquitous format. Great for sending around to someone's printer.

Only an appearance of data

But without additional work on your part, a table of data inside a PDF is just as inert as if it were on printed paper.

You can't sort, sum, or sift through it.

Why we care

Before the widespread use of the Web, we filled out paper forms and our databases generated paper printouts.

There's a lot of software to move those "papers" online in PDF format.

Thus, there's lots of data locked up in the PDF format. Most of it isn't hard to get at, if you know some basic tools and strategies.

Extracting the Data

We'll examine some basic techniques and tools to convert PDF data from this [link to original file] [local version]:

[Image: excerpt from the GSK payments list]

Into a usable, sortable spreadsheet (xls):

Two basic types of PDFs

  1. PDFs with actual text
  2. PDFs composed of only images

They each require fundamentally different approaches; image-PDFs are much harder to work with.

Text-based PDFs

[Image: excerpt from the ViiV payments list]

These are created by software that converts text-based formats – such as Microsoft Word documents and spreadsheets – into Adobe's format.

Examples: 1, 2, 3

Image-based PDFs

[Image: a scanned file]

These are typically scanned documents – i.e., photos of documents. To a human, a scanned page looks like it contains real text. But to the computer, it's just a photo.

Examples: 1

Easiest way to tell them apart?

Click and Drag

[Image: a PDF in which lines and words can be highlighted (link)]

[Image: a PDF with nothing to highlight (link)]

Getting actual text-data from image-based PDFs requires all the work of extracting from text-PDFs, plus an extra set of tools to use beforehand. We'll cover this process later.

Extracting Data from Text PDFs

Highlight, copy, and paste

The natural thing to try (example pdf).

[Image: dragging to highlight text in a PDF]

Pros:

  - Requires nothing but the PDF reader you already have; fine for grabbing a handful of values.

Cons:

  - The column layout is usually lost in the paste, so rows and columns come out jumbled.
  - Tedious and error-prone for anything longer than a page or two.

Note: In Adobe Acrobat, you can highlight text and select an option to convert the selection to a table. However, Acrobat costs money and the feature can be inconsistent.

Adobe Acrobat export to XML/HTML

Sometimes the PDF is built with structured XML or HTML. Acrobat and other programs can successfully extract and preserve this structure. But you might need to do some programming to extract it (check out Nokogiri for Ruby and Beautiful Soup for Python).
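
For instance, here's a minimal Python sketch using Beautiful Soup that turns the rows of an exported HTML table into tab-delimited lines (the filename is hypothetical):

    from bs4 import BeautifulSoup

    # Parse an HTML file exported from Acrobat (the filename is hypothetical)
    with open("acrobat_export.html") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Print each table row as one tab-delimited line
    for row in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        print("\t".join(cells))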

[Screenshot: Adobe Acrobat converting a PDF to HTML]

Pros:

  - When the structure is there, it survives the export: rows, columns, and headers keep their relationships.

Cons:

  - Acrobat costs money.
  - Many PDFs carry no useful structure to export.
  - You may need to write code to turn the exported markup into a spreadsheet.

Use a Third Party Service

Several cloud-based services allow you to upload a PDF. The service processes it and sends you a link to the spreadsheet file by email.

Example services include Cometdocs.com and Zamzar.com. Most of these services are free, with a paid "pro" option.

[Screenshot: uploading a file to CometDocs.com]

The service sends you an email with a download link after a few minutes (sometimes longer).

[example result]

[Screenshot: the notification email]

Using Third Party Services

Pros:

  - Free for basic use, with nothing to install.
  - Requires no technical skill beyond uploading a file.

Cons:

  - You wait minutes – sometimes longer – for the result.
  - Each service's conversion algorithm is a black box, and the results can be inconsistent.
  - Uploading may not be appropriate for sensitive documents.

Doing it yourself

  1. Each PDF-creating program constructs its files differently.
  2. Each PDF-translating service has its own algorithm for decoding them.
  3. So the results can be inconsistent.

When every digit and character is critical, you may have to do a little manual work.


Using pdftotext

pdftotext is part of the free Xpdf toolkit.

You run it from the command-line with the -layout option:

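For example, converting a hypothetical pfizer.pdf into pfizer.txt:

    pdftotext -layout pfizer.pdf pfizer.txt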

And you essentially get the text as-is (PDF / output text):

[Images: lines from the Pfizer PDF and the corresponding text output]

Delimiting raw text

So how do you get the raw text output from pdftotext into Excel?

Spreadsheets essentially consist of raw text with some kind of common character – a delimiter – that separates each column.

Common delimiters include:

  - Commas (the "comma-separated values," or .csv, format)
  - Tabs (the "tab-separated values," or .tsv, format)
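
For example, here's a made-up row of data delimited first by commas, then by tabs (where \t stands for the tab character):

    Pfizer,John Smith,500
    Pfizer\tJohn Smith\t500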

The raw text created by pdftotext isn't cleanly delimited: a varying number of spaces separates each column (link to text).


You can either:

  - insert a delimiter into each line by hand, or
  - use find-and-replace with a regular expression to turn those runs of spaces into a single delimiter.

Looking for the pattern

You can easily see that each column is separated by two or more spaces.

The regular expression pattern to find – note that there's a space before the {:

    {2,}

Replace with the delimiter of your choice. Here's a tab:

   \t

The result: delimited text file

Regular expressions work like the find-and-replace in your text editor, but they match patterns instead of exact strings:

Free text-editors that can do this: TextWrangler for Mac, Notepad++ for Windows

Try it out interactively: http://regexr.com?30np5

[interactive link] / delimited text file
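
If you'd rather script the replacement than do it in a text editor, here's a minimal Python sketch (the filenames are hypothetical):

    import re

    # Read the raw output from pdftotext (the filename is hypothetical)
    with open("pfizer.txt") as f:
        raw = f.read()

    # Collapse every run of two-or-more spaces into a single tab
    delimited = re.sub(r" {2,}", "\t", raw)

    with open("pfizer.tsv", "w") as f:
        f.write(delimited)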

Image-based PDFs

Not text, images of text

When a PDF is composed of images, what looks like text to you isn't text to the computer. Programs such as pdftotext won't work, because there is no text to extract.

Therefore, an image-PDF of data tables doesn't contain data – just as this cat photo doesn't actually contain the word "cat":

Optical character recognition

Computers can be trained to recognize faces. They can also be trained to recognize lettering:

[Images: an easy lettering sample and a hard one]

Its accuracy depends on the quality of the image and the program's training.


Tesseract

This is a free, command-line OCR program maintained by Google. It works well out-of-the-box and – with some work – can be trained for specific character sets.

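A typical run looks like this – the image filename is hypothetical, and Tesseract writes its result to page.txt:

    tesseract page.tif page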

Tesseract sample results

A sample page from Edward Tufte's Introduction to Data Analysis (download image)
The Tesseract results: mostly accurate (download text)

Tesseract, pros and cons

Pros:

  - Free.
  - Runs from the command line, so it's easy to script across hundreds of files.
  - Can be trained for specific character sets.

Cons:

  - Command-line only; there's no graphical interface.
  - Accuracy drops with poor-quality scans.
  - It produces text, not table structure – the delimiting work still remains.

Adobe Acrobat

This commercial program has OCR built into it.

Pros:

  - OCR is built into a familiar graphical program – no command line required.

Cons:

  - Acrobat costs money.
  - As with all OCR, accuracy depends on the quality of the scan.

Google Docs

Google Docs can perform OCR on uploaded images and PDFs.

Pros:

  - Free; all you need is a browser and a Google account.

Cons:

  - Limits on upload size and page count.
  - Table layout generally isn't preserved.

Amazon's Mechanical Turk

Have hundreds – even thousands – of humans transcribe your images. Amazon provides a way for you to send micro-tasks (e.g. "Type out the text in the third row of this table") to users looking for easy, quick jobs.

[Screenshot: Amazon's Mechanical Turk]

Pros:

  - Humans can read scans that defeat OCR software.
  - The work can be split among many workers at once.

Cons:

  - Each task costs money.
  - You have to design the tasks carefully and check the results for quality.

Programming

Every tool covered so far works mostly as-is, no programming required.

If you know some programming, it can be helpful to "glue" together a bunch of repetitive tasks.

Example: the NYPD crime-data scrape walked through in the next section.

Programming Example: NYPD Crime Data

Though New York City is famed for its use of statistics in fighting crime, the department publishes very little data on its website.

The data format

[Image: only the highlighted info is real "data"]

Doing this by hand – downloading every PDF file and entering even just the few numbers each contains, every single week – is prone to error and brain atrophy.

A programming script can automate every step, turning these PDFs into data in just a minute. ScraperWiki has a working example.

Strategy for scraping NYPD COMPSTAT PDFs (Step 1/4)

Download each PDF link on the NYPD stats homepage using simple web scraping (Nokogiri for Ruby, Beautiful Soup for Python).
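
Here's a minimal Python sketch of this step using Beautiful Soup (the index URL below is a placeholder, not the NYPD's actual address):

    import urllib.parse
    import requests
    from bs4 import BeautifulSoup

    INDEX_URL = "http://example.com/nypd/crime_statistics.shtml"  # placeholder URL

    # Find every link to a PDF on the stats page and download it
    soup = BeautifulSoup(requests.get(INDEX_URL).text, "html.parser")
    for link in soup.find_all("a", href=True):
        if link["href"].lower().endswith(".pdf"):
            pdf_url = urllib.parse.urljoin(INDEX_URL, link["href"])
            with open(pdf_url.rsplit("/", 1)[-1], "wb") as f:
                f.write(requests.get(pdf_url).content)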

[Screenshot: the NYPD stats homepage]

Strategy for scraping NYPD COMPSTAT PDFs (Step 2/4)

Convert each PDF to text; pdftotext is one option.
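
Continuing the Python sketch, each downloaded file can be run through pdftotext (the filenames are hypothetical, and pdftotext must be installed):

    import subprocess

    # Run pdftotext with the -layout option on one downloaded report
    subprocess.run(["pdftotext", "-layout", "report.pdf", "report.txt"], check=True)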

Sample PDF report / text output


Strategy for scraping NYPD COMPSTAT PDFs (Step 3/4)

Use text-matching (regular expressions) to capture data points. (interactive link)

Example: Robbery\s{2,}\d+

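In Python, capturing the number that follows "Robbery" might look like this (a sketch, assuming the text file from the previous step):

    import re

    with open("report.txt") as f:
        text = f.read()

    # Grab the digits that follow "Robbery" and two-or-more whitespace characters
    match = re.search(r"Robbery\s{2,}(\d+)", text)
    if match:
        robberies = int(match.group(1))
        print(robberies)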

Strategy for scraping NYPD COMPSTAT PDFs (Step 4/4)

Save as spreadsheet (comma/tab-delimited) format.
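
A minimal sketch of this step with Python's csv module – the column names and values here are made up:

    import csv

    # Append one week's numbers to a comma-delimited file
    with open("compstat.csv", "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["2012-04-22", "Robbery", 450])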

Now we can sort/search and look at statistics over time.

[Screenshot: sample spreadsheet]

Again, check out ScraperWiki's recipe for the NYPD data.

The Big Picture

Data is sometimes found only in PDF format.


Slide format courtesy of Google's html5slides