From PDFs to Usable Data

for the International Journalism Festival 2012


Dan Nguyen twitter: @dancow / @propublica
April 26, 2012
Shortlink: http://bit.ly/pdftodata

Note: This guide only covers the better-known, more useful methods. There are dozens of programs and websites you'll find if you do a Google search for "convert PDF to Excel".

But the focus of this guide is on general strategies and insight about the data, so that you can make the best decision for your own situation.

Portable and Printable

Adobe's ubiquitous format. Great for sending around to someone's printer.

Only an appearance of data

But without additional work on your part, a table of data inside a PDF is just as inert as if it were on printed paper.

You can't sort, sum, or sift through it.

Why we care

Before the widespread use of the Web, we filled out paper forms and our databases generated paper printouts.

There's a lot of software to move those "papers" online in PDF format.

Thus, there's lots of data locked up in the PDF format. Most of it isn't hard to get at, if you know some basic tools and strategies.

Extracting the Data

We'll examine some basic techniques and tools to convert PDF data from this [link to original file] [local version]:

[Image: excerpt from the GSK payments list]

Into a usable, sortable spreadsheet (xls):

Two basic types of PDFs

  1. PDFs with actual text
  2. PDFs composed of only images

They each require fundamentally different approaches; image-PDFs are much harder to work with.

Text-based PDFs

[Image: excerpt from the ViiV payments list]

These are created by software that converts text-based formats – such as Microsoft Word documents and spreadsheets – into Adobe's format.

Examples: 1, 2, 3

Image-based PDFs

[Image: a scanned file]

These are typically scanned documents – i.e., photos of documents. To a human, a scanned page looks like it contains real text. But to the computer, it's just a photo.

Examples: 1

Easiest way to tell them apart?

Click and Drag

[Image: a PDF in which lines and words can be highlighted (link)]

[Image: a PDF with nothing to highlight (link)]

Getting actual text-data from image-based PDFs requires all the work of extracting from text-PDFs, plus an extra set of tools to use beforehand. We'll cover this process later.

Extracting Data from Text PDFs

Highlight, copy, and paste

The natural thing to try (example pdf).

[Image: dragging to highlight text in a PDF]

Pros:

  - Requires nothing but the PDF reader you already have; fine for grabbing a handful of values.

Cons:

  - The column layout is usually lost in the paste, so rows and columns come out jumbled.
  - Tedious and error-prone for anything longer than a page or two.

Note: In Adobe Acrobat, you can highlight text and select an option to convert the selection to a table. However, Acrobat costs money and the feature can be inconsistent.

Adobe Acrobat export to XML/HTML

Sometimes the PDF is built with structured XML or HTML. Acrobat and other programs can successfully extract and preserve this structure. But you might need to do some programming to extract it (check out Nokogiri for Ruby and Beautiful Soup for Python).
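
For instance, here's a minimal Python sketch using Beautiful Soup that turns the rows of an exported HTML table into tab-delimited lines (the filename is hypothetical):

    from bs4 import BeautifulSoup

    # Parse an HTML file exported from Acrobat (the filename is hypothetical)
    with open("acrobat_export.html") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Print each table row as one tab-delimited line
    for row in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        print("\t".join(cells))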

[Screenshot: Adobe Acrobat converting a PDF to HTML]

Pros:

  - When the structure is there, it survives the export: rows, columns, and headers keep their relationships.

Cons:

  - Acrobat costs money.
  - Many PDFs carry no useful structure to export.
  - You may need to write code to turn the exported markup into a spreadsheet.

Use a Third Party Service

Several cloud-based services allow you to upload a PDF. The service processes it and sends you a link to the spreadsheet file by email.

Example services include Cometdocs.com and Zamzar.com. Most of these services are free, with a paid "pro" option.

[Screenshot: uploading a file to CometDocs.com]

The service sends you an email with a download link after a few minutes (sometimes longer).

[example result]

[Screenshot: the notification email]

Using Third Party Services

Pros:

  - Free for basic use, with nothing to install.
  - Requires no technical skill beyond uploading a file.

Cons:

  - You wait minutes – sometimes longer – for the result.
  - Each service's conversion algorithm is a black box, and the results can be inconsistent.
  - Uploading may not be appropriate for sensitive documents.

Doing it yourself

  1. Each PDF-creating program constructs its files differently.
  2. Each PDF-translating service has its own algorithm for decoding them.
  3. So the results can be inconsistent.

When every digit and character is critical, you may have to do a little manual work.


Using pdftotext

pdftotext is part of the free Xpdf toolkit.

You run it from the command-line with the -layout option:

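For example, converting a hypothetical pfizer.pdf into pfizer.txt:

    pdftotext -layout pfizer.pdf pfizer.txt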

And you essentially get the text as-is (PDF / output text):

[Images: lines from the Pfizer PDF and the corresponding text output]

Delimiting raw text

So how do you get the raw text output from pdftotext into Excel?

Spreadsheets essentially consist of raw text with some kind of common character – a delimiter – that separates each column.

Common delimiters include:

  - Commas (the "comma-separated values," or .csv, format)
  - Tabs (the "tab-separated values," or .tsv, format)
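
For example, here's a made-up row of data delimited first by commas, then by tabs (where \t stands for the tab character):

    Pfizer,John Smith,500
    Pfizer\tJohn Smith\t500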

The raw text created by pdftotext isn't cleanly delimited: a varying number of spaces separates each column (link to text).


You can either:

  - insert a delimiter into each line by hand, or
  - use find-and-replace with a regular expression to turn those runs of spaces into a single delimiter.

Looking for the pattern

You can easily see that each column is separated by two or more spaces.

The regular expression pattern to find – note that there's a space before the {:

    {2,}

Replace with the delimiter of your choice. Here's a tab:

   \t

The result: delimited text file

Regular expressions work like the find-and-replace in your text editor, but they match patterns instead of exact strings:

Free text-editors that can do this: TextWrangler for Mac, Notepad++ for Windows

Try it out interactively: http://regexr.com?30np5

[interactive link] / delimited text file
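
If you'd rather script the replacement than do it in a text editor, here's a minimal Python sketch (the filenames are hypothetical):

    import re

    # Read the raw output from pdftotext (the filename is hypothetical)
    with open("pfizer.txt") as f:
        raw = f.read()

    # Collapse every run of two-or-more spaces into a single tab
    delimited = re.sub(r" {2,}", "\t", raw)

    with open("pfizer.tsv", "w") as f:
        f.write(delimited)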

Image-based PDFs

Not text, images of text

When a PDF is composed of images, what looks like text to you isn't text to the computer. Programs such as pdftotext won't work, because there is no text to extract.

Therefore, an image-PDF of data tables doesn't contain data – just as this cat photo doesn't actually contain the word "cat":

Optical character recognition

Computers can be trained to recognize faces. They can also be trained to recognize lettering:

[Images: an easy lettering sample and a hard one]

Its accuracy depends on the quality of the image and the program's training.


Tesseract

This is a free, command-line OCR program maintained by Google. It works well out-of-the-box and – with some work – can be trained for specific character sets.

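A typical run looks like this – the image filename is hypothetical, and Tesseract writes its result to page.txt:

    tesseract page.tif page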

Tesseract sample results

A sample page from Edward Tufte's Introduction to Data Analysis (download image)
The Tesseract results: mostly accurate (download text)

Tesseract, pros and cons

Pros:

  - Free.
  - Runs from the command line, so it's easy to script across hundreds of files.
  - Can be trained for specific character sets.

Cons:

  - Command-line only; there's no graphical interface.
  - Accuracy drops with poor-quality scans.
  - It produces text, not table structure – the delimiting work still remains.

Adobe Acrobat

This commercial program has OCR built into it.

Pros:

  - OCR is built into a familiar graphical program – no command line required.

Cons:

  - Acrobat costs money.
  - As with all OCR, accuracy depends on the quality of the scan.

Google Docs

Google Docs can perform OCR on uploaded images and PDFs.

Pros:

  - Free; all you need is a browser and a Google account.

Cons:

  - Limits on upload size and page count.
  - Table layout generally isn't preserved.

Amazon's Mechanical Turk

Have hundreds – even thousands – of humans transcribe your images. Amazon provides a way for you to send micro-tasks (e.g. "Type out the text in the third row of this table") to users looking for easy, quick jobs.

[Screenshot: Amazon's Mechanical Turk]

Pros:

  - Humans can read scans that defeat OCR software.
  - The work can be split among many workers at once.

Cons:

  - Each task costs money.
  - You have to design the tasks carefully and check the results for quality.

Programming

Every tool covered so far works mostly as-is, no programming required.

If you know some programming, it can be helpful to "glue" together a bunch of repetitive tasks.

Example: the NYPD crime-data scrape walked through in the next section.

Programming Example: NYPD Crime Data

Though New York City is famed for its use of statistics in fighting crime, the department publishes very little data on its website.

The data format

[Image: only the highlighted info is real "data"]

Doing this by hand – downloading every PDF file and entering even just the few numbers each contains, every single week – is prone to error and brain atrophy.

A programming script can automate every step, turning these PDFs into data in just a minute. ScraperWiki has a working example.

Strategy for scraping NYPD COMPSTAT PDFs (Step 1/4)

Download each PDF link on the NYPD stats homepage using simple web scraping (Nokogiri for Ruby, Beautiful Soup for Python).
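
Here's a minimal Python sketch of this step using Beautiful Soup (the index URL below is a placeholder, not the NYPD's actual address):

    import urllib.parse
    import requests
    from bs4 import BeautifulSoup

    INDEX_URL = "http://example.com/nypd/crime_statistics.shtml"  # placeholder URL

    # Find every link to a PDF on the stats page and download it
    soup = BeautifulSoup(requests.get(INDEX_URL).text, "html.parser")
    for link in soup.find_all("a", href=True):
        if link["href"].lower().endswith(".pdf"):
            pdf_url = urllib.parse.urljoin(INDEX_URL, link["href"])
            with open(pdf_url.rsplit("/", 1)[-1], "wb") as f:
                f.write(requests.get(pdf_url).content)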

[Screenshot: the NYPD stats homepage]

Strategy for scraping NYPD COMPSTAT PDFs (Step 2/4)

Convert each PDF to text; pdftotext is one option.
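
Continuing the Python sketch, each downloaded file can be run through pdftotext (the filenames are hypothetical, and pdftotext must be installed):

    import subprocess

    # Run pdftotext with the -layout option on one downloaded report
    subprocess.run(["pdftotext", "-layout", "report.pdf", "report.txt"], check=True)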

Sample PDF report / text output


Strategy for scraping NYPD COMPSTAT PDFs (Step 3/4)

Use text-matching (regular expressions) to capture data points. (interactive link)

Example: Robbery\s{2,}\d+

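In Python, capturing the number that follows "Robbery" might look like this (a sketch, assuming the text file from the previous step):

    import re

    with open("report.txt") as f:
        text = f.read()

    # Grab the digits that follow "Robbery" and two-or-more whitespace characters
    match = re.search(r"Robbery\s{2,}(\d+)", text)
    if match:
        robberies = int(match.group(1))
        print(robberies)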

Strategy for scraping NYPD COMPSTAT PDFs (Step 4/4)

Save as spreadsheet (comma/tab-delimited) format.
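
A minimal sketch of this step with Python's csv module – the column names and values here are made up:

    import csv

    # Append one week's numbers to a comma-delimited file
    with open("compstat.csv", "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["2012-04-22", "Robbery", 450])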

Now we can sort/search and look at statistics over time.

[Screenshot: sample spreadsheet]

Again, check out ScraperWiki's recipe for the NYPD data.

The Big Picture

Data is sometimes found only in PDF format.


Slide format courtesy of Google's html5slides