Extract data from a PDF file with Tabula

2018-12-29

#R, #Data, #Rheumatoid-arthritis, #Tutorials

Kirkham et al. 2006 is a prospective 2-year study of 60 patients with rheumatoid arthritis (RA). It shows that “synovial membrane cytokine mRNA expression is predictive of joint damage progression in RA”. The PDF includes a few tables with data on cytokine measurements and correlations with joint damage. Here, we’ll use Tabula to extract data from tables in the PDF file. Then we’ll make figures with R.

🐎 Tabula in action#

Download Tabula#

Go to tabula.technology and download the version for your operating system.

Mark each table with your mouse#

When you launch Tabula, you will see a new web site hosted at 127.0.0.1:8080. Open a PDF file to get started. With your mouse, click and drag to make a selection over each table in the PDF file. Don’t include extra things like headers or notes below the table, because they usually won’t be recognized correctly.

Export the data as a CSV or script#

You can export the data as a comma-separated-values (CSV) file, tab-separated-values (TSV) file, JSON, a zip file with multiple CSV files, or even as a shell script you can run from the command line.

For example, here is the shell script that I got by selecting Script as the export format:

java -jar tabula-java.jar  -a 118.958,100.598,327.803,510.638 -p 4 "$1"

With this script, we don’t need to launch the Tabula app. We can run the jar file from the command line, giving the coordinates of the table with -a and the page number with -p. The "$1" argument represents the name of the PDF file.

See tabula-java for more details about the Java command line application.