Sometimes, I write notes to help my future self remember how to do things. I hope some of them are helpful, and please feel free to contact me if you have questions.

See notes grouped by tags or date.

Extract data from a PDF file with Tabula

Kirkham et al. 2006 is a prospective 2-year study of 60 patients with rheumatoid arthritis (RA). It shows that “synovial membrane cytokine mRNA expression is predictive of joint damage progression in RA”. The PDF includes a few tables with data on cytokine measurements and correlations with joint damage. Here, we’ll use Tabula to extract data from tables in the PDF file. Then we’ll make figures with R.

Make heatmaps in R with pheatmap

Here are a few tips for making heatmaps with the pheatmap R package by Raivo Kolde. We’ll use quantile color breaks, so each color represents an equal proportion of the data. We’ll also cluster the data with neatly sorted dendrograms, so it’s easy to see which samples are closely or distantly related.


Build bioinformatics pipelines with Snakemake

Snakemake is a Pythonic variant of GNU Make. Recently, I learned how to use it to build and launch bioinformatics pipelines on an LSF cluster. However, I had trouble understanding the documentation for Snakemake. I like to learn by trying simple examples, so this post will walk you through a very simple pipeline step by step. If you already know how to use Snakemake, then you might be interested to copy my Snakefiles for RNA-seq data analysis here.


Quickly aggregate your data in R with data.table

In genomics data, we often have multiple measurements for each gene. Sometimes we want to aggregate those measurements with the mean, median, or sum. The data.table R package can do this quickly with large datasets.

In this note, we compute the average of multiple measurements for each gene in a gene expression matrix.


Create a quantile-quantile plot with ggplot2

After performing many tests for statistical significance, the next step is to check if any results are more extreme than we would expect by random chance. One way to do this is by comparing the distribution of p-values from our tests to the uniform distribution with a quantile-quantile (QQ) plot. Here’s a function to create such a plot with ggplot2.


How to ssh to a remote server without typing your password

Here are a few tips to use ssh more effectively. Login to your server using public key encryption instead of typing a password. Use the ~/.ssh/config file to create short and memorable aliases for your servers. Also, use aliases to connect through a login server into a work server.