# Build bioinformatics pipelines with Snakemake

Snakemake is a Pythonic variant of GNU Make. Recently, I learned how to use it to build and launch bioinformatics pipelines on an LSF cluster. However, I had trouble understanding the documentation for Snakemake. I like to learn by trying simple examples, so this post will walk you through a very simple pipeline step by step. If you already know how to use Snakemake, then you might be interested to copy my Snakefiles for RNA-seq data analysis here.

# Compare distributions with box plots, not bar plots

In many scientific journals, authors use bar plots to compare two or more distributions. Often, the error bar is only present for the upper limit and not for the lower limit. Sometimes, the bar for the control group has no error bars due to data normalization. Here, I simulate a small experiment to illustrate why this normalization is problematic and why box plots are better than bar plots for comparing two distributions.

# Determine if a transcription factor is bound with CENTIPEDE

I wrote a practical tutorial for how to use CENTIPEDE to determine if a transcription factor is bound to a site in the genome. The tutorial explains how to prepare appropriate input data and how to run the analysis. Please get in touch if you have any comments or suggestions. In the future, I would like to incorporate the code from the tutorial into the CENTIPEDE R package.

# Print bigWig data for each region in a BED file

I wrote a Bash script to call bigWigToBedGraph for each region in a BED file. You can quickly take a subset of bigWig data for regions of interest. In my particular case, I needed to get phastCons conservation scores for putative transcription factor binding sites.

# Run Picard tools and collate multiple metrics files

Picard is a set of Java command line tools for manipulating high-throughput sequencing (HTS) data files such as BAM and VCF. I needed to check the quality of thousands of BAM files, so I created a Bash script called picardmetrics. It runs 10 of the Picard tools on a BAM file and easily collates all of the generated metrics files into a single table. I also include utility scripts for generating the reference files required for Picard.

# Get transcription factor target genes

I made a data package with human transcription factor target genes for use in R. It is a collection of data from three sources: TRED, ITFP, and ENCODE. I use them to test if the targets of a transcription factor are differentially expressed in my data. Also, I can test if a set of transcription factor target genes is enriched for some gene set of interest.

# Quickly aggregate your data with data.table in R

I wrote a function using data.table to replace the default aggregate function in R. It runs about 100 times faster on my data (0.32 seconds instead of 33 seconds). I use it for microarray gene expression data, where I compute the mean expression values for genes that are represented by more than one probe on the microarray.

# featureCounts requires identical mate ids

featureCounts, a read-counting program, requires identical mate ids to identify a pair of read mates as correctly paired. However, FASTQ files generated from an SRA file with fastq-dump have different mate ids for each mate in a pair. The forward and reverse mate ids end with .1 and .2, respectively. I wrote a bash function to fix BAM files with this problem.

# Make ribosomal RNA intervals for Picard CollectRnaSeqMetrics

Before you can use the CollectRnaSeqMetrics Picard tool, you must create a table of genomic intervals with the coordinates of all ribosomal genes in the genome. I wrote a bash script to prepare this ribosomal interval file from Gencode gene annotations.

# Join multiple PLINK dosage files into one file

If you have multiple PLINK dosage files and would like to merge them into one file, this script might save you some time.

# Autocomplete gene names with mygene.info and typeahead.js

I created an example showing how to use mygene.info with typeahead.js. It is possible to autocomplete gene names and retrieve every annotation you can think of (GO, Kegg, Ensembl, position, homologs, etc.).

# SNP Proxies

I made a simple web service to list proxies for SNPs in the 1000 Genomes Project. Check out an example of the JSON output. I used it to make an application powered by Highcharts for viewing the r2 linkage disequilibrium plot of any SNP in any population.

# Benchmark the binomial probability mass function in Python

It turns out that sympy and scipy have the slowest implementations of the binomial mass function. A pure Python version is just 30 times slower than C in my benchmark. Feel free to copy the code from the IPython notebook and test it for yourself.

# Create a quantile-quantile plot with ggplot2

After performing many tests for statistical significance, the next step is to check if any results are more significant than we would expect by random chance. One way to do this is by comparing the distribution of p-values from our tests to the uniform distribution with a quantile-quantile (QQ) plot. Here’s a function to create such a plot with ggplot2.

# How to ssh to a remote server without typing your password

Here are a few tips to use ssh more effectively. Login to your server using public key encryption instead of typing a password. Use the ~/.ssh/config file to create short and memorable aliases for your servers. Also, use aliases to connect through a login server into a work server.

# Count the number of coding base pairs in each Gencode gene

Use Python to count the coding base pairs in each Gencode gene. Here, the count is reported by gene rather than by transcript, so overlapping exons from multiple transcripts are merged before counting the base pairs.

# GTEx RNA-Seq Visualizations

I created three visualizations of RNA-Seq data from the GTEx project (version 2013-03-21). They’re powered by JBrowse, the WashU Epigenome Browser, and canvasXpress.

# 0-based and 1-based genomic intervals, overlap and distance

Here, I describe two kinds of genomic intervals and include source code for testing overlap and calculating distance between intervals.