Build bioinformatics pipelines with Snakemake
Snakemake is a Pythonic variant of GNU Make. I recently learned how to use it to build and launch bioinformatics pipelines on an LSF cluster, but I had trouble understanding the Snakemake documentation. I like to learn by trying simple examples, so this post walks you through a very simple pipeline step by step. If you already know how to use Snakemake, you might want to copy my Snakefiles for RNA-seq data analysis here.
Count the number of coding base pairs in each Gencode gene
We can use Python to count the coding base pairs in each Gencode gene. Here, we report the base pair count by gene rather than by transcript. When different transcripts of the same gene have overlapping exons, we count the shared base pairs only once.
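As a rough illustration of the "count overlapping base pairs once" idea, here is a minimal sketch. It assumes the CDS records have already been parsed out of the Gencode GTF into (gene_id, start, end) tuples with 1-based, inclusive coordinates, as in GTF. The function names count_merged_bases and coding_bases_by_gene are hypothetical and are not taken from the post itself.

```python
from collections import defaultdict


def count_merged_bases(intervals):
    """Count bases covered by 1-based, inclusive (start, end) intervals.

    Overlapping intervals are merged first, so shared bases count once.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # No overlap with the running interval: close it out.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlap: extend the running interval if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def coding_bases_by_gene(cds_records):
    """Group (gene_id, start, end) CDS records by gene and count merged bases.

    Assumes all records for one gene lie on the same chromosome.
    """
    by_gene = defaultdict(list)
    for gene_id, start, end in cds_records:
        by_gene[gene_id].append((start, end))
    return {gene: count_merged_bases(ivs) for gene, ivs in by_gene.items()}


# Made-up example records: two geneA transcripts share bases 150-200.
records = [
    ("geneA", 100, 200),
    ("geneA", 150, 250),
    ("geneA", 300, 310),
    ("geneB", 1, 10),
]
counts = coding_bases_by_gene(records)
print(counts)
```

Sorting the intervals first means a single pass is enough to merge them, which keeps the per-gene work at O(n log n).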
0-based and 1-based genomic intervals, overlap, and distance
Here, I describe two kinds of genomic intervals and include source code for testing overlap and calculating distance between intervals.
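For a quick feel for the two conventions, the sketch below assumes the usual pairing: 0-based intervals are half-open, as in BED, and 1-based intervals are closed, as in GTF. The function names are hypothetical, and "distance" is taken here to mean the number of bases lying strictly between two intervals, which may differ from the definition used in the post.

```python
def to_zero_based(start, end):
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start - 1, end)."""
    return start - 1, end


def to_one_based(start, end):
    """Convert a 0-based half-open interval [start, end) to 1-based closed [start + 1, end]."""
    return start + 1, end


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two 0-based half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def distance(a_start, a_end, b_start, b_end):
    """Return the number of bases strictly between two 0-based half-open intervals.

    Overlapping or adjacent intervals have distance 0 (an assumption of this sketch).
    """
    return max(0, max(a_start, b_start) - min(a_end, b_end))


# The 1-based closed interval [1, 10] covers the same ten bases as 0-based [0, 10).
print(to_zero_based(1, 10))
print(overlaps(0, 10, 9, 20))
print(overlaps(0, 10, 10, 20))
print(distance(0, 10, 15, 20))
```

With half-open intervals the length is simply end - start, and adjacent intervals such as [0, 10) and [10, 20) do not overlap, which is why the comparisons above use strict inequalities.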