Picard is a set of Java command-line tools for manipulating high-throughput sequencing (HTS) data files such as BAM and VCF. I needed to check the quality of thousands of BAM files, so I wrote a Bash script called picardmetrics. It runs 10 of the Picard tools on a BAM file and collates all of the generated metrics files into a single table. I also include utility scripts for generating the reference files that Picard requires.
I made a data package with human transcription factor target genes for use in R. It is a collection of data from three sources: TRED, ITFP, and ENCODE. I use it to test whether the targets of a transcription factor are differentially expressed in my data. I can also test whether a set of transcription factor target genes is enriched for some gene set of interest.
I wrote a function using data.table to replace the default aggregate function in R. It runs about 100 times faster on my data (0.32 seconds instead of 33 seconds). I use it for microarray gene expression data, where I compute the mean expression values for genes that are represented by more than one probe on the microarray.
featureCounts, a read-counting program, requires identical mate ids to identify a pair of read mates as correctly paired. However, FASTQ files generated from an SRA file with fastq-dump have a different mate id for each mate in a pair: the forward and reverse mate ids end with different suffixes. I wrote a Bash function to fix BAM files with this problem.
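A minimal sketch of this kind of fix, not the actual function: it removes a mate-specific suffix from the read name (column 1) of every alignment line in a SAM text stream, so both mates end up with the same id. The function name `strip_mate_suffix` is hypothetical, and the suffix pattern is passed in as a regular expression because the right pattern depends on how the FASTQ files were produced.

```shell
# Hypothetical helper: strip a mate-specific suffix from read names.
# Reads SAM text on stdin, writes SAM text on stdout.
# $1 is a regular expression matching the suffix to remove (an assumption
# supplied by the caller, e.g. '[.][12]$').
strip_mate_suffix() {
  awk -v re="$1" 'BEGIN { FS = OFS = "\t" }
    /^@/ { print; next }          # header lines pass through unchanged
    { sub(re, "", $1); print }    # edit only the read name in column 1
  '
}

# Possible use with samtools (not run here):
#   samtools view -h in.bam | strip_mate_suffix '[.][12]$' | samtools view -b -o out.bam -
```

Working on the text stream keeps the function independent of any particular BAM library. The example pattern `'[.][12]$'` is only an illustration and should be checked against the actual read names before use.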
Before you can use the CollectRnaSeqMetrics Picard tool, you must create a table of genomic intervals with the coordinates of all ribosomal genes in the genome. I wrote a Bash script to prepare this ribosomal interval file from Gencode gene annotations.
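The core of such a script might look like the sketch below. It is not the script itself. It assumes an uncompressed Gencode GTF on stdin, in which ribosomal genes carry the attribute `gene_type "rRNA"`, and it prints one tab-separated line per gene with chromosome, start, end, strand, and gene id. The function name `gtf_rrna_intervals` is hypothetical.

```shell
# Hypothetical helper: print interval lines for rRNA genes from a GTF.
# Output columns: chromosome, start, end, strand, gene id.
gtf_rrna_intervals() {
  awk 'BEGIN { FS = OFS = "\t" }
    $3 == "gene" && $9 ~ /gene_type "rRNA"/ {
      name = $9
      sub(/.*gene_id "/, "", name)   # drop everything up to the gene_id value
      sub(/".*/, "", name)           # drop the closing quote and the rest
      print $1, $4, $5, $7, name
    }'
}
```

GTF and Picard interval lists both use 1-based inclusive coordinates, so the start and end columns are copied without adjustment. A complete interval list also needs a SAM-style header whose sequence dictionary matches the BAM file, which this sketch does not produce.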
If you have multiple PLINK dosage files and would like to merge them into one file, this script might save you some time.