Find the most abundant barcodes in FASTQ files
Single-cell RNA-seq data contains oligonucleotide barcodes to uniquely identify each multiplexed sample, each single cell, and each individual molecule. Can we check which barcodes are present in a given FASTQ file? Maybe we can guess which 10x sample index was used during library preparation?
An example cellranger output folder
Suppose we ran cellranger mkfastq to generate an output folder like this:
$ ls mydata/data_output/210804_SL-NVF_0123_AHF5K3DSXY_fastqs/fastq_path
HF5K3DSXY/  Reports/  Stats/
Undetermined_S0_L001_I1_001.fastq.gz  Undetermined_S0_L001_R1_001.fastq.gz  Undetermined_S0_L001_R2_001.fastq.gz
Undetermined_S0_L002_I1_001.fastq.gz  Undetermined_S0_L002_R1_001.fastq.gz  Undetermined_S0_L002_R2_001.fastq.gz
Undetermined_S0_L003_I1_001.fastq.gz  Undetermined_S0_L003_R1_001.fastq.gz  Undetermined_S0_L003_R2_001.fastq.gz
Undetermined_S0_L004_I1_001.fastq.gz  Undetermined_S0_L004_R1_001.fastq.gz  Undetermined_S0_L004_R2_001.fastq.gz
Inside HF5K3DSXY/ we will find one folder of demultiplexed data for each sample:
$ ls HF5K3DSXY
batch1  batch2  batch3  batch4
Inside HF5K3DSXY/batch1 we will find the FASTQ files for that sample:
$ ls HF5K3DSXY/batch1
batch1_S1_L001_I1_001.fastq.gz  batch1_S1_L001_R1_001.fastq.gz  batch1_S1_L001_R2_001.fastq.gz
batch1_S1_L002_I1_001.fastq.gz  batch1_S1_L002_R1_001.fastq.gz  batch1_S1_L002_R2_001.fastq.gz
batch1_S1_L003_I1_001.fastq.gz  batch1_S1_L003_R1_001.fastq.gz  batch1_S1_L003_R2_001.fastq.gz
batch1_S1_L004_I1_001.fastq.gz  batch1_S1_L004_R1_001.fastq.gz  batch1_S1_L004_R2_001.fastq.gz
If these FASTQ files look smaller than they should be, we should consider the possibility that we ran cellranger with the wrong sample index barcodes.
If we ran cellranger with a sample sheet that contained the wrong barcodes, then the reads that did not match those barcodes will have been saved into the Undetermined_*.fastq.gz files.
Let's have a look at one of them:
$ gzip -cd Undetermined_S0_L001_I1_001.fastq.gz | head
@A00442:381:HF5K3DSXY:1:1101:2239:1016 1:N:0:NGGGGGGG
NGGGGGGG
+
#FFF:F:F
@A00442:381:HF5K3DSXY:1:1101:2709:1016 1:N:0:NGGGGGGG
NGGGGGGG
+
#FFFFFFF
@A00442:381:HF5K3DSXY:1:1101:3902:1016 1:N:0:NTAAGGTA
NTAAGGTA
The last colon-delimited field in each header line (the lines that start with @) holds the sample barcode (e.g., NGGGGGGG in the first record).
To learn more about the FASTQ file formats, read the bcl2fastq documentation.
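To make the header layout concrete, here is a minimal sketch that splits each header into a few named fields. It assumes well-formed 4-line FASTQ records and bcl2fastq-style headers like the ones shown above (instrument:run:flowcell:lane:tile:x:y, a space, then read:filtered:control:barcode). The function name describe_headers is made up for this example.

```shell
# Hypothetical helper: read FASTQ text on stdin and print, for each record,
# the flowcell, lane, read number, and sample barcode (tab-separated).
# Assumes 4-line records, so header lines are lines 1, 5, 9, ...
describe_headers() {
  awk 'NR % 4 == 1 {
         split($0, halves, " ")            # "@instrument:...:y" and "read:...:barcode"
         split(halves[1], id, ":")         # id[3] = flowcell, id[4] = lane
         k = split(halves[2], tag, ":")    # tag[1] = read number, tag[k] = barcode
         print id[3] "\t" id[4] "\t" tag[1] "\t" tag[k]
       }'
}
```

For example, gzip -cd Undetermined_S0_L001_I1_001.fastq.gz | head -n 8 | describe_headers would print one line for each of the first two records.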
Count sample barcodes in a FASTQ file
Here's one way to count the abundance of each barcode and list the top 10 most abundant barcodes:
$ gzip -cd Undetermined_S0_L001_I1_001.fastq.gz \
  | grep '^@' | cut -d: -f10 | sort | uniq -c | sort -k1rn | head -n10
35643135 GCGTACAC
31411951 ATTGCGTG
26271617 CGACTTGA
22104605 TACAGACT
 6264332 AGGATCGA
 5934908 CACGATTC
 5873602 TCTCGACT
 5714926 GTATCGAG
 1774897 GGGGGGGG
  184548 TGCGAACT
Each of the top 4 barcodes has 22-35M reads, whereas the 5th has only about 6M reads.
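The pipeline above reads a single lane and picks header lines with grep '^@'. That works for this data, but '@' is also a valid quality character, so a quality line can begin with it in other datasets. A possible variant, sketched below, selects header lines by position instead and accepts several files at once. It assumes well-formed 4-line records, and the function name top_barcodes is invented for this example. Its output is "count barcode" without the padding that uniq -c adds.

```shell
# Hypothetical helper: top_barcodes N FILE.fastq.gz [FILE.fastq.gz ...]
# Prints the N most abundant sample barcodes as "<count> <barcode>".
# Assumes every file holds complete 4-line FASTQ records.
top_barcodes() {
  top_n=$1
  shift
  gzip -cd "$@" \
    | awk 'NR % 4 == 1 { k = split($0, f, ":"); count[f[k]]++ }   # last field = barcode
           END { for (b in count) print count[b], b }' \
    | sort -k1,1nr -k2,2 \
    | head -n "$top_n"
}
```

For example, top_barcodes 10 Undetermined_S0_L00?_I1_001.fastq.gz would tally the index reads from all four lanes together.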
If we look for the top 4 barcodes in the index files from 10x Genomics, here's what we find:
# Single_Index_Kit_T_Set_A.csv
SI-GA-C5,CGACTTGA,TACAGACT,ATTGCGTG,GCGTACAC
Since all four of the most abundant barcodes from the FASTQ file match the SI-GA-C5 sequences, we might conclude that SI-GA-C5 is the correct sample index to use for this data.
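The lookup in the index file can also be scripted. The sketch below assumes the CSV has the layout shown above (a set name followed by its index sequences, one set per line) and reports every set that contains at least one of the given barcodes, together with how many of its sequences matched. The function name match_sample_index is made up for this example.

```shell
# Hypothetical helper: match_sample_index INDEX.csv BARCODE [BARCODE ...]
# Prints "<set name> <matched>/<total>" for each row with at least one match.
# Assumes rows of the form NAME,SEQ1,SEQ2,...; lines starting with # are skipped.
match_sample_index() {
  index_csv=$1
  shift
  awk -F, -v wanted="$*" '
    BEGIN { n = split(wanted, w, " "); for (i = 1; i <= n; i++) seen[w[i]] = 1 }
    { sub(/\r$/, "") }                      # tolerate Windows line endings
    /^#/ || NF < 2 { next }
    { hits = 0
      for (i = 2; i <= NF; i++) if ($i in seen) hits++
      if (hits > 0) print $1, hits "/" (NF - 1) }
  ' "$index_csv"
}
```

With the row shown above, match_sample_index Single_Index_Kit_T_Set_A.csv GCGTACAC ATTGCGTG CGACTTGA TACAGACT would report SI-GA-C5 with all four of its sequences matched.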
Since we have a large number (22-35M) of reads in the Undetermined_*.fastq.gz files, it is likely that our sample sheet was wrong when we ran cellranger mkfastq. We should re-run with a corrected sample sheet.
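As a rough sketch of what the corrected input could look like: cellranger mkfastq accepts a simple CSV sample sheet with Lane, Sample, and Index columns, where Lane may be * for all lanes. The helper below prints such a sheet from sample=index pairs. The function name make_samplesheet and the sample name my_sample are placeholders, and which of our samples actually belongs to SI-GA-C5 is something the barcode counts alone do not tell us.

```shell
# Hypothetical helper: make_samplesheet SAMPLE=INDEX [SAMPLE=INDEX ...]
# Prints a simple CSV sample sheet (all lanes) on stdout.
make_samplesheet() {
  echo 'Lane,Sample,Index'
  for pair in "$@"; do
    # Text before the first "=" is the sample name, text after it is the index set.
    printf '*,%s,%s\n' "${pair%%=*}" "${pair#*=}"
  done
}
```

For example, make_samplesheet my_sample=SI-GA-C5 > samplesheet.csv writes a one-sample sheet, which could then be passed to cellranger mkfastq with --csv=samplesheet.csv alongside the --run and --id arguments used in the original run.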