Find the most abundant barcodes in FASTQ files

2021-04-22

Single-cell RNA-seq data contains oligonucleotide barcodes to uniquely identify each multiplexed sample, each single cell, and each individual molecule. Can we check which barcodes are present in a given FASTQ file? Maybe we can guess which 10x sample index was used during library preparation?

An example cellranger output folder

Suppose we ran cellranger mkfastq to generate an output folder like this:

$ ls mydata/data_output/210804_SL-NVF_0123_AHF5K3DSXY_fastqs/fastq_path

HF5K3DSXY/
Reports/
Stats/
Undetermined_S0_L001_I1_001.fastq.gz
Undetermined_S0_L001_R1_001.fastq.gz
Undetermined_S0_L001_R2_001.fastq.gz
Undetermined_S0_L002_I1_001.fastq.gz
Undetermined_S0_L002_R1_001.fastq.gz
Undetermined_S0_L002_R2_001.fastq.gz
Undetermined_S0_L003_I1_001.fastq.gz
Undetermined_S0_L003_R1_001.fastq.gz
Undetermined_S0_L003_R2_001.fastq.gz
Undetermined_S0_L004_I1_001.fastq.gz
Undetermined_S0_L004_R1_001.fastq.gz
Undetermined_S0_L004_R2_001.fastq.gz

Inside HF5K3DSXY/ we will find one folder for each sample of demultiplexed data:

$ ls HF5K3DSXY

batch1
batch2
batch3
batch4

Inside HF5K3DSXY/batch1 we will find the FASTQ files for that sample:

$ ls HF5K3DSXY/batch1

batch1_S1_L001_I1_001.fastq.gz
batch1_S1_L001_R1_001.fastq.gz
batch1_S1_L001_R2_001.fastq.gz
batch1_S1_L002_I1_001.fastq.gz
batch1_S1_L002_R1_001.fastq.gz
batch1_S1_L002_R2_001.fastq.gz
batch1_S1_L003_I1_001.fastq.gz
batch1_S1_L003_R1_001.fastq.gz
batch1_S1_L003_R2_001.fastq.gz
batch1_S1_L004_I1_001.fastq.gz
batch1_S1_L004_R1_001.fastq.gz
batch1_S1_L004_R2_001.fastq.gz

If we suspect that these FASTQ files are a bit smaller than they should be, then we may want to consider the possibility that we ran cellranger with the wrong sample index barcodes.

If we ran cellranger with a sample sheet that contained the wrong barcodes, then the reads that did not match the barcodes will have been saved into the Undetermined_*.fastq.gz files.
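One quick way to test that suspicion is to compare read counts between the demultiplexed files and the undetermined files. Here is a minimal sketch, assuming standard four-line FASTQ records; the function name count_reads is made up for illustration:

```shell
# count_reads FILE...: print the number of reads in each gzipped FASTQ file.
# Assumes four lines per record (no wrapped sequence lines).
count_reads() {
  for f in "$@"; do
    # At END, NR holds the total line count; each record spans 4 lines.
    gzip -cd "$f" | awk -v name="$f" 'END { printf "%s\t%d\n", name, NR / 4 }'
  done
}
```

For example, `count_reads Undetermined_S0_L001_I1_001.fastq.gz HF5K3DSXY/batch1/batch1_S1_L001_I1_001.fastq.gz` prints one tab-separated line per file. If the Undetermined files hold far more reads than the sample files, that supports the suspicion.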

Let's have a look at one of them:

$ gzip -cd Undetermined_S0_L001_I1_001.fastq.gz | head

@A00442:381:HF5K3DSXY:1:1101:2239:1016 1:N:0:NGGGGGGG
NGGGGGGG
+
#FFF:F:F
@A00442:381:HF5K3DSXY:1:1101:2709:1016 1:N:0:NGGGGGGG
NGGGGGGG
+
#FFFFFFF
@A00442:381:HF5K3DSXY:1:1101:3902:1016 1:N:0:NTAAGGTA
NTAAGGTA

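In each header line (the lines starting with @), the last colon-delimited field is the sample barcode (e.g., NGGGGGGG, NTAAGGTA). In this I1 file, it matches the index read on the following line.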

To learn more about the FASTQ file formats, read the bcl2fastq documentation.
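Pulling the barcode out of a header line can be sketched as follows. The helper name header_barcode is hypothetical, and it assumes the barcode is always the last colon-delimited field, as in the headers shown above:

```shell
# header_barcode: read FASTQ header lines on stdin, print the sample barcode.
# Splitting on ':' and taking the last field ($NF) avoids hard-coding
# the field position.
header_barcode() {
  awk -F: '{ print $NF }'
}
```

Running `echo '@A00442:381:HF5K3DSXY:1:1101:3902:1016 1:N:0:NTAAGGTA' | header_barcode` prints NTAAGGTA.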

Count sample barcodes in a FASTQ file

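Here's one way to count the abundance of each barcode and list the top 10 most abundant barcodes. Header lines are selected by position (every fourth line, starting at line 1), because a quality line can also start with @:

$ gzip -cd Undetermined_S0_L001_I1_001.fastq.gz \
 | awk 'NR % 4 == 1' | cut -d: -f10 | sort | uniq -c | sort -k1,1nr | head -n10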

35643135 GCGTACAC
31411951 ATTGCGTG
26271617 CGACTTGA
22104605 TACAGACT
6264332 AGGATCGA
5934908 CACGATTC
5873602 TCTCGACT
5714926 GTATCGAG
1774897 GGGGGGGG
184548 TGCGAACT

Each of the top 4 barcodes has 22-35M reads, while the 5th has only about 6M reads.
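The counts above come from lane 1 only. To pool all four lanes, something like the following sketch might work. The function name top_barcodes is made up for illustration, and it assumes every file has four-line records with the barcode as the last colon-delimited field of the header:

```shell
# top_barcodes N FILE...: count barcodes across all given gzipped FASTQ
# files and print the N most abundant ones as "count barcode".
top_barcodes() {
  n="$1"; shift
  # gzip -cd concatenates the decompressed files. Each file holds a
  # multiple of 4 lines, so NR % 4 == 1 still picks out header lines.
  gzip -cd "$@" \
    | awk -F: 'NR % 4 == 1 { count[$NF]++ }
               END { for (b in count) print count[b], b }' \
    | sort -k1,1nr \
    | awk -v n="$n" 'NR <= n'
}
```

For example: `top_barcodes 10 Undetermined_S0_L00*_I1_001.fastq.gz`. Counting inside awk avoids sorting every header line, which helps when the files hold hundreds of millions of reads.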

If we look for the top 4 barcodes in the index files from 10x Genomics, here's what we find:

# Single_Index_Kit_T_Set_A.csv
SI-GA-C5,CGACTTGA,TACAGACT,ATTGCGTG,GCGTACAC
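The lookup itself can be scripted. This is a rough sketch, not an official 10x tool: find_index_set is a hypothetical helper, and it assumes each CSV row has the index name in the first column followed by its oligo sequences, as in the row shown above:

```shell
# find_index_set CSV BARCODE...: print the rows of an index CSV in which
# every given barcode appears among the oligo columns.
find_index_set() {
  csv="$1"; shift
  awk -F, -v want="$*" '
    BEGIN { n = split(want, w, " ") }
    { sub(/\r$/, "") }                 # tolerate Windows line endings
    /^#/ { next }                      # skip comment lines
    {
      for (i = 1; i <= n; i++) {
        found = 0
        for (j = 2; j <= NF; j++) if ($j == w[i]) found = 1
        if (!found) next               # a barcode is missing: skip this row
      }
      print
    }' "$csv"
}
```

For example: `find_index_set Single_Index_Kit_T_Set_A.csv GCGTACAC ATTGCGTG CGACTTGA TACAGACT`. The order of the barcodes on the command line does not matter, since each one is checked against every oligo column.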

Conclusions

  • Since all four of the most abundant barcodes in the FASTQ file match the four SI-GA-C5 sequences, we might conclude that SI-GA-C5 is the sample index that was actually used for this data.

  • Since each of these barcodes has a large number of reads (22-35M) in the lane 1 Undetermined file alone, it is likely that our sample sheet was wrong when we ran cellranger mkfastq. We should re-run with a corrected sample sheet.