Find the most abundant barcodes in FASTQ files
Single-cell RNA-seq data contains oligonucleotide barcodes to uniquely identify each multiplexed sample, each single cell, and each individual molecule. Can we check which barcodes are present in a given FASTQ file? Maybe we can guess which 10x sample index was used during library preparation?
An example cellranger output folder
Suppose we ran cellranger mkfastq to generate an output folder like this:
$ ls mydata/data_output/210804_SL-NVF_0123_AHF5K3DSXY_fastqs/fastq_path
HF5K3DSXY/
Reports/
Stats/
Undetermined_S0_L001_I1_001.fastq.gz
Undetermined_S0_L001_R1_001.fastq.gz
Undetermined_S0_L001_R2_001.fastq.gz
Undetermined_S0_L002_I1_001.fastq.gz
Undetermined_S0_L002_R1_001.fastq.gz
Undetermined_S0_L002_R2_001.fastq.gz
Undetermined_S0_L003_I1_001.fastq.gz
Undetermined_S0_L003_R1_001.fastq.gz
Undetermined_S0_L003_R2_001.fastq.gz
Undetermined_S0_L004_I1_001.fastq.gz
Undetermined_S0_L004_R1_001.fastq.gz
Undetermined_S0_L004_R2_001.fastq.gz
Inside HF5K3DSXY/ we will find one folder for each sample of demultiplexed data:
$ ls HF5K3DSXY
batch1
batch2
batch3
batch4
Inside HF5K3DSXY/batch1 we will find the FASTQ files for that sample:
$ ls HF5K3DSXY/batch1
batch1_S1_L001_I1_001.fastq.gz
batch1_S1_L001_R1_001.fastq.gz
batch1_S1_L001_R2_001.fastq.gz
batch1_S1_L002_I1_001.fastq.gz
batch1_S1_L002_R1_001.fastq.gz
batch1_S1_L002_R2_001.fastq.gz
batch1_S1_L003_I1_001.fastq.gz
batch1_S1_L003_R1_001.fastq.gz
batch1_S1_L003_R2_001.fastq.gz
batch1_S1_L004_I1_001.fastq.gz
batch1_S1_L004_R1_001.fastq.gz
batch1_S1_L004_R2_001.fastq.gz
If we suspect that these FASTQ files are a bit smaller than they should be, then we may want to consider the possibility that we ran cellranger with the wrong sample index barcodes.
If we ran cellranger with a sample sheet that contained the wrong barcodes, then the reads that did not match the barcodes will have been saved into the Undetermined_*.fastq.gz files.
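One way to check whether the per-sample files are smaller than expected is to count the reads in each file and compare them with the Undetermined files. The sketch below assumes standard four-line FASTQ records; the function names count_reads and report_read_counts are made up for this example.

```shell
# Count the reads in one gzipped FASTQ file (each read spans 4 lines).
count_reads() {
    gzip -cd "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Print "read_count<TAB>file_name" for every FASTQ file given as an argument.
report_read_counts() {
    for fastq in "$@"; do
        printf '%s\t%s\n' "$(count_reads "$fastq")" "$fastq"
    done
}

# Example: compare one sample against the undetermined reads.
# report_read_counts HF5K3DSXY/batch1/*_I1_001.fastq.gz Undetermined_*_I1_001.fastq.gz
```

If the Undetermined files hold a large share of all reads, that supports the suspicion that the sample sheet listed the wrong barcodes.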
Let’s have a look at the first few lines of one of the Undetermined index files:
$ gzip -cd Undetermined_S0_L001_I1_001.fastq.gz | head
@A00442:381:HF5K3DSXY:1:1101:2239:1016 1:N:0:NGGGGGGG
NGGGGGGG
+
#FFF:F:F
@A00442:381:HF5K3DSXY:1:1101:2709:1016 1:N:0:NGGGGGGG
NGGGGGGG
+
#FFFFFFF
@A00442:381:HF5K3DSXY:1:1101:3902:1016 1:N:0:NTAAGGTA
NTAAGGTA
The last colon-delimited entry in each header line (the lines starting with @) has the sample barcode (e.g., NGGGGGGG, NTAAGGTA).
To learn more about the FASTQ file formats, read the bcl2fastq documentation.
Count sample barcodes in a FASTQ file
Here’s one way to count the abundance of each barcode and list the top 10 most abundant barcodes:
$ gzip -cd Undetermined_S0_L001_I1_001.fastq.gz \
| grep '^@' | cut -d: -f10 | sort | uniq -c | sort -k1rn | head -n10
35643135 GCGTACAC
31411951 ATTGCGTG
26271617 CGACTTGA
22104605 TACAGACT
6264332 AGGATCGA
5934908 CACGATTC
5873602 TCTCGACT
5714926 GTATCGAG
1774897 GGGGGGGG
184548 TGCGAACT
Each of the top 4 barcodes has 22-35M reads, while the 5th has only about 6M reads.
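A quality line in a FASTQ file may also begin with @, so matching on '^@' can, in principle, pick up lines that are not headers. A slightly more defensive variant, sketched here under the assumption of four-line FASTQ records, selects every fourth line instead. The function name top_barcodes is hypothetical.

```shell
# Print the N most abundant sample barcodes in a gzipped FASTQ file.
# Usage: top_barcodes FILE [N]    (N defaults to 10)
top_barcodes() {
    gzip -cd "$1" \
        | awk -F: 'NR % 4 == 1 { print $NF }' \
        | sort | uniq -c | sort -k1,1nr | head -n "${2:-10}"
}

# Example:
# top_barcodes Undetermined_S0_L001_I1_001.fastq.gz 10
```

Here awk keeps only the header line of each record (lines 1, 5, 9, ...) and prints its last colon-delimited field, which is the sample barcode.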
If we look for the top 4 barcodes in the index files from 10x Genomics, here’s what we find:
# Single_Index_Kit_T_Set_A.csv
SI-GA-C5,CGACTTGA,TACAGACT,ATTGCGTG,GCGTACAC
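The lookup can also be scripted. The sketch below assumes the index file has the layout shown above (a set name followed by comma-separated barcodes, with optional comment lines starting with #). The function name find_index_set is hypothetical.

```shell
# Print the name of each sample index set whose row in the CSV contains
# every barcode given on the command line.
# Usage: find_index_set CSV BARCODE [BARCODE...]
find_index_set() {
    csv=$1
    shift
    awk -F, -v wanted="$*" '
        BEGIN { n = split(wanted, barcodes, " ") }
        { sub(/\r$/, "") }                 # tolerate Windows line endings
        /^#/ { next }                      # skip comment lines
        {
            for (i = 1; i <= n; i++) {
                found = 0
                for (j = 2; j <= NF; j++) if ($j == barcodes[i]) found = 1
                if (!found) next           # a barcode is missing from this row
            }
            print $1
        }
    ' "$csv"
}

# Example:
# find_index_set Single_Index_Kit_T_Set_A.csv GCGTACAC ATTGCGTG CGACTTGA TACAGACT
```

A row is reported only when all of the given barcodes appear in it, which mirrors the manual check of the top 4 barcodes.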
Conclusions
- Since all four of the most abundant barcodes from the FASTQ file match the SI-GA-C5 sequences, we might conclude that SI-GA-C5 is the correct sample index to use for this data.
- Since each of these barcodes has a large number (22-35M) of reads in the Undetermined_*.fastq.gz files, it is likely that our sample sheet was wrong when we ran cellranger mkfastq. We should re-run with a corrected sample sheet.
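As a rough sketch of that re-run, the snippet below writes a sample sheet in the simple three-column CSV layout (Lane, Sample, Index) that cellranger mkfastq accepts. The sample name batch1, the lane range, the output id, and all file paths are placeholders: the data above does not tell us which sample SI-GA-C5 belongs to, so adjust the rows to match the actual experiment.

```shell
# Write a simple CSV sample sheet for cellranger mkfastq.
# ASSUMPTION: the sample name (batch1) and lanes (1-4) are placeholders.
write_samplesheet() {
    cat > "$1" <<'EOF'
Lane,Sample,Index
1-4,batch1,SI-GA-C5
EOF
}

# Example (paths and id are hypothetical):
# write_samplesheet samplesheet_corrected.csv
# cellranger mkfastq --id=corrected_fastqs \
#     --run=/path/to/bcl_run_folder \
#     --csv=samplesheet_corrected.csv
```

After the re-run, the same barcode count on the new Undetermined files should show far fewer reads for the SI-GA-C5 sequences.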