Count the number of coding base pairs in each Gencode gene

2013-12-23

#Python

We can use Python to count the coding base pairs in each Gencode gene. Here, we report the base pair count by gene rather than by transcript. When we encounter different transcripts for the same gene with overlapping exons, we only count those base pairs once rather than multiple times.

Download the gencode coordinates:

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

chr1  HAVANA  gene        11869  14412  .  +  .  gene_id "ENSG00000223972.4";
chr1  HAVANA  transcript  11869  14409  .  +  .  gene_id "ENSG00000223972.4";
chr1  HAVANA  exon        11869  12227  .  +  .  gene_id "ENSG00000223972.4";
chr1  HAVANA  exon        12613  12721  .  +  .  gene_id "ENSG00000223972.4";
chr1  HAVANA  exon        13221  14409  .  +  .  gene_id "ENSG00000223972.4";

Also, download the NCBI mappings from Entrez GeneID to Ensembl identifiers:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz

9606  1  ENSG00000121410  NM_130786.3     ENST00000263100  NP_570602.2     ENSP00000263100
9606  2  ENSG00000175899  NM_000014.4     ENST00000318602  NP_000005.2     ENSP00000323929
9606  3  ENSG00000256069  NR_040112.1     ENST00000543404  -               -
9606  9  ENSG00000171428  NM_000662.5     ENST00000307719  NP_000653.3     ENSP00000307218
9606  9  ENSG00000171428  XM_005273679.1  ENST00000517492  XP_005273736.1  ENSP00000429407

Run the script below to get the following output:

python coding_lengths.py \
  -g gencode.v19.annotation.gtf.gz -n gene2ensembl.gz -o output.gz

Ensembl_gene_identifier  GeneID  length
ENSG00000000005          64102   1339
ENSG00000000419          8813    1185
ENSG00000000457          57147   3755
ENSG00000000938          2268    3167

We can plot the output with R like so:

d = read.delim("output.gz")
png("coding_lengths.png")
plot(density(log10(d$length)),
     main="Gencode v19 lengths of coding regions by gene")
dev.off()

Gencode v19 coding gene lengths

Source code#

Download coding_lengths.py

Please note that this code requires a class called GTF, defined in GTF.py