0-based and 1-based genomic intervals, overlap, and distance
2013-08-07
Here, I describe two kinds of genomic intervals and include source code for testing overlap and calculating distance between intervals.
You will find files specifying genomic coordinates in two formats:
0-based : 0 1 2 3 4  (UCSC, BED, bedGraph, narrowPeak, BAM)
1-based :  1 2 3 4   (NCBI, Ensembl, GFF, GTF, VCF, SAM, wiggle)
sequence:  A T G C
0-based starts with 0 and numbers the spaces in between nucleotides.
1-based starts with 1 and numbers the nucleotides.
The subsequence TG of the full string ATGC is:
0-based : [1, 3)
1-based : [2, 3]
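Python slices are themselves 0-based and half-open, so a 0-based interval can be used as a slice directly, while a 1-based interval needs its start shifted down by one. Here is a minimal sketch of that; the helper names `subseq0` and `subseq1` are hypothetical and are not part of the source code further down.

```python
def subseq0(seq, interval):
    """Extract a 0-based, half-open interval [start, end) from seq."""
    start, end = interval
    return seq[start:end]


def subseq1(seq, interval):
    """Extract a 1-based, closed interval [start, end] from seq."""
    start, end = interval
    # Shift the start down by one; the inclusive 1-based end is the
    # same number as the exclusive slice end.
    return seq[start - 1:end]


seq = "ATGC"
print(subseq0(seq, (1, 3)))  # TG
print(subseq1(seq, (2, 3)))  # TG
```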
The 0-based style is half-open: the closing ) means the end position is not included.
The 1-based style is closed: the closing ] means the end position is included.
This results in different length calculations for the subsequence TG:
0-based : 3 - 1 = 2
1-based : 3 - 2 + 1 = 2
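Since the two styles differ only in how the start is counted, converting between them should only require shifting the start by one and leaving the end alone. The sketch below assumes intervals are plain `(start, end)` tuples; the function names are hypothetical and chosen just for illustration.

```python
def to_one_based(interval):
    """Convert a 0-based, half-open interval to 1-based, closed."""
    start, end = interval
    return (start + 1, end)


def to_zero_based(interval):
    """Convert a 1-based, closed interval to 0-based, half-open."""
    start, end = interval
    return (start - 1, end)


def length0(interval):
    """Length of a 0-based, half-open interval: end - start."""
    return interval[1] - interval[0]


def length1(interval):
    """Length of a 1-based, closed interval: end - start + 1."""
    return interval[1] - interval[0] + 1


tg0 = (1, 3)               # TG in ATGC, 0-based
tg1 = to_one_based(tg0)    # the same bases, 1-based
print(tg1)                          # (2, 3)
print(length0(tg0), length1(tg1))   # 2 2
```

The length is the same either way; only the arithmetic used to get it changes.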
Read further here: https://genome.ucsc.edu/FAQ/FAQformat.html
Example
>>> a, b = (1, 3), (3, 7)
>>> print_intervals0(a, b)
01234567890
 ==
   ====
>>> print_intervals1(a, b)
1234567890
===
  =====
>>> overlap0(a, b)
False
>>> overlap1(a, b)
True
>>> distance0(a, b)
0
>>> distance1(a, b)
-1
# 0-based intervals

def overlap0(a, b):
    """Check if two 0-based intervals overlap."""
    # a.start < b.end and a.end > b.start
    return a[0] < b[1] and a[1] > b[0]

def distance0(a, b):
    """Get the number of bases between two 0-based intervals, 0 if the
    intervals are book-ended against each other, or, if negative, the number
    of bases in the overlap (provided neither interval contains the other).
    """
    return max(a[0] - b[1], b[0] - a[1])

def print_intervals0(*intervals):
    """Print a ruler and draw each 0-based interval beneath it."""
    stop = max(i[1] for i in intervals)
    # Size the ruler by the largest end so it covers every interval.
    # Integer division (//) keeps the repeat count an int in Python 3.
    print('0' + '1234567890' * (stop // 10 + 1))
    for i in intervals:
        spaces = ' ' * i[0]
        marks = '=' * (i[1] - i[0])
        print(spaces + marks)
# 1-based intervals

def overlap1(a, b):
    """Check if two 1-based intervals overlap."""
    # a.start <= b.end and a.end >= b.start
    return a[0] <= b[1] and a[1] >= b[0]

def distance1(a, b):
    """Get the number of bases between two 1-based intervals, 0 if the
    intervals are book-ended against each other, or, if negative, the number
    of bases in the overlap (provided neither interval contains the other).
    """
    return max(a[0] - b[1], b[0] - a[1]) - 1

def print_intervals1(*intervals):
    """Print a ruler and draw each 1-based interval beneath it."""
    stop = max(i[1] for i in intervals)
    # Size the ruler by the largest end so it covers every interval.
    # Integer division (//) keeps the repeat count an int in Python 3.
    print('1234567890' * (stop // 10 + 1))
    for i in intervals:
        spaces = ' ' * (i[0] - 1)
        marks = '=' * (i[1] - i[0] + 1)
        print(spaces + marks)
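The distance functions return a negative value for overlapping intervals, but when one interval lies entirely inside the other that value is not the overlap size: for the 0-based intervals (0, 10) and (3, 5), max(0 - 5, 3 - 10) is -5 although only 2 bases are shared. If you need the overlap size itself, one possible approach is to clip the intervals against each other. This is a sketch under the same `(start, end)` tuple assumption, and the names `overlap_length0` and `overlap_length1` are hypothetical.

```python
def overlap_length0(a, b):
    """Number of bases shared by two 0-based, half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlap_length1(a, b):
    """Number of bases shared by two 1-based, closed intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


print(overlap_length0((1, 3), (3, 7)))   # 0: book-ended, nothing shared
print(overlap_length1((1, 3), (3, 7)))   # 1: both include base 3
print(overlap_length0((0, 10), (3, 5)))  # 2: the nested interval's length
```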