Matching up homologous genes from different organisms with HomoloGene
May 6, 2014

HomoloGene is an NCBI resource that constructs putative homology groups across species based on NCBI’s Gene database. It can be queried interactively or through the E-utilities (eutils) API. The data can also be downloaded from the HomoloGene FTP site as a long-form table with the following columns:
HID (HomoloGene group id)
Taxonomy ID
Gene ID
Gene Symbol
Protein gi
Protein accession
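As a minimal sketch, the six columns above can be read into named records with Python's `csv` module. The field names and the `parse_homologene` helper are illustrative labels of my own, not part of any NCBI tooling, and the input is assumed to be tab-separated data lines (a header line, if present, would need to be skipped first).

```python
import csv
from collections import namedtuple

# Field names are illustrative labels for the six columns listed above.
HomoloGeneRow = namedtuple(
    "HomoloGeneRow",
    ["hid", "taxid", "geneid", "symbol", "protein_gi", "protein_accession"],
)


def parse_homologene(lines):
    """Yield one HomoloGeneRow per tab-separated data line (hypothetical helper)."""
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        hid, taxid, geneid, symbol, gi, acc = fields
        yield HomoloGeneRow(int(hid), int(taxid), int(geneid), symbol, gi, acc)


# Two rows taken from the myc group, written as tab-separated lines.
sample = [
    "31092\t9606\t4609\tMYC\t71774083\tNP_002458.2\n",
    "31092\t10090\t17869\tMyc\t71834865\tNP_034979.3\n",
]
rows = list(parse_homologene(sample))
print(rows[0].symbol, rows[1].taxid)
```

Converting the IDs to integers up front makes later comparisons against taxonomy IDs straightforward.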
For example, here are the rows for the HomoloGene group containing myc. The rows of interest here are H. sapiens (taxid 9606) and M. musculus (taxid 10090):
hid | taxid | geneid | gene symbol | protein id | protein accession |
---|---|---|---|---|---|
31092 | 9606 | 4609 | MYC | 71774083 | NP_002458.2 |
31092 | 9598 | 464393 | MYC | 218563723 | NP_001136266.1 |
31092 | 9544 | 694626 | MYC | 218847750 | NP_001136345.1 |
31092 | 9615 | 403924 | MYC | 153070853 | NP_001003246.2 |
31092 | 9913 | 511077 | MYC | 114050751 | NP_001039539.1 |
31092 | 10090 | 17869 | Myc | 71834865 | NP_034979.3 |
31092 | 10116 | 24577 | Myc | 71834866 | NP_036735.2 |
31092 | 9031 | 420332 | MYC | 73661206 | NP_001026123.1 |
31092 | 7955 | 30686 | myca | 153946419 | NP_571487.2 |
31092 | 7955 | 393141 | mycb | 41055786 | NP_956466.1 |
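The table above already shows why 1-to-1 mappings are not always enough: taxid 7955 (zebrafish) contributes two genes, myca and mycb, to the same group. The following sketch groups the rows by taxid and pairs the human gene with each zebrafish gene. The `myc_group` list is copied by hand from the table, and the variable names are my own.

```python
from collections import defaultdict
from itertools import product

# (taxid, geneid, symbol) for HomoloGene group 31092, copied from the table above.
myc_group = [
    (9606, 4609, "MYC"),
    (9598, 464393, "MYC"),
    (9544, 694626, "MYC"),
    (9615, 403924, "MYC"),
    (9913, 511077, "MYC"),
    (10090, 17869, "Myc"),
    (10116, 24577, "Myc"),
    (9031, 420332, "MYC"),
    (7955, 30686, "myca"),
    (7955, 393141, "mycb"),
]

# Collect gene symbols per organism.
by_taxid = defaultdict(list)
for taxid, geneid, symbol in myc_group:
    by_taxid[taxid].append(symbol)

# Organisms with more than one gene in the group.
multi = {taxid: syms for taxid, syms in by_taxid.items() if len(syms) > 1}
print(multi)

# Every human gene paired with every zebrafish gene: a 1-to-2 mapping.
pairs = list(product(by_taxid[9606], by_taxid[7955]))
print(pairs)
```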
The following Python script creates a table of all matches between human and mouse, including 1-to-n mappings rather than just 1-to-1 mappings. It takes the path to the downloaded table as its only argument:
import sys
import collections

# taxids
MOUSE = 10090
HUMAN = 9606


def d():
    # one list of gene ids per organism for each homology group
    return {"h": [], "m": []}


table = collections.defaultdict(d)
with open(sys.argv[1]) as infile:
    for i, line in enumerate(infile):
        # the first line is skipped (treated as a header row)
        if i == 0:
            continue
        hid, taxid, geneid, symbol, _, _ = line.rstrip("\n").split("\t")
        taxid = int(taxid)
        if taxid == MOUSE:
            org = "m"
        elif taxid == HUMAN:
            org = "h"
        else:
            continue
        table[hid][org].append(geneid)

print("read %i homologenes" % len(table), file=sys.stderr)
print("hid\tmouse\thuman")
skipped = 0
for hid, hgene in table.items():
    # omit the human genes that don't have a mouse entry
    # and mouse genes that don't have a human entry
    if len(hgene["h"]) == 0 or len(hgene["m"]) == 0:
        skipped += 1
        continue
    # emit every mouse/human combination within the group
    for mouse in hgene["m"]:
        for human in hgene["h"]:
            print("%s\t%s\t%s" % (hid, mouse, human))
print("skipped %i entries" % skipped, file=sys.stderr)
In the case of human to mouse, HomoloGene build 67 had 17461 mappings for 16784 homologous groups.
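The difference between the two numbers (17461 - 16784 = 677) is the number of extra mappings that come from groups with more than one mouse/human pair. As a rough sketch, counts like these could be derived from the script's output rows as follows. The `summarize` function is a hypothetical helper and the toy rows are made-up IDs, not real HomoloGene data.

```python
from collections import Counter


def summarize(pairs):
    """Count mappings, groups, and groups with more than one mapping.

    pairs: iterable of (hid, mouse_geneid, human_geneid) tuples.
    """
    per_group = Counter(hid for hid, _, _ in pairs)
    n_mappings = sum(per_group.values())
    n_groups = len(per_group)
    n_multi = sum(1 for n in per_group.values() if n > 1)
    return n_mappings, n_groups, n_multi


# Made-up toy rows: group "1" is 1-to-1, group "2" has two mouse genes.
toy = [
    ("1", "100", "200"),
    ("2", "101", "201"),
    ("2", "102", "201"),
]
n_mappings, n_groups, n_multi = summarize(toy)
print(n_mappings, n_groups, n_multi)
```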