Wolfgang Resch - Notes

Matching up homologous genes from different organisms with HomoloGene

May 6, 2014

HomoloGene is an NCBI resource that constructs putative homology groups across species based on NCBI’s gene database. The resource can be queried interactively or through the E-utilities (eutils) API. In addition, the data can be downloaded from the HomoloGene FTP site as a long-form table, one row per gene, with the following columns:

HID (HomoloGene group id)
Taxonomy ID
Gene ID
Gene Symbol
Protein gi
Protein accession
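
As a minimal sketch of how one row of this table could be read in Python, assuming the file is tab-separated with exactly these six columns in this order (the names `HomoloGeneRow` and `parse_row` are made up for this illustration):

```python
import collections

# One record per line of the long-form table; field names are illustrative.
HomoloGeneRow = collections.namedtuple(
    "HomoloGeneRow",
    ["hid", "taxid", "geneid", "symbol", "protein_gi", "protein_acc"])


def parse_row(line):
    """Split one tab-separated line into a HomoloGeneRow."""
    hid, taxid, geneid, symbol, gi, acc = line.rstrip("\n").split("\t")
    # ids used for comparisons are converted to int; the rest stay strings
    return HomoloGeneRow(int(hid), int(taxid), int(geneid), symbol, gi, acc)


row = parse_row("31092\t9606\t4609\tMYC\t71774083\tNP_002458.2\n")
print(row.symbol, row.taxid)
```

Named fields make later filtering by `taxid` easier to read than positional indexing.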

For example, here are the rows for the HomoloGene group containing myc. The rows of interest are H. sapiens (taxid 9606) and M. musculus (taxid 10090):

hid    taxid  geneid  gene symbol  protein gi  protein accession
31092  9606   4609    MYC          71774083    NP_002458.2
31092  9598   464393  MYC          218563723   NP_001136266.1
31092  9544   694626  MYC          218847750   NP_001136345.1
31092  9615   403924  MYC          153070853   NP_001003246.2
31092  9913   511077  MYC          114050751   NP_001039539.1
31092  10090  17869   Myc          71834865    NP_034979.3
31092  10116  24577   Myc          71834866    NP_036735.2
31092  9031   420332  MYC          73661206    NP_001026123.1
31092  7955   30686   myca         153946419   NP_571487.2
31092  7955   393141  mycb         41055786    NP_956466.1
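
A single group can hold more than one gene from the same species, which is where 1-to-n mappings come from. As a small sketch, the (taxid, symbol) pairs of the group above can be bucketed by taxid to find such species (the variable names here are only illustrative):

```python
import collections

# (taxid, gene symbol) pairs copied from the myc group shown above
MYC_GROUP = [
    (9606, "MYC"), (9598, "MYC"), (9544, "MYC"), (9615, "MYC"),
    (9913, "MYC"), (10090, "Myc"), (10116, "Myc"), (9031, "MYC"),
    (7955, "myca"), (7955, "mycb"),
]

# taxid -> list of gene symbols from that species in this group
symbols_by_taxid = collections.defaultdict(list)
for taxid, symbol in MYC_GROUP:
    symbols_by_taxid[taxid].append(symbol)

# report species that contribute more than one gene to the group
for taxid, symbols in symbols_by_taxid.items():
    if len(symbols) > 1:
        print(taxid, symbols)
```

Here only taxid 7955 (zebrafish) has two entries, myca and mycb, so pairing this group between human and zebrafish would give two rows rather than one.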

The following Python script reads the downloaded table (path given as the first command-line argument) and writes a table of all matches between human and mouse genes, including 1-to-n mappings rather than just 1-to-1 mappings:

import sys
import collections

# NCBI taxonomy ids
MOUSE = 10090
HUMAN = 9606


def d():
    # empty group: lists of human ("h") and mouse ("m") gene ids
    return {"h": [], "m": []}


# hid -> {"h": [human gene ids], "m": [mouse gene ids]}
table = collections.defaultdict(d)

with open(sys.argv[1]) as infile:
    for i, line in enumerate(infile):
        # skip the first line of the file
        if i == 0:
            continue
        hid, taxid, geneid, symbol, _, _ = line.rstrip("\n").split("\t")
        taxid = int(taxid)
        if taxid == MOUSE:
            org = "m"
        elif taxid == HUMAN:
            org = "h"
        else:
            # ignore all other organisms
            continue
        table[hid][org].append(geneid)
print("read %i homologenes" % len(table), file=sys.stderr)

print("hid\tmouse\thuman")
skipped = 0
for hid, hgene in table.items():
    # omit the human genes that don't have a mouse entry
    # and mouse genes that don't have a human entry
    if len(hgene["h"]) == 0 or len(hgene["m"]) == 0:
        skipped += 1
        continue
    # one output row per (mouse, human) pair within the group
    for mouse in hgene["m"]:
        for human in hgene["h"]:
            print("%s\t%s\t%s" % (hid, mouse, human))
print("skipped %i entries" % skipped, file=sys.stderr)

In the case of human to mouse, HomoloGene build 67 had 17461 mappings for 16784 homologous groups.
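
There are more mappings than groups because a group with several human or mouse genes contributes one row for every (mouse, human) pair. A small sketch of that arithmetic, using made-up groups (the ids in `toy` are hypothetical, not real HomoloGene data):

```python
# hypothetical groups: hid -> mouse ("m") and human ("h") gene ids
toy = {
    "1": {"m": ["m1"], "h": ["h1"]},        # 1-to-1: one mapping
    "2": {"m": ["m2", "m3"], "h": ["h2"]},  # 2-to-1: two mappings
    "3": {"m": ["m4"], "h": []},            # no human gene: skipped
}

# keep only groups that have at least one gene from each species
groups = [g for g in toy.values() if g["m"] and g["h"]]

# each kept group yields len(mouse) * len(human) output rows
mappings = sum(len(g["m"]) * len(g["h"]) for g in groups)

print("%i mappings for %i homologous groups" % (mappings, len(groups)))
```

With these toy groups the result is 3 mappings for 2 homologous groups.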