Identifying Popular Proteins and Annotating Gene Functions

Project contributors:

Advances in large-scale “omics” approaches have led to an explosion of “gene list” data: lists of genes or proteins that discovery experiments implicate in biological models. Connecting these implicated molecules to useful knowledge is currently a major bottleneck in data interpretation. Gene Ontology (GO), KEGG, and other annotation sources are commonly used to analyze gene lists, but existing annotations tend to cover lower-level functional concepts well (e.g., whether a gene codes for a kinase or a cytosolic protein) while remaining less complete for higher-level physiological concepts (e.g., how relevant a gene is to gastrointestinal processes).

To help translate molecular data from “omics” studies into biological knowledge, we are working with HUPO investigators to create new annotation strategies that make sense of discovered genes and connect them to known disease processes. Our recent publications have demonstrated that public PubMed data can be mined to identify the semantic similarity between each gene/protein and higher physiological functions, and indeed any ontology or concept chosen by the user that is searchable on PubMed (Lam et al., JACC 2015 and J Proteome Res 2016). We currently host a web app, “PubPular”, that lets users specify a biomedical topic (e.g., cancer) and retrieve the genes most popularly associated with it. PubPular is accessible via

Figure: Popular proteins in the heart and in the liver vs. their weighted publication counts.
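
As a rough illustration of what a "weighted publication count" could look like, the sketch below scores genes against a topic from two inputs: the set of publications that match the topic, and a publication-to-gene mapping. The weighting used here (each paper contributes 1 divided by the number of genes linked to it), the function name, and the toy data are all assumptions made for illustration; the text above does not specify how PubPular computes its scores.

```python
from collections import defaultdict


def weighted_publication_counts(topic_pmids, pmid_to_genes):
    """Rank genes by a weighted count of topic publications.

    Assumed weighting (not taken from the source text): each topic
    publication contributes 1 / (number of distinct genes linked to it),
    so a paper that lists many genes adds little to any single gene.
    """
    scores = defaultdict(float)
    for pmid in topic_pmids:
        genes = set(pmid_to_genes.get(pmid, ()))
        for gene in genes:
            scores[gene] += 1.0 / len(genes)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data only: these publication IDs and gene links are made up.
heart_pmids = {101, 102, 103}
pmid_to_genes = {
    101: ["MYH7"],
    102: ["MYH7", "TNNT2"],
    103: ["TNNT2", "ALB", "GAPDH", "ACTB"],
    104: ["ALB"],  # not a topic publication, so it is ignored
}

ranking = weighted_publication_counts(heart_pmids, pmid_to_genes)
for gene, score in ranking:
    print(f"{gene}\t{score:.2f}")
```

In practice the two inputs would come from a PubMed topic search and a curated gene-to-publication resource. Down-weighting papers that mention many genes keeps large screening studies from dominating the ranking.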