Molecular Proteomics Laboratory

Oct. 20, 2017, 3:57 p.m.

Maggie-Lam

We previously published a data science method, PubPular, that calculates the co-occurrence of genes and queried topics in PubMed articles, and suggested that the normalized co-publication distance (NCD) may be used as a metric to discover connections between a gene/protein and a disease or physiological process.

A manuscript posted on the bioRxiv preprint server earlier this year examined how bibliometric methods such as PubPular may be used to assess the state of gene annotation in biomedical research. The article, authored by Haynes, Tomczak and Khatri at Stanford University, borrowed the Gini coefficient from economics to measure the inequality in annotation counts between different genes.

The Gini coefficient was developed by the statistician and sociologist Corrado Gini in 1912 as a measure of the dispersion of income in a nation. In a perfectly egalitarian society, when residents are ranked from lowest to highest income, the cumulative share of income rises diagonally in a 1-to-1 ratio with the cumulative share of residents. The index is calculated as the ratio of the area between this ideal diagonal line and the curve actually measured for a nation's distribution to the total area under the diagonal, and it ranges from 0 (perfect equality) to 1 (maximal inequality).
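As a minimal sketch of that area-ratio definition, the coefficient can be computed from a sorted list of values. The function name `gini` and the toy inputs are our own choices for illustration, not code from the preprint.

```python
def gini(values):
    """Return the Gini coefficient of a collection of non-negative numbers."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted sum over the sorted values; algebraically the same as
    # 1 - 2 * (trapezoidal area under the cumulative-share curve).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([10, 10, 10, 10]))  # 0.0: every member holds an equal share
print(gini([0, 0, 0, 40]))     # 0.75: one member holds everything
```

With a finite sample of n values the maximum is (n - 1) / n rather than exactly 1, which is why the second toy example gives 0.75.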

The same statistic can be calculated between the ranks of genes and their cumulative number of annotations. Using PubPular, Reactome, Gene Ontology and other data sources, the authors of the bioRxiv preprint found a large inequality in gene annotations, where a handful of genes account for the bulk of all annotations. Moreover, there is a tendency for the rich to get richer, as biomedical researchers gravitate towards well-characterized genes in their research efforts.
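To make "a handful of genes hold the bulk of annotations" concrete, one can ask what share of all annotations belongs to the most-annotated fraction of genes. The sketch below uses placeholder gene names and counts that are invented purely for illustration; they are not values from the preprint or from any database, and `top_share` is a hypothetical helper.

```python
def top_share(counts, fraction):
    """Share of all annotations held by the most-annotated `fraction` of genes."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # number of genes in the top group
    return sum(ranked[:k]) / sum(ranked)


# Invented annotation counts for ten placeholder genes (illustration only)
hypothetical_counts = {
    "GENE_01": 500, "GENE_02": 300, "GENE_03": 100, "GENE_04": 40, "GENE_05": 20,
    "GENE_06": 15, "GENE_07": 10, "GENE_08": 8, "GENE_09": 5, "GENE_10": 2,
}

share = top_share(hypothetical_counts.values(), 0.2)
print(f"Top 20% of genes hold {share:.0%} of annotations")  # 80%
```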

For the interested reader, this raises a few follow-up questions; for instance, what is the remedy for this situation? We would argue that an increase in hypothesis-free, discovery-driven research may help shed more light on the involvement of unannotated orphan genes in various biological processes, because candidate targets in such studies are often nominated by their quantitative behaviors (e.g., significant up- or down-regulation) rather than by prior knowledge. The authors of the preprint advance a similar view, urging researchers to “search outside the streetlight”, for example by exploring gene signatures from meta-analyses as they have done.

In addition, one may argue that more emphasis should be given to predicted annotations based on sequence and biophysical features, which can be generated automatically for all genes and can help “equalize” the bias in gene annotations.

Having said that, a critical question that should also be asked is what the ideal state of gene annotation should look like. Whereas in terms of societal fairness it might be argued that the Gini index should ideally fall as close to 0 as possible, this is less clear in the case of gene annotations. If the network of biological molecules follows the power law of a scale-free network, the number of neighbors (degree) of each molecular node might reasonably be expected to vary greatly. In other words, if hub genes exist and are preferentially connected to other genes, they may also be preferentially involved in the physiological and pathological processes that are the subject of annotation.
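A toy simulation can illustrate why 0 may not be the right target. The sketch below grows a network by preferential attachment (a Barabási-Albert-style rule, which is our assumption for illustration and not a model taken from the preprint), then computes the Gini coefficient of the node degrees. All function names, the network size and the seed are hypothetical choices.

```python
import random


def gini(values):
    """Gini coefficient of non-negative numbers (0 means perfectly equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def preferential_attachment_degrees(n_nodes, seed=0):
    """Grow a tree where each new node links to one existing node,
    chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    # Every edge adds both of its endpoints to this list, so a uniform draw
    # from it selects a node in proportion to its degree.
    endpoints = [0, 1]  # start from a single edge between nodes 0 and 1
    degrees = [1, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        degrees[target] += 1
        degrees.append(1)
        endpoints.extend([target, new_node])
    return degrees


degrees = preferential_attachment_degrees(2000)
print("largest degree:", max(degrees))
print("Gini of degrees:", round(gini(degrees), 3))
```

In this toy model every connection is fully "known", yet the degrees are still far from equal, so a Gini coefficient well above 0 would be expected even with unbiased annotation. How much of the observed annotation inequality reflects biology rather than research bias is left open here.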

The manuscript by Haynes et al. shows an interesting analysis of the state of PubMed literature. Link to article.

Image Credit - Markus Spiske

bioRxiv article on Gene Annotation Bias
