bioRxiv article on Gene Annotation Bias

Bioinformatics News

Oct. 20, 2017, 3:57 p.m.


We previously published a data science method, PubPular, that calculates the co-occurrence of genes and queried topics in PubMed articles, and suggested that the normalized co-publication distance (NCD) may be used as a metric to discover the connection between a gene/protein and a disease or physiological process.

A manuscript posted on the bioRxiv preprint server earlier this year examined how bibliometric methods such as PubPular may be used to determine the state of gene annotation in biomedical research. The article, authored by Haynes, Tomczak and Khatri at Stanford University, borrowed the Gini coefficient from economics to measure the inequality in annotation counts between different genes.

The Gini coefficient was developed by the statistician and sociologist Corrado Gini in 1912 as a statistic on the dispersion of the income distribution in a nation. In a perfectly egalitarian society, when residents are ranked from lowest to highest income, the cumulative share of income rises diagonally in a 1-to-1 ratio with the cumulative share of residents. The index is calculated as the ratio of the area between this ideal diagonal line and the actual measured curve of a nation’s distribution to the total area under the diagonal, and it ranges from 0 (perfect equality) to 1 (maximal inequality).

The same statistic can be calculated between the ranks of genes and the cumulative number of annotations. Using PubPular, Reactome, Gene Ontology and other data sources, the authors of the bioRxiv preprint found that there is a large inequality in gene annotations, with a handful of genes accounting for the bulk of annotations. Moreover, there is a tendency for the rich to get richer, as biomedical researchers gravitate towards well-characterized genes in their research efforts.
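As a rough illustration of the statistic described above, the Gini coefficient of a list of per-gene annotation counts can be sketched in a few lines of Python. This is a minimal sketch and not the code used by Haynes et al.; the gene names and counts below are made up for illustration only. It uses the rank-weighted sum form of the index, which gives the same value as the area ratio against the diagonal. For a finite list of n genes the maximum attainable value is (n - 1)/n, not exactly 1.

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    Returns 0 when every gene has the same count and approaches 1 when
    a single gene holds nearly all of the annotations.
    """
    values = sorted(counts)  # rank genes from least to most annotated
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Rank-weighted sum: equivalent to the area between the diagonal
    # and the cumulative-share curve, divided by the area under the diagonal.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical annotation counts (illustrative only, not real data).
annotation_counts = {
    "GENE_A": 950,
    "GENE_B": 30,
    "GENE_C": 12,
    "GENE_D": 5,
    "GENE_E": 3,
}

print(f"Gini of annotation counts: {gini(annotation_counts.values()):.2f}")
```

A list in which every gene has the same number of annotations gives 0, while the skewed toy list above gives a value much closer to 1.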

For the interested reader, this raises a few follow-up questions. For instance, what is the remedy for this situation? We would argue that an increase in hypothesis-free, discovery-driven research may help shed more light on the involvement of unannotated orphan genes in various biological processes, because candidate targets in such studies are often nominated by their quantitative behaviors (e.g., significant up- or down-regulation) rather than by prior knowledge. A similar view was advanced by the authors of the preprint, who urge researchers to “search outside the streetlight”, for example by exploring gene signatures in meta-analyses as the authors have done.

In addition, one may argue that more emphasis could be given to predicted annotations based on sequence and biophysical features, which can be generated automatically for all genes and may therefore help “equalize” the bias in gene annotations.

Having said that, a critical question that should also be asked is what the ideal scenario of gene annotation should look like. Whereas in terms of societal fairness it might be argued that the Gini index should ideally fall as close to 0 as possible, this is not as clear in the case of gene annotations. If the network of biological molecules follows the power law of a scale-free network, the number of neighbors (degree) of each molecular node might reasonably be expected to vary greatly. In other words, if hub genes exist and are preferentially connected to other genes, they may also be preferentially involved in the physiological and pathological processes that are the subject of annotation.
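To make this point concrete, here is a toy simulation, offered as a sketch under stated assumptions and not as an analysis from the preprint. It grows a network by simple preferential attachment (each new node links to one existing node with probability proportional to that node's degree), which is one textbook way to obtain a heavy-tailed, hub-dominated degree distribution. The function names, the network size and the random seed are arbitrary choices for illustration, and nothing here claims that real molecular networks follow this model.

```python
import random


def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly equal)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def preferential_attachment_degrees(n_nodes, seed=0):
    """Return node degrees of a network grown by preferential attachment.

    Each new node attaches to one existing node, chosen with probability
    proportional to that node's current degree.
    """
    rng = random.Random(seed)
    # Start from two connected nodes. 'endpoints' lists every edge endpoint,
    # so a uniform draw from it is a degree-proportional draw over nodes.
    degrees = [1, 1]
    endpoints = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        degrees[target] += 1
        degrees.append(1)
        endpoints.extend([target, new_node])
    return degrees


degrees = preferential_attachment_degrees(5000)
print(f"largest hub degree: {max(degrees)}")
print(f"Gini of node degrees: {gini(degrees):.2f}")
```

In this toy network the degree distribution is clearly unequal even though no node is "neglected" by design, which is why a Gini index of 0 is not an obvious target for annotation counts.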

The manuscript by Haynes et al. shows an interesting analysis of the state of PubMed literature. Link to article.

Image Credit - Markus Spiske