bioRxiv Article on Gene Annotation Bias


We previously published PubPular, a data science method that calculates the co-occurrence of genes and queried topics in PubMed articles, and suggested that the normalized co-publication distance (NCD) may be used as a metric to discover the connection between a gene/protein and a disease or physiological process.

A manuscript posted on the bioRxiv preprint server earlier this year examined how bibliometric methods such as PubPular may be used to assess the state of gene annotation in biomedical research. The article, authored by Haynes, Tomczak and Khatri at Stanford University, borrowed the Gini coefficient from economics to measure the inequality in annotation counts between different genes.

The Gini coefficient was developed by the statistician and sociologist Corrado Gini in 1912 as a measure of the dispersion of income distribution in a nation. In a perfectly egalitarian society, when residents are ranked from lowest to highest income, the cumulative share of wealth rises diagonally in a 1-to-1 ratio with the cumulative share of residents. The index is calculated as the ratio of the area between this ideal diagonal line and the actual measured curve of a nation’s wealth distribution to the total area under the diagonal, and it ranges from 0 (perfect equality) to 1 (maximal inequality).
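To make the calculation concrete, here is a minimal Python sketch of the coefficient. It uses the standard sorted-rank formula, which is equivalent to the area ratio for a finite sample. The function name gini and the input values are our own illustrative choices, not code or data from PubPular or the preprint.

```python
def gini(values):
    """Gini coefficient of non-negative values (incomes, annotation counts, ...)."""
    xs = sorted(values)  # rank from lowest to highest
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Sorted-rank form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # where i = 1..n runs over the values in ascending order.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Made-up values for illustration only.
equal_shares = [5, 5, 5, 5]      # everyone holds the same amount
one_holds_all = [0, 0, 0, 20]    # a single member holds everything

print(gini(equal_shares))
print(gini(one_holds_all))
```

Note that for a finite sample of n values the maximum attainable value is (n - 1) / n, which approaches 1 only as n grows.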

The same statistic can be calculated between the ranks of genes and the cumulative number of annotations. Using PubPular, Reactome, Gene Ontology and other data sources, the authors of the bioRxiv preprint found that there is a large inequality in gene annotations, where a handful of genes account for the bulk of all annotations. Moreover, there is a tendency for the rich to get richer, as biomedical researchers gravitate towards well-characterized genes in their research efforts.

To the interested reader, this raises a few follow-up questions. For instance, what is the remedy for this situation? We would argue that an increase in hypothesis-free and discovery-driven research may help shed more light on the involvement of unannotated orphan genes in various biological processes, since candidate targets in such studies are often nominated by their quantitative behavior (e.g., significant up- or down-regulation) rather than by prior knowledge. A similar view was advanced by the authors of the preprint, who urge researchers to “search outside the streetlight”, for example by exploring gene signatures in meta-analyses as the authors have done.

In addition, one may argue that more emphasis should be given to predicted annotations based on sequence and biophysical features. These can be generated automatically for all genes, which could help “equalize” the bias in gene annotations.

Having said that, a critical question that should also be asked is what the ideal scenario of gene annotation should look like. Whereas in terms of societal fairness it might be argued that the Gini index should ideally fall as close to 0 as possible, this is less clear in the case of gene annotations. If the network of biological molecules follows the power-law degree distribution of a scale-free network, the number of neighbors (degree) of each molecular node might reasonably be expected to vary greatly. In other words, if hub genes exist and are preferentially connected to other genes, they may also be preferentially involved in the physiological and pathological processes that are the subject of annotation.
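As a rough illustration of this point, the sketch below grows a toy network by preferential attachment (a Barabási-Albert-style model in which each new node links to one existing node with probability proportional to that node's degree) and then computes the Gini coefficient of the resulting node degrees. This is our own simplified assumption about how hubs might arise, not a model of any real gene network and not an analysis from the preprint; the function names, network size and seed are hypothetical.

```python
import random


def gini(values):
    """Gini coefficient of non-negative values, via the sorted-rank formula."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def preferential_attachment_degrees(n_nodes, seed=0):
    """Return node degrees of a toy network grown by preferential attachment."""
    rng = random.Random(seed)
    degrees = [1, 1]    # start with two nodes joined by one edge
    # Each node appears here once per edge endpoint, so a uniform draw
    # from this list is a draw proportional to current degree.
    endpoints = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        degrees[target] += 1
        degrees.append(1)  # the new node starts with its single edge
        endpoints.extend([target, new_node])
    return degrees


degrees = preferential_attachment_degrees(2000)  # toy size, arbitrary
print("largest degree:", max(degrees))
print("Gini of degrees:", round(gini(degrees), 3))
```

In a network like this, the degree distribution is far from equal even though no node is being "neglected", which is why a Gini index well above 0 for annotations is not by itself evidence of bias.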

The manuscript by Haynes et al. shows an interesting analysis of the state of PubMed literature. Link to article.

Image Credit - Markus Spiske
