Bioinformatics tools for human gene variation

The study of sequence variation in human genes is crucial to understanding the relationship between genotype and phenotype. Sequence variants are normally stored in locus-specific databases (LSDBs), with each variant described with respect to a reference sequence. Until recently, the available reference sequences were not well suited to the specific needs of LSDBs; to address this, a new reference sequence format has been developed. The Locus Reference Genomic (LRG) format (Dalgleish et al., 2010) will greatly aid the curation of LSDBs, but more sophisticated tools are needed to fully exploit the power of the new sequence format.

The international ‘GEN2PHEN’ project, in which I am a partner, is building database components, tools and technologies to help integrate information pertaining to genome variation and human disease phenotype. Sadly, effective use of the tools produced by this project, and by others, is hampered because potential users feel that the learning curve is too steep.

A particular case in point is the creation of LSDBs of sequence variation in disease-causing genes. There are precious few such databases (~1,300) relative to the number of known human genes (~25,000) and to those genes for which sequence variants have been described in the literature (~3,500). Potential LSDB curators are probably put off becoming involved by the valid perception that the undertaking is too complex.

We intend to address this perception by critically analysing the factors that are a disincentive to the creation of LSDBs. One such factor, which is well recognised but poorly researched, is the common difficulty of extracting sequence-variant descriptions from published accounts, because journals do not insist on descriptions that follow the accepted Human Genome Variation Society (HGVS) variant nomenclature. Once this issue and other contributory factors are better understood, development of improved software tools and supporting training materials can proceed.

Project aims

  • Identifying the perceived barriers to the creation of LSDBs and determining the extent to which these barriers exist because of training issues and lack of guidance materials
  • Investigating why journals do not strictly mandate the adoption of standard mutation description nomenclature
  • Generating and evaluating improved tools and training materials

Specific bioinformatics activities

  • Investigating and validating improved methods for generating alignments of current and legacy reference sequences to aid in the interpretation of published variant data
  • Examining automated generation of nomenclature-compliant variant descriptions directly from sequence alignment tools
  • Investigating the submission of such variant descriptions directly into LSDBs

The proposed tools will have graphical interfaces to improve usability and there will be an emphasis on standards-compliance and the use or adaptation of existing code libraries.
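
To illustrate the second activity listed above, the following is a minimal Python sketch of how simplified HGVS-style descriptions might be derived from a gapped pairwise alignment. It is not the project's method: the function name describe_variants, the "g." prefix default and the example sequences are hypothetical, and the sketch covers only substitutions, simple deletions and simple insertions. It ignores duplications, deletion-insertions, the 3' shifting rule and coding (c.) numbering, all of which a nomenclature-compliant tool would have to handle.

```python
def describe_variants(ref_aln: str, obs_aln: str, prefix: str = "g.") -> list[str]:
    """Return simplified HGVS-style descriptions from a gapped alignment.

    ref_aln and obs_aln are the aligned reference and observed sequences,
    of equal length, with '-' marking gaps. Positions are 1-based and
    counted along the ungapped reference.
    """
    if len(ref_aln) != len(obs_aln):
        raise ValueError("aligned sequences must have equal length")

    variants = []
    ref_pos = 0  # position of the last reference base consumed
    i, n = 0, len(ref_aln)
    while i < n:
        r, o = ref_aln[i], obs_aln[i]
        if r == "-" and o == "-":
            i += 1  # column with gaps in both sequences carries no information
        elif r == "-":
            # Insertion: gather the run of reference gap columns.
            j = i
            while j < n and ref_aln[j] == "-":
                j += 1
            inserted = obs_aln[i:j].replace("-", "")
            # Note: an insertion before the first base would give position 0,
            # which is not valid HGVS; this sketch does not handle that case.
            variants.append(f"{prefix}{ref_pos}_{ref_pos + 1}ins{inserted}")
            i = j
        elif o == "-":
            # Deletion: gather the run of observed gap columns.
            j = i
            while j < n and obs_aln[j] == "-" and ref_aln[j] != "-":
                j += 1
            start, end = ref_pos + 1, ref_pos + (j - i)
            span = str(start) if start == end else f"{start}_{end}"
            variants.append(f"{prefix}{span}del")
            ref_pos = end
            i = j
        else:
            ref_pos += 1
            if r != o:
                variants.append(f"{prefix}{ref_pos}{r}>{o}")  # substitution
            i += 1
    return variants


# Hypothetical aligned sequences, for illustration only.
reference = "ACGTACGT--ACGT"
observed = "ACCTA--TGGACGT"
print(describe_variants(reference, observed))
# ['g.3G>C', 'g.6_7del', 'g.8_9insGG']
```

Walking the alignment column by column keeps the reference coordinate explicit, which is the part that most often goes wrong when variants described against a legacy reference sequence are reinterpreted against a current one.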


References

Dalgleish R et al. (2010) Genome Med 2, 24.

Howard HJ et al. (2009) Hum Mutat 31, 366-367.

Funding notes

There is no specific funding available for this project. Applicants should expect to provide their own funding and should have a degree in bioinformatics, preferably to Master's level.


Professor Raymond Dalgleish
+44 (0)116 252 3425