BIOINFORMATICS


What is Bioinformatics?

Bioinformatics is the study of molecular biology as an information science. Specifically, it studies the application of computers to the understanding and, eventually, manipulation of the molecules of life.

Over the past decade, but particularly over the past ten years, there has been an explosion of work being carried out at the interface between computing and molecular biology. Part of the motivation for this new interest is the large (and growing) amount of information coming from laboratories, including those working on the human genome and related projects, which can only be managed by use of computer technology. However, computers also offer a range of tools and analyses that were not possible in the past. Experiments are done "in silico" whose predictions are subsequently tested by bench biologists. From the point of view of the computer scientist, bioinformatics presents a range of extremely interesting and very challenging problems. These problems are generally very machine intensive.

International Society for Computational Biology

Established in 1997, the International Society for Computational Biology is dedicated to advancing the scientific understanding of living systems through computation; its emphasis is on the role of computing and informatics in advancing molecular biology. The Society works to achieve these aims through its sponsorship of meetings, principally: Intelligent Systems in Molecular Biology (ISMB) Pacific Symposium on Bioinformatics (PSB) and Recomb, and through. its affiliated journals (Bioinformatics and Journal of Computational Biology). It also provides a number of other benefits of members. The Society can also be reached via its web site: www.iscb.org.

Wise Group Research in Bioinformatics

Research in bioinformatics/computational biology has been active in a group led by Assoc. Prof. Michael Wise since the start of 1993, first in the Department of Computer Science, Sydney University (latterly renamed School of Information Technologies), then, between 1999 and mid 2004 at Pembroke College, Cambridge and Department of Genetics, University of Cambridge. Since August 2004 the Wise Laboratory has moved to the School of Biomedical and Chemical Sciences, University of Western Australia.

Conspectus of Bioinformatics Projects

A number of projects are currently underway or have been completed within the Bioinformatics Research group.
Analysis of Proteins by Peptide Composition
The Meaning of Low Complexity Peptides
Recognizing Microbial Communities using Restriction Fragments
Simulation of Metabolic Pathways
Python Protein Annotators' Assistant
Alignment Algorithms Revisited
Characterizing the 'Best' Proteases for Protein-Mass Fingerprinting
Neweyes: A System for Comparing Biological Sequences Using the Running
Python/Tk Viewer for the NCBI Taxonomy Database

Analysis of Proteins by Peptide Composition

The method of choice for most biologists faced with protein sequence data is to compare their sequences against those in a protein database such as SwissProt using the Smith-Waterman algorithm, e.g. Scanps or approximations to the Smith-Waterman algorithm, such as BLAST. If several closely related sequences are known, more sensitive searches can be undertaking by first constructing a multiple-sequence alignment, e.g. using Clustal-W. While these search tools have become very powerful, for each method there are caveats that users need to be aware of in order to make best use of the most appropriate tool for the data they have, and the scientific question they wish to address.

An alternative to these (with its own strength and weaknesses) is The POPPs, a suite of inter-related software tools which allow the user to discover what is statistically "unusual" in the composition of an unknown protein, or to automatically cluster proteins into families based on peptide composition. In addition, the user can search for related proteins based on peptide composition. Techniques based on peptide composition are effective because protein sequences are far from random, particularly in portions of the proteins such as catalytic domains that are under strong conservation pressure. The POPPs refines such comparisons by basing them peptide probabilities rather than raw counts, thus removing biases due to the inherent likelihood of certain peptides over others. Statistically based peptide composition also provides a view of proteins that is, to some extent, orthogonal to that provided by sequence. In a proof-of-concept study, The POPP suite is able to regroup into their families sets of approximately 100 randomised Pfam protein domains, and is used to explore a set of late embryogenesis abundant (LEA) proteins.

A paper discussing The POPPs was presented at the Tenth International Symposium on Intelligent Systems in Molecular Biology (Edmonton, Canada, Aug 3-7, 2002). A preprint can be found here.

The system is available under license free-of-charge to non-commecial/academic users. The license agreement may be found here. Researchers from the commercial domain wishing to obtain the suite should use this license.

The Meaning of Low Complexity Peptides

Low complexity proteins are thus named because their sequences are very non-random, i.e. their sequences display significant regularities. Low complexity proteins are widely used in cells to create extended/structural proteins and protein domains such as collagen, keratin and spider dragline filament, as opposed to the more widely studied globular proteins. Biological studies have also implicated low complexity sequences in several disease processes, e.g. polyglutamine stutters found in a number of neuropathologies such as Huntington's disease and the spinocerebrellar ataxias or the repeats evident in the Prion protein.

Earlier computer-based studies suffered from the fact that the algorithms employed in the software tools were fairly crude - albeit very efficient, so this project aims to construct novel software tool which will far more accurately demarcate low complexity proteins than has been possible to date, and then to deploy the tool to undertake a systematic examination of the world of low complexity protein domains.

Early in-silico studies by Wootton found that low complexity proteins are involved in development and in DNA binding. Preliminary studies I have undertaken have considerably expanded this work. Firstly, the distribution of repeated peptides across species appears to be skewed, with eukaryotes significantly favouring them, and prokaryotes and viruses avoiding them, the exception being some prokaryotes and viruses that parasitise eukaryotes. (This finding is, in part, similar to a conclusion drawn by Marcotte et al. that protein repeats are more prevalent in eukaryotic proteins when compared with prokaryotic proteins.) Secondly, the broad class of repeated peptides can be broken into two subgroups: tandem and almost-tandem repeats (involving different amino acids) and stutters (i.e. poly-amino-acid peptides). The preliminary work has revealed that repeats are associated with the keywords: antigen, surface, collagen and proteoglycan, while stutters are associated with nuclear proteins, DNA binding, transcription and development. A report on the preliminary work was presented at the Ninth International Conference on Intelligent Systems in Molecular Biology (Copenhagen, July 21-25, 2001) whose proceedings also appeared as a special issue of the journal Bioinformatics. A preprint can be found here.

The system is available under license free-of-charge to non-commecial/academic users. The license agreement may be found here. Researchers from the commercial domain wishing to obtain the suite should use this license.

Microbial Community Composition and Flux from Restriction Fragment Data

A suite of databases and programs is being developed which analyse the data generated by restriction endonuclease fragment analysis of PCR amplified genes. Using these programs biologists will be able to study both the structure of, and change in microbial communities. Specifically, these programs use data generated by a recently developed fingerprinting technique: Terminal-Restriction Fragment Length Polymorphism (T-RFLP) analysis. T-RFLP uses fluorescent automated sequencing technology to generate a profile of restriction fragment lengths from a particular gene of interest, e.g. Small Subunit rRNA, drawn from creatures in a particular microbial community and amplified using the polymerase chain reaction. The actual size of each fragment is calculated by reference to a fluorescently labelled internal standard.

A novel software system, TRUFFLER, has been developed to mimic this process in silico. Because a given combination of forward and reverse primers and restriction endonuclease can yield identical tail restriction fragment lengths across a number of species, combinations of different endonucleases (and/or primers) are typically used. In addition to fragment length data, data on fluorescence levels is also available. Computationally, this can be viewed as a constraint satisfaction problem which can be solved to allow identification of the dominant members of the microbial community, often down to individual species or at least genus level, and their relative proportions.

This work is being carried out together with Dr Mark Osborn, Department of Biological Sciences, University of Essex. A paper describing TRUFFLER has been accepted for the 2nd IEEE International Symposium on Bioinformatics and Bioengineering to be held in Bethesda, Maryland, USA from November 4-6, 2001. A preprint may be found here

Simulation of Metabolic Pathways

DiMSim is a novel methodology for simulating the flow of metabolites through networks of interacting metabolic reactions. Much as biochemists do, DiMSim views the reactions involved in metabolic processes as a graph and uses this as the basis of a simulation system. In a nutshell, each reaction is viewed as a simple automaton which "fires" when the complete set of inputs are available producing outputs which are consumed by other reactions, and so on. (The choice of "inputs" over "substrates" is significant; reactions are typically reversible, so which molecules are substrate and which are product can change, while enzymes can be thought of as being both.) The DiMSim approach is in marked contrast to other metabolic simulation systems which are based on systems of differential equations. The benefits of the approach adopted by DiMSim are:

How would I use DiMSim?

DiMSim promises to be particularly effecting in helping users understand (and though that intervene) in complex biological systems involving multiple interacting pathways. Once comprehensive models for pathways of interest have been built, experiments can be carried out which may not even be possible in the physical system. For example, if multiple reactions produce a metabolite into a single pool, with other reactions are consuming that metabolite, it is well nigh impossible to determine the relative contributions by examining the flux of that metabolite in the pool. In DiMSim, this can be done quite easily. On the other hand, to model how metabolites on different pathways react to a particular condition, multiple charts can be set up to track each versus time, or pairwise, each against a common-denominator metabolite.

Another situation is where an intermediate reaction is not known. In this case one possibility would be to create a notional reaction in that place and then tune the parameters via experimental data. Extending this methodology, where a pathway is suspected via data from the literature (or sources such as BioCyc or Kegg), or via mRNA expression-array data, a framework-pathway can be created and then refined as kinetic data becomes available, most importantly turnover numbers and K eq.

A paper describing DiMSim appeared in: Xiao-Qin Xia and Michael J. Wise, ``DiMSim: A Discrete-Event Simulator of Metabolic Networks'', Journal of Chemical Information and Computer Science, 43:1011-1019, 2003. Unfortunately, the American Chemical Society does not permit authors to post preprints on their web pages. If you or your institution has a subscription, the URL of the article is http://pubs.acs.org/cgi-bin/article.cgi/jcisd8/2003/43/i03/html/ci025650w.html

DiMSim is available under license free-of-charge to non-commecial/academic users. The license agreement may be found here. Researchers from the commercial domain wishing to obtain the suite should use this

Python Protein Annotators' Assistant

When a possible new protein has been identified, e.g. a putative protein coding region (or "ORF") in a newly sequenced genome, a similarity search is made of protein databases to see if the protein has already been described. If it has already been described, the story ends there, but if it has not, the existence of closely related proteins (e.g. from related organisms) may convey some information about the new one. What similarity searches return are lists of identifiers, each identifier denoting a sequence in a database together with a short title. This level of information may be sufficient to characterize the new protein. However, for more distantly related proteins the link between the query protein and listed proteins may not be obvious. For example, proteins often need to perform several functions, each of which is encoded in one portion of the the amino-acids. It is therefore possible that two proteins are distantly related, not because their overall functions are related, but because they have one or more intermediate functions in common. (However, even identifying these intermediates can provide valuable information.)

In this project, a software tool has been developed which, given a list of protein identifiers, e.g. as returned by a BLAST or FASTA search, clusters the identifiers around keywords and phrases that might indicate the functions performed by the protein that was used in the original search query. A PAA server is now in place at www.ebi.ac.uk/paa. The system is being developed in Python because of the far greater programmer productivity available from software development done in Python. A paper providing an overview of PAA from a biologists point of view appeared in Trends in Biochemical Systems (TiBS), 25:252-253, May 2000. A preprint may be found here. A preprint of longer, more computer science oriented discussion of PAA, which appeared in the October 2000 issue of the Journal of the American Society for Information Systems (JASIS), may be found here.

PAA is available under license free-of-charge to non-commecial/academic users. The license agreement may be found here. Researchers from the commercial domain wishing to obtain the suite should use this

Alignment Algorithms Revisited

For several years, the Smith-Waterman local alignment algorithm, whether implemented directly (e.g. ssearch3) or approximately (e.g. BLAST or FASTA), has been the algorithm of choice for protein database searches, because it is more likely to find distant homologues and because it produces more biologically meaningful alignments than the global alignment methods that preceded it. This study reexamines the efficacy of local versus global alignment algorithm to answer a different, somewhat weaker question than the direct search for homologues: which of a range of global and local alignment algorithms is most likely to predict whether two proteins share a common domain based on structure, structure-class, or function, perhaps even through convergent evolution. Among other results, evidence is found for the proposition that the Smith-Waterman algorithm is most effective when two proteins have a common domain or have the same function, i.e. are probably homologous. However, when only weak relationships exist, global methods appear at least as effective as local ones. Global methods are shown to provide a somewhat different point of view to local methods, and can therefore be used in addition to local methods to improve search accuracy, even when higher level matches are possible. A paper describing this work is to appear in the European Control Conference ECC 2003, Cambridge, U.K., September 1-4, 2003. A longer preprint can be found here.

Characterizing the "Best" Proteases for Protein-Mass Fingerprinting

At present, given a range of possible proteases such as trypsin, chymostrypsin or cyanogen bromide, and an unknown protein which is to be characterised, more or less ad-hoc methods are used to select the proteases that will be applied to the protein prior to protein-mass fingerprinting. This study set out to discover which protease, or set of proteases, are "best" at characterizing an unknown protein, e.g. by requiring fewer digests of the source protein. A lengthy paper describing this work appeared as: ``Peptide-mass Fingerprinting and the Ideal Covering Set for Protein Characterisation'', Michael J. Wise, Tim Littlejohn and Ian Humphery-Smith, Electrophoresis, 18(8), 1997. A shorter description of this project, emphasizing more the algorithmics, appeared in Fifth International Conference on Intelligent Systems for Molecular Biology, Halkidiki, Greece, 1997, and can be found here.

Neweyes: A System for Comparing Biological Sequences Using the Running Karp-Rabin Greedy String-Tiling Algorithm

Neweyes is a software system for comparing a DNA or protein sequence with another sequence, or with sequences from one of the standard databases such as SwissProt (a potein database) or EMBL (a DNA database). While the sizes of these database is constantly growing, SwissProt holds around 42,000 sequences covering approximately 14,200,000 amino acids; EMBL holds around 170,000 sequences covering approximately 180,000,000 nucleotides. Searching these databases presents considerable challenges. Neweyes employs an efficient and novel string matching algorithm, Running Karp-Rabin Greedy String-Tiling, which is able to detect transposed substrings, something that the systems currently in use find very difficult. In practice, the RKR-GST algorithm has computational complexity that appears close to linear. For a more complete description of Neweyes see http://www.pam1.bcs.uwa.edu.au/~michaelw/ftp/doc/neweyes.ps, which appeared in the Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, 1995. For a lengthier discussion of Running Karp-Rabin, Greedy String Tiling see also the unpublished paper this unpublished paper.

Python/Tk Viewer for the NCBI Taxonomy Database

A viewer for the NCBI taxonomy database, written in Python/Tk, was developed in 1998 by two VERY talented first year students, Tanya Golubchik and Stephen Graham (part of the Sydney University, Science Faculty Talented Students Program). Source code is here.