BIOINFORMATICS
What is Bioinformatics?
Bioinformatics is the study of molecular biology as an
information science.
Specifically, it studies the application of computers
to the understanding and, eventually, manipulation of the molecules of life.
Over the past decade, but particularly over the past ten years, there
has been an explosion of work being carried out at the interface
between computing and molecular biology.
Part of the motivation for this new interest is the large (and growing) amount
of information coming from laboratories, including those working on
the human genome and related projects, which can only be managed
by use of computer technology.
However, computers also offer a range of tools and analyses
that were not possible in the past.
Experiments are done "in silico" whose predictions are subsequently tested by
bench biologists.
From the point of view of the computer scientist,
bioinformatics presents a range of extremely interesting and
very challenging problems.
These problems are generally very machine intensive.
International Society for Computational Biology
Established in 1997, the International Society for Computational Biology is
dedicated to advancing the scientific understanding of
living systems through computation; its emphasis is on
the role of computing and informatics in advancing
molecular biology. The Society works to achieve these
aims through its sponsorship of meetings, principally: Intelligent
Systems in Molecular Biology (ISMB) Pacific
Symposium on Bioinformatics (PSB) and
Recomb, and through.
its affiliated journals (Bioinformatics
and Journal of Computational Biology).
It also provides a number of other benefits of members.
The Society can also be reached via its web site:
www.iscb.org.
Wise Group Research in Bioinformatics
Research in bioinformatics/computational biology has been active
in a group led by Assoc. Prof. Michael Wise
since the start of 1993, first in the Department of Computer Science,
Sydney University (latterly renamed
School of Information Technologies),
then, between 1999 and mid 2004 at
Pembroke College, Cambridge
and Department of Genetics,
University of Cambridge.
Since August 2004 the Wise Laboratory has moved to the
School of Biomedical and
Chemical Sciences, University of
Western Australia.
Conspectus of Bioinformatics Projects
A number of projects are currently underway or have been completed within
the Bioinformatics Research group.
Analysis of Proteins by Peptide Composition
The Meaning of Low Complexity Peptides
Recognizing Microbial Communities using Restriction Fragments
Simulation of Metabolic Pathways
Python Protein Annotators' Assistant
Alignment Algorithms Revisited
Characterizing the 'Best' Proteases for Protein-Mass Fingerprinting
Neweyes: A System for Comparing Biological Sequences Using the Running
Python/Tk Viewer for the NCBI Taxonomy Database
Analysis of Proteins by Peptide Composition
The method of choice for most biologists faced with protein sequence data
is to compare their sequences against those in a protein database such as
SwissProt using the Smith-Waterman algorithm, e.g. Scanps
or approximations to the Smith-Waterman algorithm, such as BLAST.
If several closely related sequences are known, more sensitive searches
can be undertaking by first constructing a multiple-sequence alignment,
e.g. using Clustal-W.
While these search tools have become very powerful,
for each method there are caveats that users need to be aware
of in order to make best use of the most appropriate tool for the data
they have, and the scientific question they wish to address.
An alternative to these (with its own strength and weaknesses)
is The POPPs, a suite of inter-related software tools which allow the user
to discover what is statistically "unusual" in the composition of an
unknown protein, or to automatically cluster proteins into families
based on peptide composition.
In addition, the user can search for related proteins based on
peptide composition.
Techniques based on peptide composition are effective because protein
sequences are far from random, particularly in portions of the proteins
such as catalytic domains that are under strong conservation pressure.
The POPPs refines such comparisons by basing them peptide probabilities
rather than raw counts, thus removing biases due to
the inherent likelihood of certain peptides over others.
Statistically based peptide composition also provides a view
of proteins that is, to some extent, orthogonal to that provided
by sequence.
In a proof-of-concept study, The POPP suite is able to regroup into
their families sets of approximately 100 randomised
Pfam protein domains, and
is used to explore a set of late embryogenesis
abundant (LEA) proteins.
A paper discussing The POPPs was presented at the Tenth International
Symposium on Intelligent Systems in Molecular Biology
(Edmonton, Canada, Aug 3-7, 2002).
A preprint can be found here.
The system is available under license free-of-charge to non-commecial/academic users.
The license agreement may be found
here.
Researchers from the commercial domain wishing to obtain the suite
should use
this
license.
The Meaning of Low Complexity Peptides
Low complexity proteins are thus named because their sequences are very
non-random, i.e. their sequences display significant regularities.
Low complexity proteins are widely used in cells to create extended/structural
proteins and protein domains such as collagen, keratin and spider dragline
filament, as opposed to the more widely studied globular proteins. Biological
studies have also implicated low complexity sequences in several disease
processes, e.g. polyglutamine stutters found in a number of neuropathologies
such as Huntington's disease and the spinocerebrellar ataxias
or the repeats evident in the Prion protein.
Earlier computer-based studies suffered from the
fact that the algorithms employed in the software tools were fairly crude
- albeit very efficient, so this project aims to construct novel software
tool which will far more accurately demarcate low complexity proteins than
has been possible to date, and then to deploy the tool to undertake a systematic
examination of the world of low complexity protein domains.
Early in-silico studies by Wootton found that low complexity proteins are involved
in development and in DNA binding.
Preliminary studies I have undertaken have considerably expanded this work. Firstly, the
distribution of repeated peptides across species appears to be skewed, with
eukaryotes significantly favouring them, and prokaryotes and viruses avoiding
them, the exception being some prokaryotes and viruses that parasitise eukaryotes.
(This finding is, in part, similar to a conclusion drawn by Marcotte et al.
that protein repeats are more prevalent in eukaryotic proteins
when compared with prokaryotic proteins.) Secondly, the broad class of repeated
peptides can be broken into two subgroups: tandem and almost-tandem repeats
(involving different amino acids) and stutters (i.e. poly-amino-acid peptides).
The preliminary work has revealed that repeats are associated with the
keywords: antigen, surface, collagen and proteoglycan, while stutters are
associated with nuclear proteins, DNA binding, transcription and development.
A report on the preliminary work was presented at the Ninth International
Conference on Intelligent Systems in Molecular Biology (Copenhagen,
July 21-25, 2001) whose proceedings also appeared as a
special issue of the journal Bioinformatics.
A preprint can be found here.
The system is available under license free-of-charge to non-commecial/academic users.
The license agreement may be found
here.
Researchers from the commercial domain wishing to obtain the suite
should use
this
license.
Microbial Community Composition and Flux from Restriction Fragment Data
A suite of databases and programs is being developed which analyse the
data generated by restriction endonuclease fragment analysis of PCR
amplified genes.
Using these programs biologists will be able to study both the structure of,
and change in microbial communities.
Specifically, these programs use data generated by a recently developed
fingerprinting technique: Terminal-Restriction Fragment Length Polymorphism
(T-RFLP) analysis.
T-RFLP uses fluorescent automated sequencing technology to generate
a profile of restriction fragment lengths from a particular gene
of interest, e.g. Small Subunit rRNA,
drawn from creatures in a particular microbial community and
amplified using the polymerase chain reaction.
The actual size of each fragment is calculated by reference to a
fluorescently labelled internal standard.
A novel software system, TRUFFLER, has been developed to mimic this process
in silico.
Because a given combination of forward and
reverse primers and restriction endonuclease can yield identical
tail restriction fragment lengths
across a number of species, combinations of different endonucleases
(and/or primers) are typically used. In addition to fragment length data,
data on fluorescence levels is also available. Computationally, this can
be viewed as a constraint satisfaction problem which can be solved to
allow identification of the dominant members of the microbial
community, often down to individual species or at least genus level, and
their relative proportions.
This work is being carried out together with Dr Mark Osborn,
Department of Biological Sciences, University of Essex.
A paper describing TRUFFLER has been accepted for the
2nd IEEE International Symposium on Bioinformatics and
Bioengineering to be held in Bethesda, Maryland, USA from
November 4-6, 2001. A preprint may be found
here
Simulation of Metabolic Pathways
DiMSim is a novel methodology for simulating the flow of metabolites
through networks of interacting metabolic reactions. Much as biochemists
do, DiMSim views the reactions involved in metabolic processes as a
graph and uses this as the basis of a simulation system. In a nutshell,
each reaction is viewed as a simple automaton which "fires" when the
complete set of inputs are available producing outputs which are consumed
by other reactions, and so on. (The choice of "inputs" over "substrates"
is significant; reactions are typically reversible, so which molecules
are substrate and which are product can change, while enzymes can be thought
of as being both.) The DiMSim approach is in marked contrast to other
metabolic simulation systems which are based on systems of differential
equations. The benefits of the approach adopted by DiMSim are:
-
Because DiMSim is able to model down to single molecules, DiMSim
is better able to cope with the special characteristics of biochemical
systems, including reversible reactions and non-linear behaviour,
e.g. due to competition between reactions for limited quantities of
reactants or products, or non-linear behaviourdue to trace
concentrations or cascades. DiMSim is also able to model
membrane-bound compartments and the channels used to transpor
metabolites between them (both passive diffusion and active transport).
-
Branching and cyclic pathways are readily modelled in DiMSim,
as are different forms of inhibition, including competitive and
allosteric.
-
The inputs to a DiMSim pathway model, the underlying computational
model and therefore results that are produced are all understandable
and interpretable by biologists, because the system relies on common
biochemical parameters such as Km, Keq and turn-over numbers. The
outputs of the system are concentrations of metabolites of
interest, generally viewed as a chart. In fact, the DiMSim system
itself makes few assumptions so non Michaelis-Menten kinetics can
be modelled.
-
The DiMSim approach promises to be more scalable (able to grow
gracefully) than approaches based on systems of differential
equations, because each additional pathway has only local effects.
We also believe DiMSim will be less sensitive to changes
and errors in parameter values because flux is an emergent
property of the system (being the sum-total of the flows of
metabolites through reactions and therefore the output of
a process similar to numerical integration), rather than the property
that is being directly modelled.
How would I use DiMSim?
DiMSim promises to be particularly effecting in helping users understand
(and though that intervene) in complex biological systems involving
multiple interacting pathways. Once comprehensive models for pathways
of interest have been built, experiments can be carried out which may
not even be possible in the physical system. For example, if multiple
reactions produce a metabolite into a single pool, with other reactions
are consuming that metabolite, it is well nigh impossible to determine
the relative contributions by examining the flux of that metabolite
in the pool. In DiMSim, this can be done quite easily. On the other
hand, to model how metabolites on different pathways react to a
particular condition, multiple charts can be set up to track each
versus time, or pairwise, each against a common-denominator metabolite.
Another situation is where an intermediate reaction is not known. In
this case one possibility would be to create a notional reaction in
that place and then tune the parameters via experimental data. Extending
this methodology, where a pathway is suspected via data from the
literature (or sources such as BioCyc or Kegg), or via mRNA expression-array
data, a framework-pathway can be created and then refined as kinetic
data becomes available, most importantly turnover numbers and K eq.
A paper describing DiMSim appeared in:
Xiao-Qin Xia and Michael J. Wise, ``DiMSim: A Discrete-Event Simulator of Metabolic
Networks'', Journal of Chemical Information and Computer Science,
43:1011-1019, 2003.
Unfortunately, the American Chemical Society does not permit authors
to post preprints on their web pages.
If you or your institution has a subscription, the URL of the article
is
http://pubs.acs.org/cgi-bin/article.cgi/jcisd8/2003/43/i03/html/ci025650w.html
DiMSim is available under license free-of-charge to non-commecial/academic users.
The license agreement may be found
here.
Researchers from the commercial domain wishing to obtain the suite
should use
this
Python Protein Annotators' Assistant
When a possible new protein has been identified, e.g. a putative protein
coding region (or "ORF") in a newly sequenced
genome, a similarity search is made of protein databases to see if
the protein has already been described.
If it has already been described, the story ends there, but if it has not, the existence of
closely related proteins (e.g. from related organisms) may convey some
information about the new one.
What similarity searches return are lists of identifiers, each identifier denoting
a sequence in a database together with a short title.
This level of information may be sufficient to characterize the new protein.
However, for more distantly related proteins the link between the query protein
and listed proteins may not be obvious.
For example, proteins often need to perform several functions, each
of which is encoded in one portion of the the amino-acids.
It is therefore possible that two proteins are distantly related,
not because their overall functions are related, but because
they have one or more intermediate functions in common.
(However, even identifying these intermediates can provide valuable information.)
In this project, a software tool has been developed which,
given a list of protein identifiers, e.g. as returned by a
BLAST or FASTA search, clusters the identifiers around
keywords and phrases that might indicate the functions performed by the
protein that was used in the original search query.
A PAA server is now in place at www.ebi.ac.uk/paa.
The system is being developed in Python
because of the far greater programmer productivity available from software
development done in Python.
A paper providing an overview of PAA from a biologists point of view
appeared in Trends in Biochemical Systems (TiBS), 25:252-253, May 2000.
A preprint may be found here.
A preprint of longer, more computer science oriented discussion of PAA,
which appeared in the October 2000 issue of the Journal of the American
Society for Information Systems (JASIS), may be found
here.
PAA is available under license free-of-charge to non-commecial/academic users.
The license agreement may be found
here.
Researchers from the commercial domain wishing to obtain the suite
should use
this
Alignment Algorithms Revisited
For several years, the Smith-Waterman local alignment algorithm,
whether implemented directly (e.g. ssearch3) or approximately
(e.g. BLAST or FASTA), has been the algorithm of choice for protein
database searches, because it is more likely to find distant homologues
and because it produces more biologically meaningful alignments than
the global alignment methods that preceded it.
This study reexamines the efficacy of local versus global alignment
algorithm to answer a different, somewhat weaker question than the direct search
for homologues: which of a range
of global and local alignment algorithms is most likely to predict
whether two proteins share a common domain based on structure, structure-class,
or function, perhaps even through convergent evolution.
Among other results, evidence is found for the proposition that the
Smith-Waterman algorithm is most effective when two proteins have
a common domain or have the same function, i.e. are probably homologous.
However, when only weak relationships exist, global methods appear at least
as effective as local ones.
Global methods are shown to provide a somewhat different point of
view to local methods, and can therefore be used in addition to local methods to
improve search accuracy, even when higher level matches are possible.
A paper describing this work is to appear in the European Control Conference ECC 2003,
Cambridge, U.K., September 1-4, 2003.
A longer preprint can be found
here.
Characterizing the "Best" Proteases for Protein-Mass Fingerprinting
At present, given a range of possible proteases such as trypsin,
chymostrypsin or cyanogen bromide, and an unknown protein which
is to be characterised, more or less
ad-hoc methods are used to select the proteases that will be
applied to the protein prior to protein-mass fingerprinting.
This study set out to discover which protease, or set of proteases,
are "best" at characterizing an unknown protein, e.g.
by requiring fewer digests of the source protein.
A lengthy paper describing this work appeared as:
``Peptide-mass Fingerprinting and the Ideal Covering Set for Protein Characterisation'',
Michael J. Wise, Tim Littlejohn and Ian Humphery-Smith, Electrophoresis, 18(8), 1997.
A shorter description of this project, emphasizing more the algorithmics,
appeared in Fifth International Conference on Intelligent Systems for Molecular Biology,
Halkidiki, Greece, 1997, and can be found
here.
Neweyes: A System for Comparing Biological Sequences Using the Running
Karp-Rabin Greedy String-Tiling Algorithm
Neweyes is a software system for comparing a DNA or protein sequence with
another sequence, or with sequences from one of the standard databases such as
SwissProt (a potein database) or EMBL (a DNA database).
While the sizes of these database is constantly growing,
SwissProt holds around 42,000 sequences covering approximately 14,200,000 amino acids;
EMBL holds around 170,000 sequences covering approximately 180,000,000 nucleotides.
Searching these databases presents considerable challenges.
Neweyes employs an efficient and novel string matching
algorithm, Running Karp-Rabin Greedy String-Tiling, which is able to detect
transposed substrings, something that the systems currently in use find very
difficult. In practice, the RKR-GST algorithm has computational
complexity that appears close to linear.
For a more complete description of Neweyes see
http://www.pam1.bcs.uwa.edu.au/~michaelw/ftp/doc/neweyes.ps,
which appeared in the Third International Conference on Intelligent Systems
for Molecular Biology, Cambridge, 1995.
For a lengthier discussion of Running Karp-Rabin, Greedy
String Tiling see also the unpublished paper
this unpublished paper.
Python/Tk Viewer for the NCBI Taxonomy Database
A viewer for the NCBI taxonomy database, written in Python/Tk,
was developed in 1998 by two VERY talented first year students, Tanya Golubchik and
Stephen Graham (part of the
Sydney University, Science Faculty Talented Students Program).
Source code is here.