Protein function prediction using uncertainty
This BBSRC-funded bioinformatics project ran from Nov 2004 to Oct 2007, with Dr Chris Needham working alongside Dr Andy Bulpitt in the School of Computing and Dr James Bradford working alongside Prof David Westhead in the Faculty of Biological Sciences. The project applied machine learning techniques to a range of protein function prediction tasks. One particularly important aspect of dealing with biological data is modelling uncertainty, since the processes for measuring biological systems introduce noise into the data. For this reason, Bayesian approaches to learning were explored, in which distributions over model parameters are considered and marginalised over, rather than relying on point estimates of those parameters alone.
Protein-protein interfaces
Identifying the binding sites on a protein, and which proteins it binds to,
gives clues to the protein's function, since proteins need to bind to each other
in order to interact. This work developed a novel system for predicting binding
sites on protein surfaces. Incorporating a naive Bayes classifier into the
prediction scheme, to integrate information from diverse physicochemical
properties of interaction interfaces, increases performance.
Bradford, James R; Needham, Chris J; Bulpitt, Andrew J; Westhead, David R. Insights into protein-protein interfaces using a Bayesian network prediction method. Journal of Molecular Biology, vol. 362, pp. 365-386. 2006. doi:10.1016/j.jmb.2006.07.028
For predictions, the PPI-pred server is online.
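The idea of combining several interface properties with a naive Bayes classifier can be illustrated with a minimal sketch. This is not the PPI-pred implementation: the two features (hydrophobicity and conservation), the Gaussian likelihoods, the function names and the six training patches are all assumptions made up for illustration.

```python
import math

def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-feature Gaussian parameters for each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small floor on the variance avoids division by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-6)
            for col, m in zip(columns, means)
        ]
        model[cls] = (n / len(rows), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def posterior(model, row):
    """P(class | features), treating features as independent given the class."""
    log_joint = {
        cls: math.log(prior)
        + sum(log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
        for cls, (prior, means, variances) in model.items()
    }
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    unnorm = {cls: math.exp(lj - top) for cls, lj in log_joint.items()}
    total = sum(unnorm.values())
    return {cls: u / total for cls, u in unnorm.items()}

# Hypothetical surface patches: (hydrophobicity, conservation). Invented values.
patches = [(0.8, 0.9), (0.7, 0.8), (0.9, 0.7), (0.2, 0.3), (0.3, 0.2), (0.1, 0.4)]
labels = ["interface"] * 3 + ["non-interface"] * 3

model = fit_gaussian_nb(patches, labels)
result = posterior(model, (0.75, 0.85))
print(result)
```

Each property contributes its own likelihood term, so further properties can be added as extra columns without changing the rest of the scheme.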
Functional effects of mutations
Mutations in DNA, such as SNPs (single nucleotide polymorphisms) and missense
mutations (those SNPs which cause an amino acid change), may or may not affect
the function of a protein. Genetic mutations happen naturally, and there are
many differences between the genomes of individuals. However, identifying those
mutations which cause disease or alter the function of the protein is of
particular interest. This project analysed the important factors in predicting
functional effects, comparing predictions made when only structural or only
homology-based attributes are used. The suitability of different datasets for
making deleterious SNP predictions has also been investigated. This area is
expected to grow in importance as more human genomes are sequenced and analysed
for genetic variation.
Needham, Chris J; Bradford, James R; Bulpitt, Andrew J; Care, Matthew A; Westhead, David R. Predicting the effect of missense mutations on protein function: analysis with Bayesian networks. BMC Bioinformatics, vol. 7. 2006. doi:10.1186/1471-2105-7-405
Care, M A; Needham, C J; Bulpitt, A J; Westhead, D R. Deleterious SNP predictions: be mindful of your training data! Bioinformatics, vol. 23, pp. 664-672. 2007. doi:10.1093/bioinformatics/btl649
Protein function prediction
This work aims to describe the function of an unknown protein (gene product) by
predicting the Gene Ontology annotation for the protein. This is a huge
multi-class classification problem with potentially more possible classes
(descriptions) than data items. A Bayesian network is used whose structure
encodes a subset of the Gene Ontology description; the remaining network
structure is learned, giving a compact model which integrates information from
features derived from gene expression profiles, protein-protein interactions and
sequence motifs. Data from the plant Arabidopsis thaliana is used.
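As a rough illustration of how an ontology-shaped network structure constrains predictions, the sketch below performs inference by enumeration in a three-node network: a parent term, a child term, and one observed feature. It is not the project's model. The node names and all probabilities are invented, and the only domain fact encoded is the Gene Ontology convention that annotation with a term implies annotation with its ancestors.

```python
from itertools import product

# Invented parameters for a toy network: parent term -> child term -> feature.
P_PARENT = 0.3                                      # P(parent term applies)
P_CHILD_GIVEN_PARENT = {True: 0.4, False: 0.0}      # child cannot apply without its parent
P_EVIDENCE_GIVEN_CHILD = {True: 0.8, False: 0.1}    # P(feature observed | child)

def joint(parent, child, evidence):
    """Joint probability of one full assignment, factorised along the network."""
    p = P_PARENT if parent else 1 - P_PARENT
    pc = P_CHILD_GIVEN_PARENT[parent]
    p *= pc if child else 1 - pc
    pe = P_EVIDENCE_GIVEN_CHILD[child]
    p *= pe if evidence else 1 - pe
    return p

def posterior_given_evidence(evidence=True):
    """Return (P(parent | evidence), P(child | evidence)) by summing the joint."""
    states = [True, False]
    total = sum(joint(p, c, evidence) for p, c in product(states, repeat=2))
    p_parent = sum(joint(True, c, evidence) for c in states) / total
    p_child = sum(joint(p, True, evidence) for p in states) / total
    return p_parent, p_child

p_parent, p_child = posterior_given_evidence(True)
print(p_parent, p_child)
```

Because the child term has zero probability when the parent is absent, the posterior for the parent term can never fall below that of the child, so the predicted annotations stay consistent with the hierarchy.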
Bayesian networks
Alongside this research, we have written two primers in top international
journals introducing Bayesian networks in a biological sciences context. The
first concentrates on introducing the basics of inference, with a cell
signalling pathway example. The second contains a substantial review of
learning in Bayesian networks, and covers conditional independence of variables,
joint probability distributions, parameter learning, structure learning, and
Bayesian methods for calculating marginal likelihood by averaging over
distributions of parameters, rather than making point estimates. At ISMB 2006
(the Intelligent Systems for Molecular Biology conference), we gave a four-hour
tutorial on Bayesian networks for bioinformatics.
Needham, C J; Bradford, J R; Bulpitt, A J; Westhead, D R. A Primer on Learning in Bayesian Networks for Computational Biology. PLoS Computational Biology, vol. 3, pp. e129. 2007. doi:10.1371/journal.pcbi.0030129
Needham, C J; Bradford, J R; Bulpitt, A J; Westhead, D R. Inference in Bayesian networks. Nature Biotechnology, vol. 24, pp. 51-53. 2006. doi:10.1038/nbt0106-51
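The difference between averaging over a distribution of parameters and making a point estimate can be seen in the simplest possible case: a single binary variable with a Beta prior on its parameter. The sketch below is an illustration only, not code from the primers; the function names, the uniform Beta(1, 1) prior and the example counts are assumptions.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_likelihood(k, n, a=1.0, b=1.0):
    """log P(data) for one sequence with k successes in n trials,
    with theta ~ Beta(a, b) integrated out."""
    return log_beta(a + k, b + n - k) - log_beta(a, b)

def log_max_likelihood(k, n):
    """log P(data | theta_hat) at the maximum likelihood point estimate k / n."""
    theta = k / n
    ll = 0.0
    if k:
        ll += k * math.log(theta)
    if n - k:
        ll += (n - k) * math.log(1 - theta)
    return ll

def posterior_predictive(k, n, a=1.0, b=1.0):
    """P(next observation is a success | data), averaged over the posterior."""
    return (a + k) / (a + b + n)

k, n = 3, 4  # invented data: 3 successes in 4 trials
print(math.exp(log_marginal_likelihood(k, n)))  # averaged over theta
print(math.exp(log_max_likelihood(k, n)))       # best single theta
print(posterior_predictive(k, n), k / n)        # Bayesian vs point prediction
```

The marginal likelihood is never larger than the likelihood at the best single parameter value, because it averages over every value the prior allows. This built-in penalty for flexible models is what makes the marginal likelihood useful as a score when comparing network structures.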
Learning gene regulatory networks
Information on learning transcription networks from microarray data can be found
here.