José Carlos Almeida Santos,

Imperial College London

Abstract:

This presentation shows how data mining techniques, in particular machine learning, can be applied to discover knowledge in a protein database.
The main problem we address is determining whether an amino acid is exposed or buried in a protein at five exposure thresholds: 2%, 10%, 20%, 25% and 30%.
First we introduce the baseline classifier for this problem which, although very simple (it takes into account only the amino acid type), already achieves good prediction results. Then we explain how, by building a local PDB database and retrieving DSSP and SCOP data, we construct our classifier to improve on the baseline prediction.
Finally we test and compare several classifiers (Neural Networks, C5.0, CART and CHAID) and the parameters that might influence prediction accuracy, namely the level of information per amino acid, the SCOP class of the protein and the neighbourhood of the current amino acid (i.e., the sliding window size).
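
As a rough illustration of two ideas mentioned in the abstract, the sketch below shows a baseline that predicts exposed/buried from the amino acid type alone, and a sliding window that gathers each residue's neighbourhood. It is a minimal sketch, not the classifier presented in the talk. The function names (`label`, `train_baseline`, `windows`), the rule that a residue counts as exposed when its relative solvent accessibility is at or above the threshold, the `-` padding character, and the toy accessibility values are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Exposure thresholds in percent, as listed in the abstract.
THRESHOLDS = (2, 10, 20, 25, 30)


def label(rsa_percent, threshold):
    """Binary label for one residue.

    Assumption: "exposed" means relative solvent accessibility >= threshold.
    """
    return "exposed" if rsa_percent >= threshold else "buried"


def train_baseline(residues, threshold):
    """Map each amino acid type to its most frequent label in the training data.

    `residues` is an iterable of (amino_acid_letter, rsa_percent) pairs.
    """
    counts = defaultdict(Counter)
    for amino_acid, rsa_percent in residues:
        counts[amino_acid][label(rsa_percent, threshold)] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def windows(sequence, size, pad="-"):
    """Yield, for each position, the residue and its neighbours in a centred window.

    `size` is expected to be odd; sequence ends are padded with `pad`.
    """
    half = size // 2
    padded = pad * half + sequence + pad * half
    for i in range(len(sequence)):
        yield padded[i:i + size]


# Toy data, invented for illustration only (not taken from PDB or DSSP).
toy = [("L", 1.0), ("L", 5.0), ("L", 30.0),
       ("K", 40.0), ("K", 55.0), ("K", 8.0)]

model_20 = train_baseline(toy, threshold=20)
neighbourhoods = list(windows("ACD", size=3))
```

With this toy data and a 20% threshold, the baseline assigns `buried` to `L` and `exposed` to `K`. The window strings could then serve as input features for any of the classifiers compared in the talk, with `size` playing the role of the sliding window size.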

Keywords: Amino Acid Relative Solvent Accessibility, Protein Structure Prediction, Data Mining, Bioinformatics, Artificial Intelligence

 

Date: 2007-Feb-01     Time: 16:00:00     Room: 336

