Department of Computer Science at UH

University of Houston

Department of Computer Science

In Partial Fulfillment of the Requirements for the Degree of
Master of Science

Chong Wang

Will defend his thesis

Protein Clustering and Function Identification


The characteristics of proteins are usually studied and identified through biological or medical experiments. Unfortunately, determining protein characteristics experimentally is expensive and time-consuming. To achieve high-throughput processing of biological information, scientists have recently adopted high-speed computers and computer science techniques to assist their biological studies. This research investigates the challenges and solutions of using advanced data mining and clustering techniques for high-throughput protein analysis. Furthermore, we study the similarity of the information and knowledge extracted from three popular types of protein datasets: protein sequence alignments, protein structure comparisons, and protein-protein interactions (PPI). It is generally accepted that protein sequence, protein structure, and protein interactions can each be used to determine protein characteristics and to infer the shared biochemical functions of unknown proteins. We believe our effort is the first attempt at integrating information from these three types of protein datasets. Our results show the potential of using all three datasets to gain insight into protein function and relationships, which lays the foundation for a potential new direction in bioinformatics.
Because integrated protein datasets were not available, we had to create our own datasets to analyze the effect of clustering on each type of protein data. Proximal evolutionary distances, computed from E-values generated by the Basic Local Alignment Search Tool (BLAST), are proposed as protein sequence distances. Root-mean-square deviation (RMSD) values are proposed as protein structure distances. The shared nearest neighbor (SNN) density-based clustering algorithm is used to find clusters based on the results of the protein sequence alignments and structure comparisons. In the PPI clustering study, the molecular complex detection (MCODE) algorithm is applied to detect protein complexes. This research designs co-occurrence matrices to assess similarities between different clustering results and an integration analysis matrix to assess the similarity scores of pairs of proteins.
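
As a rough illustration of how a co-occurrence matrix can be used to compare two clustering results, the sketch below marks each pair of proteins that share a cluster and then scores the overlap between two such matrices. The abstract does not define the thesis's matrices or its similarity score, so everything here is an assumption: the function names, the protein IDs, the cluster labels, and the Jaccard-style agreement measure are hypothetical, and `None` is assumed to mark a protein that a density-based method left unclustered.

```python
def co_occurrence_matrix(proteins, labels):
    """Build an n x n 0/1 matrix; entry [i][j] is 1 when proteins i and j
    (i != j) carry the same cluster label. A label of None (assumed here to
    mean "unclustered") never co-occurs with anything."""
    n = len(proteins)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            li = labels.get(proteins[i])
            lj = labels.get(proteins[j])
            if i != j and li is not None and li == lj:
                matrix[i][j] = 1
    return matrix


def matrix_agreement(m1, m2):
    """Jaccard-style agreement (an assumed measure, not the thesis's):
    pairs co-clustered in both results divided by pairs co-clustered
    in at least one result."""
    both = either = 0
    n = len(m1)
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair once
            both += m1[i][j] & m2[i][j]
            either += m1[i][j] | m2[i][j]
    return both / either if either else 1.0


# Hypothetical data: four proteins clustered two different ways.
proteins = ["P1", "P2", "P3", "P4"]
sequence_clusters = {"P1": 0, "P2": 0, "P3": 1, "P4": 1}
structure_clusters = {"P1": "a", "P2": "a", "P3": "a", "P4": None}

seq_matrix = co_occurrence_matrix(proteins, sequence_clusters)
struct_matrix = co_occurrence_matrix(proteins, structure_clusters)
agreement = matrix_agreement(seq_matrix, struct_matrix)
print(agreement)
```

In this toy example only the pair (P1, P2) is grouped together by both clusterings, out of four pairs grouped by at least one of them. Comparing pairs rather than cluster labels sidesteps the fact that cluster IDs from different algorithms are not directly comparable.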


Date: Friday, April 20, 2012
Time: 02:30 PM
Place: 362-PGH

Faculty, students, and the general public are invited.
Advisor: Prof. Christoph F. Eick