In Partial Fulfillment of the Requirements for the Degree of
Master of Science
Will defend his thesis
The decreasing cost and increasing availability of DNA sequencing have made vast amounts of genetic sequence data available, and annotation has lagged behind as a result. Annotation of DNA sequences is the process of identifying and classifying sequences in a genome as, for example, a particular gene, a binding site for an important molecule such as a transcription factor, the promoter region of a gene, or a restriction enzyme recognition site. Annotation is an ongoing process, both for genomes that have already been annotated and for genomes that were unfinished at the time of their last annotation.
To speed up annotation, computational analysis can be used to spot patterns in DNA sequences that have statistically significant properties. Patterns detected this way may or may not have functional roles, but the analysis gives researchers a good set of candidate patterns for further investigation. In our work, we developed an algorithm that efficiently encodes patterns with gaps in a DNA sequence to facilitate the enumeration of all such patterns, and we used this enumeration to detect some well-known patterns whose prevalence or rarity suggests that they may have functional roles. The encoding may have other applications in the development of sequence analysis tools.
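As a rough illustration of what encoding and enumerating patterns with gaps can look like, the following is a minimal sketch and not the algorithm developed in this work. It assumes a pattern space defined by a single mask string in which `1` marks a fixed position and `0` marks a gap, and it packs the fixed bases into an integer at two bits per base. The names `encode`, `decode`, `count_gapped_patterns`, and the mask convention are all hypothetical choices made for this example.

```python
from collections import Counter

# Two-bit code per base; this particular mapping is an assumption of the sketch.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode(window, mask):
    """Pack the bases at the fixed ('1') positions of mask into one integer."""
    code = 0
    for base, m in zip(window, mask):
        if m == "1":
            code = (code << 2) | BASE_CODE[base]
    return code


def decode(code, mask):
    """Rebuild the pattern string from its integer code, writing '-' at gaps."""
    out = []
    # The last fixed base occupies the lowest two bits, so unpack right to left.
    for m in reversed(mask):
        if m == "1":
            out.append(CODE_BASE[code & 3])
            code >>= 2
        else:
            out.append("-")
    return "".join(reversed(out))


def count_gapped_patterns(sequence, mask):
    """Count every gapped pattern of the given mask shape in the sequence."""
    counts = Counter()
    width = len(mask)
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        # For simplicity, skip any window containing a non-ACGT character.
        if all(base in BASE_CODE for base in window):
            counts[encode(window, mask)] += 1
    return counts


if __name__ == "__main__":
    counts = count_gapped_patterns("ACGTACGT", "101")
    for code, n in counts.most_common():
        print(decode(code, "101"), n)
```

Because each pattern becomes a small integer, counts can be kept in a dictionary or a flat array indexed by the code, and the most and least numerous patterns can be read off by sorting. A fuller pattern space (several masks, variable gap lengths) would need the mask itself to be part of the key.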
The algorithm, together with supporting statistical analysis, successfully detected some well-known patterns, demonstrating the usefulness of a scheme that encodes patterns with gaps while enumerating all patterns in a genome that satisfy the user’s pattern space criteria. The results include several examples of possibly important patterns. We present results on Escherichia coli and Vibrio cholerae. The Rho binding factor utilization site was the top pattern in a targeted search for similar patterns, while an exhaustive search placed the site in the top 1/3.2 million most numerous patterns. All of the rarest patterns encoded in Escherichia coli contain the tetramer ‘CTAG’, which is well known to be rare in Escherichia coli.