In Partial Fulfillment of the Requirements for the Degree of
Master of Science
Will defend his thesis
Recently available technologies for Next Generation Sequencing (NGS) and post-sequencing analysis produce unprecedented amounts of data, typically ranging from hundreds of gigabytes to several terabytes. These massive data sets are often analyzed and used at locations other than where they were generated. Consequently, terabytes of NGS data need to be sent over the Internet: to be processed by special-purpose hardware that most bioinformatics researchers do not have on-site, to enable collaborative research by teams of investigators at different locations, to make efficient use of cloud computing, or simply to share data. For these reasons, scalable NGS data formatting and compression are research challenges of utmost importance. In particular, new methods need to be developed to perform computation and analysis efficiently on large sets of NGS data. These new methods should focus on efficient use of memory and running time, since these two resources are typically the main bottlenecks in NGS computations. This thesis discusses several file formats used to store large amounts of sequencing data, such as FASTA, FASTQ, and AS, and introduces a new format, named Compressed AS-format (CAS). The focus of our work is the time and memory performance of working with NGS data in CAS format compared with the existing formats. We also discuss the underlying algorithm used for data compression in CAS.
Date: Tuesday, April 24, 2012
Time: 2:30 PM
Faculty, students, and the general public are invited.
Advisor: Prof. Yuriy Fofanov