University of Houston

Department of Computer Science

In Partial Fulfillment of the Requirements for the Degree of
Master of Science

Ricardo E. Mauricio

Will defend his thesis


Towards a Fault Tolerant OpenSHMEM

Abstract

The ever-growing computational requirements of scientific and engineering applications continue to drive new developments in High Performance Computing (HPC). These developments are challenged, however, because the high failure rates of current parallel computers put the scalability of HPC applications at risk.

Presently, many applications are designed to run for days or longer on even the fastest computers, in spite of the high probability of failure attributed to the growth in system size and complexity. This level of unreliability requires fault-tolerance mechanisms to ensure the timely and correct completion of applications. However, despite a high degree of optimization, existing techniques do not meet the challenges posed by large-scale machines.

This thesis explores failures in parallel computing systems and how they affect HPC applications. It also discusses potential enhancements to the OpenSHMEM specification that would enable the development of fault-tolerant applications. OpenSHMEM is an effort to create a standard API for the development of portable and highly scalable Partitioned Global Address Space (PGAS) applications.

Date: Thursday, April 19, 2012
Time: 03:00 PM
Place: 550-PGH

Faculty, students, and the general public are invited.
Advisor: Dr. Barbara Chapman