In Partial Fulfillment of the Requirements for the Degree of Master of Science
will defend his thesis
A Checkpointing/Restart Approach for OpenSHMEM Fault Tolerance
Partitioned Global Address Space (PGAS) programming models have emerged in recent years in parallel programming and High Performance Computing (HPC) and show great potential for scalability. Among the various PGAS languages and libraries, OpenSHMEM has gained popularity due to its easy-to-use APIs and high-performance one-sided communication. As the number of cores in current HPC systems grows, failures become more likely during the run time of an HPC program. A program without fault tolerance can get stuck in a failure-restart loop, which causes great uncertainty and wastes resources. Fault tolerance for long-running applications is critical to guard against failure of either compute resources or the network. Accomplishing this in software is non-trivial, and a one-sided communication library like OpenSHMEM adds a further level of complexity to implementing a working model.
In this thesis, we explore a fault tolerance scheme based on checkpoint and restart that caters to the one-sided nature of the PGAS programming model while leveraging features specific to OpenSHMEM.
Through a working implementation with a 1-D Jacobi code, we show that the approach is scalable and provides considerable savings in computational resources.
Date: Thursday, April 21, 2016
Time: 4:00 PM
Place: PGH 218
Advisor: Prof. Barbara Chapman
Faculty, students, and the general public are invited.