Thesis Defense - University of Houston

Thesis Defense

In Partial Fulfillment of the Requirements for the Degree of Master of Science

Saba Khan

will defend her thesis

A Parallel Implementation of the Pandas Framework


High performance is a desirable trait for today's applications. Companies large and small are migrating their serial applications to parallel versions to reduce execution time and increase efficiency. However, preparing a serial application for parallel processing is not a simple process. Pandas, a Python library that provides rich data structures and tools, is widely used in data science applications. However, the Pandas framework is built for single-core processing and cannot fully utilize multi-core processors or cluster technology. Because of this limitation, Pandas users are forced to look for other frameworks when working with large quantities of data. This thesis introduces a Parallel-Pandas library that makes parallelizing serial Pandas applications easy and transparent: Pandas users can upgrade existing applications by changing only a library import. The thesis details the design decisions and implementation of the Parallel-Pandas library. The library is evaluated with unit tests, microbenchmarks, and a real-world application run on different datasets. Parallel-Pandas is also compared with PySpark, a framework that provides parallelism by following the MapReduce structure. The results presented in this thesis show that the Parallel-Pandas library has promising potential and delivers performance close to that of manually parallelized and tuned applications.

Date: Tuesday, April 14, 2020
Time: 3:00 PM
Place: Online Presentation - MS Teams
Advisor: Dr. Edgar Gabriel

Faculty, students, and the general public are invited.