[Defense] Efficient Machine Learning on Data Science Languages with Data Summarization
Wednesday, May 19, 2021
10:30 am - 12:00 pm
In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
Sikder Tahsin Al Amin
will defend his proposal
Efficient Machine Learning on Data Science Languages with Data Summarization
Abstract
Nowadays, data science analysts prefer “easy” high-level languages such as R and Python for machine learning computation, but these languages have memory and speed limitations. Scalability is a further issue as the data set size grows. On the other hand, machine learning algorithms can be accelerated with data summarization, which has been a fundamental technique in data mining. With these motivations in mind, we present an efficient way to compute statistical and machine learning models with parallel data summarization that works with popular data science languages. The summarization produces one or multiple summaries, accelerates a broad class of statistical and machine learning models, and requires a small amount of RAM. The algorithm works in three phases and can handle data sets bigger than main memory. Our solution evaluates a vector-vector outer product to escape the bottleneck of high-level programming languages. The experimental evaluation uses a prototype in R and Python in which the summarization is programmed in C++. The experiments show that the solution can work on both data subsets and the full data set without any performance penalty. The solution is also evaluated on a single machine and in parallel. On a single machine, it has an edge over R and is competitive with Python. In parallel, it is faster than other parallel big data systems: Spark (with the Spark-MLlib library) and a parallel DBMS (a similar approach implemented with UDFs and SQL queries). The solution is simpler than Spark and mostly faster, depending on how the data set is stored, and it is much faster than a parallel DBMS regardless of how the data set is stored.
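As an illustration of the idea only: the abstract does not spell out the form of the summary, so the sketch below assumes a common choice of sufficient statistics, namely the row count n, the linear sum L of the points, and the quadratic sum Q of their vector-vector outer products. The function names `summarize_chunks` and `mean_and_covariance` are hypothetical, and the loop is written in plain Python with NumPy for readability, whereas the abstract states that the actual summarization is programmed in C++. The data is read chunk by chunk, so only one chunk and the small summary need to be in memory at a time.

```python
import numpy as np

def summarize_chunks(chunks, d):
    """Accumulate the assumed summary (n, L, Q) over chunks of d-dimensional rows."""
    n = 0
    L = np.zeros(d)        # linear sum of the points
    Q = np.zeros((d, d))   # sum of vector-vector outer products x x^T
    for chunk in chunks:
        for x in chunk:
            n += 1
            L += x
            Q += np.outer(x, x)
    return n, L, Q

def mean_and_covariance(n, L, Q):
    """Derive the mean and the population covariance from the summary alone."""
    mu = L / n
    cov = Q / n - np.outer(mu, mu)
    return mu, cov

# Synthetic data, split into chunks to mimic reading a large file in blocks.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
chunks = np.array_split(X, 10)

n, L, Q = summarize_chunks(chunks, d=3)
mu, cov = mean_and_covariance(n, L, Q)
```

Once the summary is built, the model statistics are computed from n, L, and Q without another pass over the data, which is the kind of acceleration the abstract describes. Summaries from separate chunks can also be added together, which is one reason this form lends itself to parallel computation.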
Wednesday, May 19, 2021
10:30AM - 12:00PM CT
Online via MS Teams
Dr. Carlos Ordonez, dissertation advisor
Faculty, students and the general public are invited.
