In Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Will give a preliminary defense of his dissertation
Most real-world applications that involve complex, data-intensive processing run outside the DBMS, on data exported from it. The DBMS itself offers limited support for complex computations and complex data types, and it restricts the size of input parameters and of returned results. Although some DBMSs provide extended programming capabilities, it is hard to express complex computations efficiently in SQL. We try to alleviate these limitations by giving the user greater control, by passing data through complex data types such as user-defined types, and by incrementally returning partially computed results. We show how to accelerate the computation of sufficient statistics on large data sets with user-defined functions (UDFs) that exploit caching and sampling techniques. Classification is one of the hardest problems in machine learning. We have built a Bayesian classifier based on class decomposition using the distance-based clustering algorithm K-Means, which requires many scans of the data set. Experiments on data sets from the UCI repository show that our classifier is more accurate and robust than Naive Bayes and decision trees, but it is also the slowest of the three. The repeated data scans become a major bottleneck on large data sets. For such data sets, partially computed results may be useful, or even sufficient, for many data-intensive applications. Hence, we propose a Bayesian classifier that incrementally updates its model as the table is scanned. The primary objective is to implement the Bayesian classification model entirely inside the database system while limiting the number of table scans.
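As a rough illustration of the incremental idea, the sketch below keeps per-class sufficient statistics (row count n, linear sum L, and sum of squares Q) and updates them one row at a time, so a usable model exists at any point during a single scan. It is not the dissertation's implementation: it runs in Python rather than inside a DBMS as a UDF, it fits one diagonal Gaussian per class instead of decomposing classes with K-Means, and the class name, method names, and sample rows are all hypothetical.

```python
import math
from collections import defaultdict


class IncrementalGaussianNB:
    """Hypothetical sketch: per-class sufficient statistics n, L, Q kept in one pass."""

    def __init__(self, num_dims):
        self.d = num_dims
        self.n = defaultdict(int)                       # rows seen per class
        self.L = defaultdict(lambda: [0.0] * num_dims)  # per-dimension sum of x
        self.Q = defaultdict(lambda: [0.0] * num_dims)  # per-dimension sum of x^2

    def update(self, x, label):
        """Fold one row into the statistics; no earlier row is revisited."""
        self.n[label] += 1
        L, Q = self.L[label], self.Q[label]
        for j, v in enumerate(x):
            L[j] += v
            Q[j] += v * v

    def predict(self, x, var_floor=1e-6):
        """Classify with the model as it stands after the rows seen so far."""
        total = sum(self.n.values())
        best_label, best_score = None, -math.inf
        for label, n in self.n.items():
            score = math.log(n / total)  # log prior of the class
            for j, v in enumerate(x):
                mean = self.L[label][j] / n
                # Variance from Q and L; the floor (an assumption) avoids division by zero.
                var = max(self.Q[label][j] / n - mean * mean, var_floor)
                score -= 0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up rows standing in for a table; each row is (feature vector, class label).
rows = [((1.0, 1.2), "a"), ((0.8, 1.0), "a"), ((5.0, 5.1), "b"), ((5.2, 4.9), "b")]
model = IncrementalGaussianNB(num_dims=2)
for x, label in rows:  # a single scan over the data
    model.update(x, label)
```

Because means and variances are derived from n, L, and Q alone, `predict` can be called after any prefix of the scan, which is what makes partially computed results usable before the scan finishes.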
Date: Monday, May 7, 2012
Time: 3:00 PM
Faculty, students, and the general public are invited.
Advisor: Dr. Carlos Ordonez