In Partial Fulfillment of the Requirements for the Degree of
Master of Science
Will defend his thesis
Principal Component Analysis (PCA) is one of the most common dimensionality reduction techniques, with broad applications in data mining, statistics and signal processing. PCA finds a new set of orthogonal dimensions, each a linear combination of the input dimensions, and projects the data onto a lower-dimensional space while preserving the variability present in the original data. Given its mathematical complexity, PCA has traditionally been computed outside the database system, forcing the user to export the data set. In this thesis we show it is feasible to solve PCA via Singular Value Decomposition (SVD) entirely inside a DBMS, without any external numerical analysis library. Our solution divides the computation into two phases: the first derives a correlation matrix, and the second solves SVD using the correlation matrix as input. With this approach our method can efficiently analyze a large data set in a single pass, eliminating the need to export it and allowing the user to exploit the extensive functionality of a DBMS (e.g. querying, security). To solve SVD inside the DBMS, we introduce two basic solutions: one based exclusively on SQL queries, and a second that uses User-Defined Functions for some key equations. Experimental evaluation shows our method can solve larger problems, and in less time, than state-of-the-art external statistical packages. In summary, our proposal extends a database system with PCA, a well-known and powerful data mining technique.
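The two-phase idea can be illustrated outside a DBMS with a minimal Python sketch. It is not the thesis implementation: the summary quantities n, L and Q, the function names, and the use of Jacobi rotations for the decomposition are all assumptions made for illustration. Phase one scans the rows once and keeps only sums; phase two works on the small d x d correlation matrix, for which the SVD of a symmetric positive semidefinite matrix coincides with its eigendecomposition.

```python
import math

def correlation_one_pass(rows):
    """Phase 1 (assumed formulation): one scan accumulates the row count n,
    the per-column sums L and the cross-product sums Q, from which the
    Pearson correlation matrix follows. Assumes no column is constant."""
    n, d, L, Q = 0, None, None, None
    for x in rows:
        if d is None:
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    rho = [[0.0] * d for _ in range(d)]
    for a in range(d):
        for b in range(d):
            num = n * Q[a][b] - L[a] * L[b]
            den = (math.sqrt(n * Q[a][a] - L[a] ** 2)
                   * math.sqrt(n * Q[b][b] - L[b] ** 2))
            rho[a][b] = num / den
    return rho

def jacobi_eigen(S, sweeps=50, tol=1e-20):
    """Phase 2 (illustrative choice): cyclic Jacobi rotations on a symmetric
    matrix. Returns eigenvalues in descending order and matching unit
    eigenvectors (the principal components)."""
    d = len(S)
    A = [row[:] for row in S]
    V = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
        if off < tol:
            break
        for p in range(d - 1):
            for q in range(p + 1, d):
                if abs(A[p][q]) < 1e-300:
                    continue
                # Angle that zeroes the (p, q) entry after rotation.
                theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(d):  # rotate columns p and q of A
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(d):  # rotate rows p and q of A
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(d):  # accumulate the rotation in V
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(d), key=lambda i: A[i][i], reverse=True)
    values = [A[i][i] for i in order]
    vectors = [[V[k][i] for k in range(d)] for i in order]
    return values, vectors

# Small made-up data set: 5 rows, 3 columns.
rows = [(1.0, 2.0, 1.0), (2.0, 4.0, 0.0), (3.0, 6.0, 2.0),
        (4.0, 8.5, 1.0), (5.0, 10.0, 3.0)]
rho = correlation_one_pass(rows)
values, vectors = jacobi_eigen(rho)
```

Because phase one retains only n, L and Q, its memory footprint depends on the number of dimensions d rather than on the number of rows, which is what makes a single scan of a large table sufficient in this sketch. Keeping the first k entries of `vectors` gives the directions for a k-dimensional projection.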