University of Houston

Department of Computer Science

In Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

Carlos Garcia-Alvarado

Will defend his dissertation proposal

Integrating and Querying Information in Relational Databases and Interrelated Documents

Abstract

The proliferation of cheap storage and more efficient CPUs has been accompanied by the creation of autonomous, heterogeneous, semistructured data sources. These large heterogeneous sources are difficult to query and explore, even when they share a common origin in a structured source. Multiple solutions have been proposed that attempt to manage the data resulting from this pervasive integration problem via ad hoc systems. Our research defends the idea that integrating and managing a collection of semistructured data together with a central database repository (structured data) within a database management system (DBMS) is efficient for medium-sized collections and allows complex querying and knowledge discovery. To perform this combined querying, we present several data layouts and algorithms for extracting hidden links, at different granularity levels, between the metadata (table names and column names) and content (records) in a DBMS and the keywords within a corpus of heterogeneous sources (e.g., documents, source code, and spreadsheets). These algorithms focus on efficiently creating and managing summarization tables and keyword-matching routines using standard SQL queries and extensibility features of the DBMS (e.g., User-Defined Functions). Ultimately, these links can be ranked, explored, and queried by several search algorithms that we introduce. As a result, we extend relational queries to handle documents and provide a complexity analysis of the proposed algorithms. Furthermore, we present additional knowledge discovery techniques (stream clustering and ontology extraction) within the DBMS to explore and manage these common unstructured sources.
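
As a rough illustration of the kind of link extraction the abstract describes, the sketch below matches document keywords against table names, column names, and record values using plain SQL, and stores the results in a summarization table. It is not the author's algorithm: the schema (customer, doc_keyword, link), the function names, the sample data, and the use of SQLite with exact case-insensitive matching are all assumptions made for the example.

```python
import sqlite3


def find_links(conn):
    """Fill a hypothetical 'link' summarization table and return its rows.

    Assumes a 'doc_keyword(doc, keyword)' table already holds the keywords
    extracted from each document. Table and column names are read from the
    SQLite catalog, so they are interpolated into the SQL text directly.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE link (doc TEXT, tbl TEXT, col TEXT, level TEXT, freq INTEGER)"
    )
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT IN ('doc_keyword', 'link')")]
    for tbl in tables:
        cols = [r[1] for r in cur.execute(f"PRAGMA table_info({tbl})")]
        # Metadata level: a keyword equals the table name.
        cur.execute(
            "INSERT INTO link SELECT doc, ?, NULL, 'table', COUNT(*) "
            "FROM doc_keyword WHERE lower(keyword) = lower(?) GROUP BY doc",
            (tbl, tbl))
        for col in cols:
            # Metadata level: a keyword equals a column name.
            cur.execute(
                "INSERT INTO link SELECT doc, ?, ?, 'column', COUNT(*) "
                "FROM doc_keyword WHERE lower(keyword) = lower(?) GROUP BY doc",
                (tbl, col, col))
            # Content level: a keyword equals a value stored in this column.
            cur.execute(
                "INSERT INTO link SELECT k.doc, ?, ?, 'record', COUNT(*) "
                f"FROM doc_keyword k JOIN {tbl} t "
                f"ON lower(k.keyword) = lower(CAST(t.{col} AS TEXT)) "
                "GROUP BY k.doc",
                (tbl, col))
    return cur.execute(
        "SELECT doc, tbl, col, level, freq FROM link "
        "ORDER BY doc, level, tbl, col").fetchall()


def demo():
    """Build a tiny in-memory database (sample data is invented) and link it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        INSERT INTO customer VALUES (1, 'Alice', 'Houston'), (2, 'Bob', 'Austin');
        CREATE TABLE doc_keyword (doc TEXT, keyword TEXT);
        INSERT INTO doc_keyword VALUES
            ('report.txt', 'customer'), ('report.txt', 'houston'),
            ('report.txt', 'houston'), ('load.py', 'city'), ('load.py', 'name');
    """)
    return find_links(conn)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Each row of the link table records one document, the table and column it touches, the granularity level of the match (table, column, or record), and how often the match occurred. Such frequencies could serve as one simple input to ranking; approximate matching or a User-Defined Function would be needed to go beyond exact keyword equality.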

Date: Tuesday, November 22, 2011
Time: 1:00 PM
Place: 501-PGH (Chairman's office)
Faculty, students, and the general public are invited.
Advisor: Dr. Carlos Ordonez