University of Houston

Department of Computer Science

In Partial Fulfillment of the Requirements for the Degree of
Master of Science

Carlos Garcia-Alvarado

Will defend his thesis

Linking and Exploring Database Metadata and Documents
with SQL and UDFs

Abstract

Structured and unstructured data have traditionally been studied separately, in database systems and information retrieval research, respectively. Search engine techniques, however, are inspiring new interest in integrating the two fields. Our research focuses on establishing links between a database and the documents surrounding it by exploiting the DBMS query language (SQL) and its extensibility features, such as User-Defined Functions (UDFs) and stored procedures. In this thesis, we first study how to adapt classical information retrieval techniques to work inside a DBMS using SQL queries and UDFs. We then study how to efficiently compute top-k queries that are particularly difficult to optimize in SQL. Finally, we study how to match keywords in documents with keywords in a database at different levels of storage granularity. Specifically, we study how to link metadata, such as table and column names, to documents based on exact matches. This matching process is carried out systematically in a bottom-up fashion. Experimental evaluation shows that our system can efficiently analyze and explore a digital library with thousands of documents.
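
The idea of linking metadata to documents by exact match, then ranking documents with a top-k query, can be illustrated with a minimal sketch. This is not the implementation from the thesis: it assumes SQLite as a stand-in DBMS, and the function names, the doc_link table, and the sample schema and documents are all hypothetical.

```python
import re
import sqlite3
from collections import Counter


def metadata_keywords(conn):
    """Collect lowercase table and column names from the SQLite catalog."""
    keywords = set()
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        keywords.add(table.lower())
        # PRAGMA table_info returns one row per column; index 1 is the name.
        for col in conn.execute(f'PRAGMA table_info("{table}")'):
            keywords.add(col[1].lower())
    return keywords


def link_documents(conn, documents):
    """Store (doc_id, keyword, freq) rows for exact keyword matches.

    The doc_link table is a hypothetical layout chosen for this sketch.
    """
    keywords = metadata_keywords(conn)  # read before doc_link exists
    conn.execute(
        "CREATE TABLE doc_link (doc_id TEXT, keyword TEXT, freq INTEGER)")
    for doc_id, text in documents.items():
        tokens = re.findall(r"[a-z_]+", text.lower())
        counts = Counter(t for t in tokens if t in keywords)
        conn.executemany(
            "INSERT INTO doc_link VALUES (?, ?, ?)",
            [(doc_id, kw, n) for kw, n in counts.items()])


def top_k_documents(conn, k):
    """Rank documents by total number of metadata keyword matches."""
    return conn.execute(
        "SELECT doc_id, SUM(freq) AS score FROM doc_link "
        "GROUP BY doc_id ORDER BY score DESC, doc_id LIMIT ?",
        (k,)).fetchall()


# Hypothetical schema and documents, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary REAL)")
conn.execute("CREATE TABLE department (dept_name TEXT, budget REAL)")
docs = {
    "d1": "The employee salary report lists each employee by name.",
    "d2": "Department budget planning for next year.",
    "d3": "Unrelated text about the weather.",
}
link_documents(conn, docs)
print(top_k_documents(conn, 2))
```

Here the ranking is a plain ORDER BY with LIMIT over summed match frequencies; the thesis studies how to compute such top-k queries efficiently, which this sketch does not attempt.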

Date: Friday, July 31, 2009
Time: 10:00 AM
Place: 563-PGH
Faculty, students, and the general public are invited.
Advisor: Dr. Carlos Ordonez