Computer Science Seminar - University of Houston


Breaking New Ground in Authorship Analysis: Leveraging Data in Cross-Domain Settings

Seminar Slides: Download (PPT)

When: Monday, March 24, 2014
Where: PGH 232
Time: 11:00 AM

Speaker: Prof. Thamar Solorio, University of Alabama at Birmingham

Host: Prof. Ricardo Vilalta

Authorship Analysis (AA) is the task of modeling the writeprint of authors to determine authorship of a document, generate a profile of the author, or identify cases of plagiarism. Most previous work in AA assumes the availability of samples with known authorship that closely match the domain of the documents of interest. A strong assumption like this one limits the applications of AA approaches. In this talk I will discuss research that addresses this key outstanding challenge. A first step in cross-topic settings consists of leveraging documents with known authorship from different topics. The next step is to take advantage of the large amounts of free text available for each cross-domain setting to learn general lexical, stylistic, and syntactic distributional correspondences. These correspondences are used to map the out-of-domain texts to a representation that is closer to the target domain.

Direct contributions of this research include new approaches for extracting and embedding cross-domain prior knowledge into AA models in the form of distributional trajectories, and a solid understanding of the influence of topic and genre on the feature engineering process for AA, which will also be helpful in other text processing tasks. This research advances the field of forensic linguistics, which is of major relevance for national security.

In the last part of my talk, I will briefly describe ongoing research in the areas of mixed-language processing, information extraction from clinical records and user-generated data, and natural language processing for language assessment in clinical settings.

Bio:
Thamar Solorio joined the Department of Computer and Information Sciences at UAB in 2009 as an Assistant Professor. Before that, she was a Research Associate in the Department of Computer Science at the University of Texas at Dallas. She is the founder and co-director of the Computational Representation and Analysis of Language (CoRAL) lab. Her research interests include designing new large-scale corpus-based approaches to authorship attribution (AA) in social media, AA in cross-domain settings, analysis of code-switched data in social media, and clinical applications of language processing, including information extraction from patient records and user-generated data, and the modeling of disordered language. Her research is supported by the National Science Foundation and the Office of Naval Research. She recently received the National Science Foundation Early CAREER Award for her work on cross-domain AA.