In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
will defend her dissertation
Authorship Attribution for Realistic Scenarios
Most previous work on authorship attribution makes strong assumptions about the data when framing the problem. It assumes that the set of candidate authors is small and that documents of substantial length are available for each author. It also considers only the single-genre scenario, where the texts of known authorship share the topic and genre of the text to be attributed. In today's world, where most communication happens online, text is likely to be short, and the anonymity that social media offers makes it hard to narrow down the candidate authors. Moreover, for domains such as email, in-domain data may not be obtainable, so we need to be able to use data from more readily available sources such as tweets and reviews. We devise a more practical, albeit more challenging, problem that is closely aligned with real-world authorship attribution. We consider short documents, a long list of possible authors, and the ability to leverage datasets from any available domain to train our models. In this work, we build neural-network-based models that create a well-rounded representation of the input text. A good representation must capture even the smallest signals in the text that point toward the author; only such a model can work for short texts. Such a representation also helps the models remain fairly robust as the number of authors increases, and our results show that we were able to achieve this. Our cross-domain representations are capable of distilling out the topic-specific attributes of the text so that what remains is owing purely to the author's style. This ensures that attribution performance does not degrade when we move from in-domain data to cross-domain data. It is essential for authorship attribution methods to work in realistic scenarios, even though this adds complexity to the task. We find that it is indeed possible to create methods that perform well even in these challenging situations.
Date: Wednesday, April 18, 2018
Time: 11:30 AM
Place: PGH 501D
Advisor: Dr. Thamar Solorio
Faculty, students, and the general public are invited.