Calendar - University of Houston

[Defense] Discover Fine-Grained Latent Information using Pre-Trained Language Models

Tuesday, April 27, 2021

10:00 am - 12:00 pm

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
Yifan Zhang
will defend his dissertation
Discover Fine-Grained Latent Information using Pre-Trained Language Models


In this work, we explore several methods for two major areas in Natural Language Processing: Sentiment Analysis and Authorship Problems. All of the proposed methods are based on deep neural network models.

For sentiment analysis, we propose several iterations of a framework called the Sentiment-Aspect Attribution Module (SAAM). SAAM works on top of traditional neural networks and is designed to address multi-aspect sentiment classification and sentiment regression. The framework exploits the correlations between sentence-level embedding features and variations of document-level aspect rating scores. We demonstrate several variations of the framework on top of CNN- and RNN-based models. Experiments on a hotel review dataset and a beer review dataset show that SAAM improves sentiment analysis performance over the corresponding base models. Moreover, because the framework combines sentence-level scores into document-level scores in an intuitive way, it can provide deeper insight into the data (e.g., semi-supervised sentence aspect labeling). We conclude with a detailed analysis that shows the potential of our models for other applications, such as sentiment snippet extraction.
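
As a rough illustration of combining sentence-level scores into document-level aspect scores, the sketch below uses a softmax-weighted average per aspect. It is not the SAAM architecture, whose details are not given in this abstract. The function names, the relevance inputs, and the softmax weighting are all assumptions made for the example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_document_scores(sentence_scores, sentence_relevance):
    """Combine per-sentence aspect scores into document-level aspect scores.

    sentence_scores[i][a]    : hypothetical sentiment score of sentence i for aspect a
    sentence_relevance[i][a] : hypothetical unnormalized relevance of sentence i to aspect a

    Returns (doc_scores, attributions), where attributions[a][i] is the
    weight given to sentence i when scoring aspect a.
    """
    n_aspects = len(sentence_scores[0])
    doc_scores = []
    attributions = []
    for a in range(n_aspects):
        # Normalize relevance across sentences so the weights sum to 1.
        weights = softmax([row[a] for row in sentence_relevance])
        # Document score for this aspect is the weighted average of sentence scores.
        score = sum(w * row[a] for w, row in zip(weights, sentence_scores))
        doc_scores.append(score)
        attributions.append(weights)
    return doc_scores, attributions


if __name__ == "__main__":
    # Three sentences, two made-up aspects (say, "cleanliness" and "service").
    scores = [[1.0, 4.0], [5.0, 2.0], [3.0, 3.0]]
    relevance = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
    doc, attr = aggregate_document_scores(scores, relevance)
    print(doc)
    print(attr)
```

In a setup like this, the per-aspect weights can be read as a soft attribution of each sentence to each aspect, which is the kind of by-product that sentence aspect labeling or snippet extraction could build on.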

For authorship analysis, we focus on authorship attribution, authorship verification, and style change detection. As part of this work, we also create and make publicly available a multi-label Authorship Attribution dataset (MLPA-400), consisting of 400 scientific publications by 20 authors from the field of Machine Learning. We then explore the use of Convolutional Neural Networks (CNNs) for multi-label Authorship Attribution (AA) problems and propose a CNN specifically designed for such tasks. Additionally, we propose an unsupervised solution to the Authorship Verification task that fine-tunes a pre-trained deep language model to compute a new metric called DV-Distance. The proposed metric is a measure of the difference between two authors that takes into account the knowledge transferred from the pre-trained model. Our design addresses the problem of non-comparability in authorship verification, which is frequently encountered in small or cross-domain corpora.

To the best of our knowledge, our work is the first to introduce a method designed with non-comparability in mind from the ground up, rather than addressing it indirectly. It is also one of the first to use Deep Language Models in this setting. The approach is intuitive and easy to understand and interpret through visualization. Our method is also significantly faster than much of the competition: the winner of the PAN 2015 challenge has a runtime of 21 hours 44 minutes, whereas our model takes 1 minute to produce more accurate predictions. Experiments on six datasets show our approach matching or surpassing the current state of the art and strong baselines in most tasks.

Both MASA and DV-Distance have a lot of room for improvement. We will continue to refine both ideas through additional cycles of analysis and improvement. We hope the contributions in this work will help both to advance these tasks and to provide more insight into the mechanisms of neural networks in general.

Tuesday, April 27, 2021
10:00AM - 12:00PM CT
Online via MS Teams

Dr. Arjun Mukherjee, dissertation advisor

Faculty, students and the general public are invited.