[Defense] Discover Fine-Grained Latent Information using Pre-Trained Language Models
Tuesday, April 27, 2021
10:00 am - 12:00 pm
In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Yifan Zhang will defend his dissertation

Discover Fine-Grained Latent Information using Pre-Trained Language Models
Abstract
In this work, we explore several methods to address two major areas in Natural Language Processing: Sentiment Analysis and Authorship Problems. All of the proposed methods are based on deep neural network models.

For sentiment analysis, we propose several iterations of a framework called the Sentiment-Aspect Attribution Module (SAAM). SAAM works on top of traditional neural networks and is designed to address multi-aspect sentiment classification and sentiment regression. The framework works by exploiting the correlations between sentence-level embedding features and variations in document-level aspect rating scores. We demonstrate several variations of our framework on top of CNN- and RNN-based models. Experiments on a hotel review dataset and a beer review dataset show that SAAM improves sentiment analysis performance over the corresponding base models. Moreover, because of the way our framework intuitively combines sentence-level scores into document-level scores, it provides deeper insight into the data (e.g., semi-supervised sentence aspect labeling). Hence, we conclude this part with a detailed analysis that shows the potential of our models for other applications, such as sentiment snippet extraction.
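The attribution idea above can be illustrated with a minimal sketch (all names, shapes, and the attention-style aggregation here are assumptions for illustration, not the dissertation's exact formulation): each sentence embedding receives a per-aspect score, and learned weights combine sentence-level scores into document-level aspect ratings.

```python
import numpy as np

# Hypothetical SAAM-style attribution sketch (assumed details): sentence
# embeddings -> per-aspect sentence scores -> weighted document scores.
rng = np.random.default_rng(0)

num_sentences, embed_dim, num_aspects = 5, 16, 3
sentence_embeddings = rng.normal(size=(num_sentences, embed_dim))

# Scoring and attention parameters, randomly initialized here; in practice
# they would be learned on top of a CNN/RNN sentence encoder.
W_score = rng.normal(size=(embed_dim, num_aspects))
W_attn = rng.normal(size=(embed_dim, num_aspects))

sentence_scores = sentence_embeddings @ W_score   # (num_sentences, num_aspects)
attn_logits = sentence_embeddings @ W_attn        # (num_sentences, num_aspects)
attn_weights = np.exp(attn_logits) / np.exp(attn_logits).sum(axis=0)

# Document-level rating per aspect: attention-weighted sum of sentence scores.
document_scores = (attn_weights * sentence_scores).sum(axis=0)  # (num_aspects,)

# The weights also indicate which sentence drives each aspect, which is the
# kind of signal that enables semi-supervised sentence aspect labeling.
most_relevant_sentence = attn_weights.argmax(axis=0)
print(document_scores.shape, most_relevant_sentence)
```

Because the document score is an explicit weighted sum over sentences, inspecting the weights directly yields the sentence-to-aspect attributions mentioned above.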
For authorship analysis, we focus our research on authorship attribution, authorship verification, and style change detection. As part of this work, we also create and make publicly available a multi-label Authorship Attribution dataset (MLPA-400), consisting of 400 scientific publications by 20 authors from the field of Machine Learning. We then explore the use of Convolutional Neural Networks (CNNs) for multi-label Authorship Attribution (AA) problems and propose a CNN specifically designed for such tasks. Additionally, we propose an unsupervised solution to the Authorship Verification task that fine-tunes a pre-trained deep language model to compute a new metric called DV-Distance. The proposed metric measures the difference between two authors while taking into account the knowledge transferred from the pre-trained model. Our design addresses the problem of non-comparability in authorship verification, frequently encountered in small or cross-domain corpora.
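A DV-Distance-style comparison can be sketched as follows (a hedged toy version with assumed details, not the dissertation's exact formulation): each document is summarized by a vector of deviations between its observed token representations and what a pre-trained language model predicts, and two documents are compared by the distance between these deviation vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

def deviation_vector(token_states: np.ndarray, predicted_states: np.ndarray) -> np.ndarray:
    """Mean per-token deviation between observed and model-predicted states
    (assumed summary; a real system would take states from a pre-trained LM)."""
    return (token_states - predicted_states).mean(axis=0)

def dv_distance(doc_a, doc_b) -> float:
    """Cosine distance between two deviation vectors (assumed form of the metric)."""
    va = deviation_vector(*doc_a)
    vb = deviation_vector(*doc_b)
    cos = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
    return 1.0 - float(cos)

# Toy stand-ins for (observed, predicted) hidden states of two documents.
doc_known = (rng.normal(size=(20, 8)), rng.normal(size=(20, 8)))
doc_unknown = (rng.normal(size=(15, 8)), rng.normal(size=(15, 8)))

d = dv_distance(doc_known, doc_unknown)
print(round(d, 3))
```

Because the pre-trained model supplies the reference predictions, documents from different domains are measured against a shared baseline, which is one way such a design can mitigate the non-comparability problem described above.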
To the best of our knowledge, ours is the first method designed with non-comparability in mind from the ground up, rather than addressing it indirectly, and it is also one of the first to use deep language models in this setting. The approach is intuitive, and it is easy to understand and interpret through visualization. Performance-wise, our method is significantly faster than much of the competition: the winner of the PAN 2015 challenge has a runtime of 21 hours and 44 minutes, whereas our model takes one minute to produce more accurate predictions. Experiments on six datasets show our approach matching or surpassing the current state of the art and strong baselines in most tasks.
Both SAAM and DV-Distance leave considerable room for improvement, and we will continue to refine and analyze both ideas. We hope the contributions in this work will help advance these tasks and provide more insight into the mechanisms of neural networks in general.
Tuesday, April 27, 2021
10:00 AM - 12:00 PM CT
Online via MS Teams
Dr. Arjun Mukherjee, dissertation advisor
Faculty, students and the general public are invited.
