[Defense] Preprocessing for Code-Switching Models
Monday, November 22, 2021
3:30 pm - 5:00 pm
In Partial Fulfillment of the Requirements for the Bachelor of Science
Dwija Parikh
will defend her senior honors thesis
Preprocessing for Code-Switching Models
Abstract
Code-switching is an omnipresent phenomenon in multilingual communities all around the world, but it remains a challenge for Natural Language Processing (NLP) systems due to the lack of proper data and processing techniques. Hindi-English code-switched text on social media is often transliterated into the Latin script, which prevents the use of monolingual resources available in the native Devanagari script.
This thesis proposes a method to normalize and back-transliterate code-switched Hindi-English text. In addition, we present a grapheme-to-phoneme (G2P) conversion technique for romanized Hindi data. As part of this project, we also release a dataset of script-corrected Hindi-English code-switched sentences labeled for named entity recognition and part-of-speech tagging to facilitate further research in this area. The techniques presented in this thesis aim to benefit downstream NLP applications, including named entity recognition, speech processing systems, and conversational systems, among others.
Monday, November 22, 2021, 3:30 PM - 5:00 PM CT
Online via Zoom
Dr. Thamar Solorio, thesis advisor
Faculty, students and the general public are invited.
