[Defense] Preprocessing for Code-Switching Models
Monday, November 22, 2021
3:30 pm - 5:00 pm
will defend her senior honors thesis
Preprocessing for Code-Switching Models
Code-switching is an omnipresent phenomenon in multilingual communities around the world, but it remains a challenge for Natural Language Processing (NLP) systems due to the lack of suitable data and processing techniques. Hindi-English code-switched text on social media is often transliterated to the Latin script, which prevents the use of monolingual resources available in the native Devanagari script.
This thesis proposes a method to normalize and back-transliterate code-switched Hindi-English text. In addition, we present a grapheme-to-phoneme (G2P) conversion technique for romanized Hindi data. As part of this project, we also release a dataset of script-corrected Hindi-English code-switched sentences, labeled for named entity recognition and part-of-speech tagging, to facilitate further research in this area. The techniques presented in this thesis aim to benefit downstream NLP applications, including named entity recognition, speech processing systems, and conversational systems.
3:30PM - 5:00PM CT
Online via Zoom
Dr. Thamar Solorio, thesis advisor
Faculty, students and the general public are invited.