GitHub - atenglens/NMT_Taiwanese_English

Independent Work Spring 2022 at Princeton University

Neural Machine Translation Systems for the Taiwanese-English Text-to-Text Task

Author: Ashley Teng

Adviser: Professor Danqi Chen

Abstract: As a primarily spoken language, Taiwanese is an extremely low-resource language with very little research conducted in natural language processing (NLP) and no published exploration in neural machine translation (NMT). To help preserve the Taiwanese language, we build the first-ever NMT systems for the Taiwanese-English Text-to-Text task. We first create a new dataset by extracting the text of an online Bible available in both Taiwanese and English to form a parallel corpus of approximately 30,000 sentence pairs. Then, we fine-tune vanilla sequence-to-sequence models with Long Short-Term Memory networks as well as various transformer models on the dataset. Our results show that the best-performing model is Marian pre-trained on the Romance languages, with a BLEU score of 31.92. We observe that smaller transformers tend to perform better and that increasing a model's size does not necessarily improve performance. We also observe that models pre-trained on Asian languages similar to Taiwanese do not necessarily perform better than models pre-trained on non-Asian languages, and that better performance is achieved with a tokenizer that does not assume words are separated by spaces.
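For readers unfamiliar with the BLEU metric reported in the abstract, the following is a minimal sketch of how a corpus-level BLEU score can be computed. It is not the evaluation code used in this project: the function names `ngrams` and `corpus_bleu` are hypothetical, and the sketch assumes whitespace tokenization, a single reference per hypothesis, and no smoothing. Standard toolkits apply their own tokenization, so their scores will generally differ from this one.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale, one reference each.

    Assumptions (not taken from the project): whitespace tokenization
    and no smoothing of zero n-gram precisions.
    """
    matches = [0] * max_n   # clipped n-gram matches, per n-gram order
    totals = [0] * max_n    # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each hypothesis n-gram count
            # by the number of times it appears in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

Calling `corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])` scores one hypothesis against one reference. A hypothesis identical to its reference scores 100, and one sharing no words with it scores 0.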

New Taiwanese-English dataset: https://huggingface.co/datasets/atenglens/taiwanese_english_translation

Repository contents (271 commits):
seq2seq
transformer
.gitignore
README.md
