This repository accompanies our preprint, "CantoNLU: A benchmark for Cantonese natural language understanding", in which we introduce a general language understanding benchmark for Cantonese, in collaboration with York Hay Ng, Sophia Chan, Helena Zhao, and Annie En-Shiun Lee.
This repository requires a local copy of a BERT model and a Wikipedia dataset to run.
To download these resources, run
python download.py --lang=yue
where lang can be yue (Cantonese) or wuu (Wu Chinese).
To continue pre-training from Mandarin BERT, run
python run.py --pretrain --lang=yue
where lang can be yue or wuu.
Additional flags are available; see run.py.
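The download and pre-training steps can be chained for both languages. The following is a minimal sketch, not part of this repository: build_commands and SUPPORTED_LANGS are hypothetical names, and the only flags used are the ones shown in this README (--lang and --pretrain). It prints the commands rather than running them.

```python
import shlex

# Language codes named in this README.
SUPPORTED_LANGS = ("yue", "wuu")


def build_commands(lang):
    """Return the download and pre-training commands for one language.

    Hypothetical helper: it only assembles the two command lines
    documented in this README and rejects any other language code.
    """
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"lang must be one of {SUPPORTED_LANGS}, got {lang!r}")
    return [
        ["python", "download.py", f"--lang={lang}"],
        ["python", "run.py", "--pretrain", f"--lang={lang}"],
    ]


# Print the commands in order: download first, then pre-train.
for lang in SUPPORTED_LANGS:
    for cmd in build_commands(lang):
        print(shlex.join(cmd))
```

To execute the commands instead of printing them, each list can be passed to subprocess.run(cmd, check=True), which stops at the first failing step.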
Fine-tuning on POS and DEPS requires the Cantonese UD (Universal Dependencies) file.
Download the CoNLL-U file and place it in data/, then use the conllu_2_pos_dataset() function in utils.py.
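For readers unfamiliar with CoNLL-U, the sketch below shows roughly what extracting POS data from such a file involves. It is an illustration only and is not the implementation of conllu_2_pos_dataset() in utils.py; read_conllu_pos is a hypothetical name, and the three-token sample sentence is made up for the example. It relies only on the standard CoNLL-U layout: ten tab-separated columns, with the word form in column 2 and the universal POS tag in column 4.

```python
def read_conllu_pos(lines):
    """Yield (tokens, upos_tags) for each sentence in CoNLL-U formatted lines.

    Hypothetical helper for illustration; the repository's own
    conllu_2_pos_dataset() may behave differently.
    """
    tokens, tags = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ...".
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        tokens.append(cols[1])  # FORM
        tags.append(cols[3])    # UPOS
    if tokens:
        # Handle a final sentence with no trailing blank line.
        yield tokens, tags


# Made-up sample in CoNLL-U layout, for illustration only.
SAMPLE = (
    "# text = 我食飯\n"
    "1\t我\t我\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\t食\t食\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t飯\t飯\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "\n"
)

sentences = list(read_conllu_pos(SAMPLE.splitlines()))
print(sentences)
```

The same loop could collect columns 7 (HEAD) and 8 (DEPREL) for the dependency task; how the repository actually structures those labels is not specified here.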
The monolingual and transfer models are available at the following Google Drive links.