This code is released as part of our EACL paper on Robustification of Multilingual Language Models to Real-world Noise with Robust Contrastive Pretraining (official link coming up).
If you use the code or data in this repository, please cite the following work:
@inproceedings{eacl-2023-asa-sailik,
title = "Robustification of Multilingual Language Models to Real-world Noise with Robust Contrastive Pretraining",
author = {Stickland, Asa Cooper and Sengupta, Sailik and Krone, Jason and Mansour, Saab and He, He},
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
year = "2023",
publisher = "Association for Computational Linguistics"
}
Because the code base has several code paths (e.g. Wikipedia data downloading, model pre-training, joint IC-SL evaluation, XNLI and NER evaluation), we do not provide a single requirements.txt file with all dependencies. We suggest installing dependencies as and when they are required. The basic building-block packages are:
- python>=3.6
- torch==1.6.0
- transformers==3.0.2
- seqeval==0.0.12
- pytorch-crf==0.7.2
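
Since no requirements.txt is provided, a quick way to confirm that an environment matches the pins above is to compare installed versions against them. The following is a minimal sketch, not part of the repository: the names `PINNED`, `installed_version` and `find_mismatches` are hypothetical, and it assumes Python 3.8 or newer for `importlib.metadata` (the list above only requires 3.6).

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
PINNED = {
    "torch": "1.6.0",
    "transformers": "3.0.2",
    "seqeval": "0.0.12",
    "pytorch-crf": "0.7.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a tuple of (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Ignore local build suffixes such as "1.6.0+cu101".
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

Running the script prints one line per package that is missing or installed at a different version, and prints nothing when every pin is satisfied.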
The code base is built on the shoulders of other code bases. Licenses for these code bases can be found inside THIRD_PARTY_LICENSES.md. Any amendments made to the code in this code base are licensed as per LICENSE.
The data inside paper_data have licenses of their own, which override the aforementioned license. More information about the individual licenses for the data can be found in this README.md file.
Run scripts can be found inside the runner_scripts directory.