```
cd baselines/scripts/retrain_openlid
```
That folder contains mostly the OpenLID authors' scripts with minor changes.
For the baseline family-specific language models, I recommend running from step 4 of the pipeline below (the `make_training_openlid.py` step), skipping cleaning, so that the sampling is not influenced by some minor languages and we don't lose data. The current cleaning is language-independent, so you can simply take `/scratch/project_465002259/eurolid/02-11/openlid_stage2_prep.fasttext` and keep only the languages of interest from there, e.g. with the sketch below.
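A minimal sketch of such a filter; the script name and the label set are hypothetical, and it assumes the usual fastText layout where each line starts with a `__label__<lang>` token:

```python
# filter_fasttext_langs.py -- hypothetical helper, not part of the repo.
# Keep only the lines whose fastText label is in KEEP.
import sys

# Example label set; replace with your languages of interest.
KEEP = {"__label__dan_Latn", "__label__swe_Latn", "__label__nob_Latn"}

def filter_langs(src_path: str, dst_path: str) -> None:
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # The label is the first whitespace-separated token.
            parts = line.split(maxsplit=1)
            if parts and parts[0] in KEEP:
                dst.write(line)

if __name__ == "__main__":
    filter_langs(sys.argv[1], sys.argv[2])
```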
Following OpenLID's instructions (be cautious, they were not fully up to date), the pipeline is as follows:

- Find additional data and format it by the scheme `<text>\t<language>\t<source>`. If it is an addition to an existing language, it can be appended from a `*.parquet` or a `*.tsv` using the script `append_to_openlid_parquet.py`. If the data are for a new language, just convert them to a parquet (see the sketch after this list).
- Data for all languages must be in the same directory.
- The most recent data (data added for some languages; `ara_Arab` and `fas_Arab` merged; `lat_Latn`, `srp_Latn`, and `zxx_Zxxx` added) are at `/scratch/project_465002259/eurolid/02-11-data/`.
- Cleaning, deduplication, up/downsampling, writing to the FastText format, and shuffling are done by `make_training_openlid.py`. I was able to run that script on my laptop with only 16 GB of memory, except for shuffling; if you run out of memory when shuffling, run `shuf.sh` on LUMI.
  When running from scratch, the command is

  ```
  python3 make_training_openlid.py <output_dir> --data_dir <data_dir>
  ```

  If the stage 2 output of `make_training_openlid.py`, named `openlid_stage2_prep.fasttext`, is already in the `<data_dir>` directory and contains only the languages of interest, the preprocessing command is:

  ```
  python3 make_training_openlid.py <output_dir> --skip_clean --skip_sort
  ```
- The training on LUMI is run by `lid.sh`. Don't forget to pass a new path to the data/saved model instead of the default one. The hyperparameters are the same as in OpenLID.
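For the "new language" case in the first step above, here is a minimal sketch of the TSV-to-parquet conversion, assuming pandas with pyarrow installed; the file names are illustrative, and `append_to_openlid_parquet.py` remains the repo's tool for extending an existing language:

```python
# tsv_to_parquet.py -- hypothetical helper; file names are illustrative.
import csv

import pandas as pd

# QUOTE_NONE keeps the raw text intact; the TSV must not contain tabs
# or newlines inside a sample.
df = pd.read_csv(
    "new_lang.tsv",
    sep="\t",
    names=["text", "language", "source"],
    quoting=csv.QUOTE_NONE,
    dtype=str,
)
df.to_parquet("new_lang.parquet", index=False)  # needs pyarrow or fastparquet
```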
The data on LUMI are in `/scratch/project_465001890/eurolid/glotlid-corpus/`. It is also possible to download them using the script `download_glotlid.py`. `make_list_of_glotlid_sources.py` creates the list of GlotLID sources for each language and shows the number of samples in the GlotLID data. There is no need to run it, since the resulting list is in `other.tsv` in the root of this repository.
The script `add_from_glotlid.py` shows how to select only the data sources that are of reliable quality and not proprietary (beware of hardcoded paths). The list of filters there covers only the languages we worked with before; for Scandinavian etc., if there are other sources, check their quality and license against the GlotLID list. We also collected the licenses of the sources we used in the LangID sources sheet. That script also ensures that the Wikipedia data from GlotLID do not intersect with the OpenLID Wikipedia data.
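For intuition, a minimal sketch of such an overlap check, not the actual logic of `add_from_glotlid.py`; the file names are hypothetical, and both files are assumed to follow the `<text>\t<language>\t<source>` scheme:

```python
# dedup_glotlid_wiki.py -- illustrative sketch, not the repo's implementation.

# Collect every OpenLID Wikipedia text.
seen = set()
with open("openlid_wikipedia.tsv", encoding="utf-8") as f:
    for line in f:
        seen.add(line.split("\t", 1)[0].strip())

# Keep only the GlotLID Wikipedia samples that OpenLID does not already have.
with open("glotlid_wikipedia.tsv", encoding="utf-8") as src, \
     open("glotlid_wikipedia_dedup.tsv", "w", encoding="utf-8") as dst:
    for line in src:
        if line.split("\t", 1)[0].strip() not in seen:
            dst.write(line)
```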