```
cd baselines/scripts/retrain_openlid
```
That folder contains mostly the OpenLID authors' scripts with minor changes.
For the baseline family-specific language models, I recommend running from step 4 of the pipeline below (the `make_training_openlid.py` step), skipping cleaning, so that the sampling is not influenced by some minor languages and we don't lose data. The current cleaning is language-independent, so you can simply take `/scratch/project_465002259/eurolid/02-11/openlid_stage2_prep.fasttext` and keep only the languages of interest from there, e.g. with the sketch below.
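A minimal sketch of such a filter; the script name and the label set are hypothetical, and it assumes the usual fastText layout where each line starts with a `__label__<lang>` token:

```python
# filter_fasttext_langs.py -- hypothetical helper, not part of the repo.
# Keep only the lines whose fastText label is in KEEP.
import sys

# Example label set; replace with your languages of interest.
KEEP = {"__label__dan_Latn", "__label__swe_Latn", "__label__nob_Latn"}

def filter_langs(src_path: str, dst_path: str) -> None:
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # The label is the first whitespace-separated token.
            parts = line.split(maxsplit=1)
            if parts and parts[0] in KEEP:
                dst.write(line)

if __name__ == "__main__":
    filter_langs(sys.argv[1], sys.argv[2])
```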
Following OpenLID's instructions (be cautious, they were not fully up to date), the pipeline is as follows:

- Find additional data and format it by the scheme `<text>\t<language>\t<source>`. If it is an addition to an existing language, it can be appended from a `*.parquet` or a `*.tsv` using the script `append_to_openlid_parquet.py`. If the data are for a new language, just convert them to a parquet (see the sketch after this list).
- Data for all languages must be in the same directory.
- The most recent data (data added for some languages; `ara_Arab` and `fas_Arab` merged; `lat_Latn`, `srp_Latn`, and `zxx_Zxxx` added) are at `/scratch/project_465002259/eurolid/02-11-data/`.
- Cleaning, deduplication, up/downsampling, writing to the FastText format, and shuffling are done by `make_training_openlid.py`. I was able to run that script on my laptop with only 16 GB of memory, except for shuffling; if you run out of memory when shuffling, run `shuf.sh` on LUMI.
  When running from scratch, the command is

  ```
  python3 make_training_openlid.py <output_dir> --data_dir <data_dir>
  ```

  If the stage 2 output of `make_training_openlid.py`, named `openlid_stage2_prep.fasttext`, is already in the `<data_dir>` directory and contains only the languages of interest, the preprocessing command is:

  ```
  python3 make_training_openlid.py <output_dir> --skip_clean --skip_sort
  ```
- The training on LUMI is run by `lid.sh`. Don't forget to pass a new path to the data/saved model instead of the default one. The hyperparameters are the same as in OpenLID.
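For the "new language" case in the first step above, here is a minimal sketch of the TSV-to-parquet conversion, assuming pandas with pyarrow installed; the file names are illustrative, and `append_to_openlid_parquet.py` remains the repo's tool for extending an existing language:

```python
# tsv_to_parquet.py -- hypothetical helper; file names are illustrative.
import csv

import pandas as pd

# QUOTE_NONE keeps the raw text intact; the TSV must not contain tabs
# or newlines inside a sample.
df = pd.read_csv(
    "new_lang.tsv",
    sep="\t",
    names=["text", "language", "source"],
    quoting=csv.QUOTE_NONE,
    dtype=str,
)
df.to_parquet("new_lang.parquet", index=False)  # needs pyarrow or fastparquet
```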
The data on LUMI are in `/scratch/project_465001890/eurolid/glotlid-corpus/`. It is also possible to download them using the script `download_glotlid.py`. `make_list_of_glotlid_sources.py` creates the list of GlotLID sources for each language and shows the number of samples in the GlotLID data. There is no need to run it, since the resulting list is in `other.tsv` in the root of this repository.
The script `add_from_glotlid.py` shows how to select only the data sources that are of reliable quality and not proprietary (beware of hardcoded paths). The list of filters there covers only the languages we worked with before; for Scandinavian etc., if there are other sources, check their quality and license against the GlotLID list. We also collected the licenses of the sources we used in the LangID sources sheet. That script also ensures that the Wikipedia data from GlotLID do not intersect with the OpenLID Wikipedia data.
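For intuition, a minimal sketch of such an overlap check, not the actual logic of `add_from_glotlid.py`; the file names are hypothetical, and both files are assumed to follow the `<text>\t<language>\t<source>` scheme:

```python
# dedup_glotlid_wiki.py -- illustrative sketch, not the repo's implementation.

# Collect every OpenLID Wikipedia text.
seen = set()
with open("openlid_wikipedia.tsv", encoding="utf-8") as f:
    for line in f:
        seen.add(line.split("\t", 1)[0].strip())

# Keep only the GlotLID Wikipedia samples that OpenLID does not already have.
with open("glotlid_wikipedia.tsv", encoding="utf-8") as src, \
     open("glotlid_wikipedia_dedup.tsv", "w", encoding="utf-8") as dst:
    for line in src:
        if line.split("\t", 1)[0].strip() not in seen:
            dst.write(line)
```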