morph-tok-eval

Replicating the paper experiments

Install the dependencies listed in requirements.txt (for example, with pip install -r requirements.txt).

This repository uses Snakemake, a Python-based workflow manager in the spirit of Make, to manage the experiments.

You can run all the experiments (including downloading the CC100 data and training the tokenizers) with:

snakemake --executor local --cores all

This runs all computations locally, using all available CPU cores. Snakemake also supports parallelization in cluster environments. For this, you may need to adjust the computational resources specified in the Snakemake rules to match your cluster.

The Snakemake steps are:

Download CC100 datasets for the studied languages (rule download_cc100).
Train BPE, WordPiece and Unigram tokenizers using the Hugging Face tokenizers library (rule train_tokenizer).
Tokenize word lists with the trained tokenizers (rule tokenize_unimorph_our_tokenizer), with character-level segmentation (rule character_tokenization), and with gold segmentation (rule gold_tokenization).
Evaluate all tokenizations using boundary precision, boundary recall, and different variants of the proposed metric (rule evaluate_segmentation).
Compute the correlations between the alignment metric and boundary precision and recall (rule compute_correlations).
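
To make the evaluation step more concrete, here is a minimal sketch of boundary precision and recall for a single word. It is an illustration only, not the code used by rule evaluate_segmentation: the function names are hypothetical, segmentations are assumed to be lists of strings that concatenate to the same word, and the convention of returning 1.0 when there are no boundaries is an assumption.

```python
def boundaries(segments):
    """Return the set of internal boundary offsets of a segmentation."""
    positions = set()
    offset = 0
    # The end of the word is not counted as a boundary, so skip the last segment.
    for segment in segments[:-1]:
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_precision_recall(predicted, gold):
    """Compare predicted and gold segmentations of the same word.

    Hypothetical helper for illustration; not part of this repository.
    """
    if "".join(predicted) != "".join(gold):
        raise ValueError("Both segmentations must spell the same word.")
    predicted_b = boundaries(predicted)
    gold_b = boundaries(gold)
    correct = len(predicted_b & gold_b)
    # Assumption: with no boundaries on one side, the score defaults to 1.0.
    precision = correct / len(predicted_b) if predicted_b else 1.0
    recall = correct / len(gold_b) if gold_b else 1.0
    return precision, recall


if __name__ == "__main__":
    gold = ["un", "happi", "ness"]
    print(boundary_precision_recall(["unh", "appi", "ness"], gold))
    print(boundary_precision_recall(["un", "happiness"], gold))
```

In the second call, the single predicted boundary is correct, so precision is perfect while recall is only one half, which shows why both numbers are reported.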

After all experiments have run, the results are in the correlations directory.
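
As a rough illustration of what a correlation between two metrics means here, the sketch below computes a Pearson correlation coefficient over paired scores. This is an assumption made for illustration: the coefficient actually used by rule compute_correlations may differ, the function name is hypothetical, and the numbers in the usage example are made-up toy values, not results from the experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long score lists (illustrative helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two lists of equal length with at least two values.")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for constant input.")
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values only: one score per tokenizer for each of two metrics.
    alignment_scores = [0.2, 0.4, 0.6, 0.8]
    boundary_recall = [0.1, 0.2, 0.3, 0.4]
    print(pearson(alignment_scores, boundary_recall))
```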
