This repository contains tools for processing Arabic data through transliteration with a Maltese-centric focus.
A small demo is provided that applies the different transliteration methods to Arabic data.
Otherwise, to transliterate an entire dataset, follow these steps:
- If the source data is in a different format than the target format, reformat it to match. The data is stored in some `$INPUT_DIRECTORY`.
- Clean the text using `arclean`. The transformation code also does this automatically, but this step is necessary if you are not transforming the source Arabic data further in the following steps.
- Transform the data from above with the appropriate method. This step generates a file for each split in the dataset in some `$OUTPUT_PATH`.
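As a concrete illustration of the reformatting step, the sketch below writes a headerless two-column CSV split matching the `{"names": ["label", "text"]}` layout used for Sentiment Analysis further down. The example rows and the temporary directory standing in for `$INPUT_DIRECTORY` are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical raw examples to be reformatted into the target layout.
raw_examples = [("pos", "نص عربي"), ("neg", "مثال آخر")]

input_directory = tempfile.mkdtemp()  # stands in for $INPUT_DIRECTORY
train_path = os.path.join(input_directory, "train.csv")

# Write a headerless CSV: the column order must match the "names"
# later passed via --dataset_kwargs (here: label, then text).
with open(train_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(raw_examples)

# Read it back to confirm the layout.
with open(train_path, encoding="utf-8") as f:
    rows = list(csv.reader(f))
```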
Scripts are provided to process an entire dataset accordingly. The following arguments are shared across scripts:
- The input data, which is compatible with Hugging Face `datasets`. When passing `--train_file` (& `--validation_file`/`--test_file`), these should be JSON/CSV files.
- A `--text_column` argument, specifying the field to transform, ignoring the rest of the data.
- An `--output_path` argument to specify the directory in which the transformed data is persisted.
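The shared interface presumably resembles the following `argparse` setup; the parser itself is illustrative (the scripts may define these arguments differently), but the argument names are those listed above.

```python
import argparse

# Illustrative parser for the shared arguments; the actual scripts
# may declare these differently.
parser = argparse.ArgumentParser()
parser.add_argument("--train_file")
parser.add_argument("--validation_file")
parser.add_argument("--test_file")
parser.add_argument("--text_column", required=True,
                    help="field to transform; other fields pass through")
parser.add_argument("--output_path", required=True,
                    help="directory where transformed splits are written")

# Parse a sample command line (hypothetical paths).
args = parser.parse_args([
    "--train_file=data/sentiment_analysis/train.csv",
    "--text_column=text",
    "--output_path=output/sentiment_analysis",
])
```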
The following table provides a breakdown of the arguments to be specified for each dataset.
| Dataset | `$INPUT_DATA_ARGS` | `$TEXT_COLUMN` |
|---|---|---|
| Sentiment Analysis | `--train_file="$DATA_PATH/sentiment_analysis/train.csv" --dataset_kwargs="{\"names\": [\"label\", \"text\"]}"` | `"text"` |
| ANERCorp | `--train_file="$DATA_PATH/ANERcorp-CamelLabSplits/train.json"` | `"tokens"` |
These arguments are referred to as `$DATASET_ARGS` below, and have the following format: `$INPUT_DATA_ARGS --text_column=$TEXT_COLUMN --output_path=$OUTPUT_PATH`.
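Putting the pieces together, `$DATASET_ARGS` for the Sentiment Analysis row would expand to something like the following (`DATA_PATH` and `OUTPUT_PATH` are placeholders, and the `--dataset_kwargs` part is elided for brevity):

```shell
# Hypothetical expansion of $DATASET_ARGS for the Sentiment Analysis dataset.
DATA_PATH="data"
OUTPUT_PATH="output/sentiment_analysis"
INPUT_DATA_ARGS="--train_file=$DATA_PATH/sentiment_analysis/train.csv"
DATASET_ARGS="$INPUT_DATA_ARGS --text_column=text --output_path=$OUTPUT_PATH"
echo "$DATASET_ARGS"
```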
Further details on how to transform the data with each method are given below:
### MorphTx
To run the full system with diacritisation and morpheme-level mappings:

```shell
python rules/transliterate.py $DATASET_ARGS \
    --model_name="egy" --morphology_database="$CAMEL_TOOLS_PATH/data/morphology_db/calima-egy-c044.db"
```
### CharTx

To run the simple system with only character-level mappings:

```shell
python rules/transliterate.py $DATASET_ARGS \
    --no_diacritisation
```
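As a rough illustration of what a character-level mapping does, here is a toy mapper over a handful of letters. The Maltese-style letter choices below (e.g. ح→ħ, ش→x) are assumptions based on Maltese orthography; the actual mapping tables used by the system live in the repository and will differ from this sketch.

```python
# Toy character-level Arabic-to-Maltese-style mapper. The letter choices
# are illustrative assumptions, NOT the repository's actual tables.
CHAR_MAP = {
    "ك": "k", "ت": "t", "ب": "b", "م": "m", "ل": "l",
    "ح": "ħ",  # assumed Maltese-style mapping
    "ش": "x",  # assumed Maltese-style mapping
}

def char_transliterate(text: str) -> str:
    # Characters without a mapping pass through unchanged.
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)
```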
### Buckwalter

To run Buckwalter (lower-cased):

```shell
python baseline/transliterate.py $DATASET_ARGS \
    --scheme=buckwalter
```
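For reference, Buckwalter is a one-to-one ASCII transliteration scheme for Arabic. A minimal sketch with a partial mapping table follows, with lower-casing applied afterwards as the flag's description suggests; the full scheme covers all Arabic letters and diacritics.

```python
# Partial Buckwalter table for illustration only; the full scheme
# covers every Arabic letter and diacritic.
BUCKWALTER = {
    "ا": "A", "ب": "b", "ت": "t", "ح": "H", "س": "s",
    "ل": "l", "م": "m", "ن": "n",
}

def buckwalter_lower(text: str) -> str:
    # Map each character, then lower-case the result
    # (the "lower-cased" variant referred to above).
    return "".join(BUCKWALTER.get(ch, ch) for ch in text).lower()
```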
### Uroman

To run Uroman:

```shell
python baseline/transliterate.py $DATASET_ARGS \
    --scheme=uroman
```

This work was introduced in *Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data*. Cite as follows:
```bibtex
@inproceedings{micallef-etal-2025-maltification,
    title = "Data Augmentation for {M}altese {NLP} using Transliterated and Machine Translated {A}rabic Data",
    author = "Micallef, Kurt and
      Habash, Nizar and
      Borg, Claudia",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1177/",
    doi = "10.18653/v1/2025.findings-emnlp.1177",
    pages = "21580--21590",
    ISBN = "979-8-89176-335-7",
}
```