This repository contains tools for processing Arabic data through transliteration with a Maltese-centric focus.
A small demo is provided that applies the different transliteration methods to Arabic data.
Otherwise, to transliterate an entire dataset, follow these steps:
- If the source data is in a different format than the target format, reformat it to match. The data is stored in some `$INPUT_DIRECTORY`.
- Clean the text using `arclean`. The transformation code also does this automatically, but this step is necessary if you are not transforming the source Arabic data further in the following steps.
- Transform the data from above with the appropriate method. This step generates a file for each split in the dataset in some `$OUTPUT_PATH`.
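As a concrete illustration of the reformatting step, the sketch below writes a headerless two-column CSV split matching the `{"names": ["label", "text"]}` layout used for Sentiment Analysis further down. The example rows and the temporary directory standing in for `$INPUT_DIRECTORY` are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical raw examples to be reformatted into the target layout.
raw_examples = [("pos", "نص عربي"), ("neg", "مثال آخر")]

input_directory = tempfile.mkdtemp()  # stands in for $INPUT_DIRECTORY
train_path = os.path.join(input_directory, "train.csv")

# Write a headerless CSV: the column order must match the "names"
# later passed via --dataset_kwargs (here: label, then text).
with open(train_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(raw_examples)

# Read it back to confirm the layout.
with open(train_path, encoding="utf-8") as f:
    rows = list(csv.reader(f))
```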
Scripts are provided to process an entire dataset accordingly. The following arguments are shared across scripts:
- The input data, which is compatible with Hugging Face `datasets`. When passing `--train_file` (& `--validation_file`/`--test_file`), these should be JSON/CSV files.
- A `--text_column` argument, specifying the field to transform, ignoring the rest of the data.
- An `--output_path` argument to specify the directory in which the transformed data is persisted.
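The shared interface presumably resembles the following `argparse` setup; the parser itself is illustrative (the scripts may define these arguments differently), but the argument names are those listed above.

```python
import argparse

# Illustrative parser for the shared arguments; the actual scripts
# may declare these differently.
parser = argparse.ArgumentParser()
parser.add_argument("--train_file")
parser.add_argument("--validation_file")
parser.add_argument("--test_file")
parser.add_argument("--text_column", required=True,
                    help="field to transform; other fields pass through")
parser.add_argument("--output_path", required=True,
                    help="directory where transformed splits are written")

# Parse a sample command line (hypothetical paths).
args = parser.parse_args([
    "--train_file=data/sentiment_analysis/train.csv",
    "--text_column=text",
    "--output_path=output/sentiment_analysis",
])
```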
The following table provides a breakdown of the arguments to be specified for each dataset.
| Dataset | `$INPUT_DATA_ARGS` | `$TEXT_COLUMN` |
|---|---|---|
| Sentiment Analysis | `--train_file="$DATA_PATH/sentiment_analysis/train.csv" --dataset_kwargs="{\"names\": [\"label\", \"text\"]}"` | `"text"` |
| ANERCorp | `--train_file="$DATA_PATH/ANERcorp-CamelLabSplits/train.json"` | `"tokens"` |
These arguments are referred to as `$DATASET_ARGS` below, and have the following format: `$INPUT_DATA_ARGS --text_column=$TEXT_COLUMN --output_path=$OUTPUT_PATH`.
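Putting the pieces together, `$DATASET_ARGS` for the Sentiment Analysis row would expand to something like the following (`DATA_PATH` and `OUTPUT_PATH` are placeholders, and the `--dataset_kwargs` part is elided for brevity):

```shell
# Hypothetical expansion of $DATASET_ARGS for the Sentiment Analysis dataset.
DATA_PATH="data"
OUTPUT_PATH="output/sentiment_analysis"
INPUT_DATA_ARGS="--train_file=$DATA_PATH/sentiment_analysis/train.csv"
DATASET_ARGS="$INPUT_DATA_ARGS --text_column=text --output_path=$OUTPUT_PATH"
echo "$DATASET_ARGS"
```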
Further details on how to transform the data with each method are given below:
### MorphTx
To run the full system with diacritisation and morpheme-level mappings:

```shell
python rules/transliterate.py $DATASET_ARGS \
    --model_name="egy" --morphology_database="$CAMEL_TOOLS_PATH/data/morphology_db/calima-egy-c044.db"
```
### CharTx

To run the simple system with only character-level mappings:

```shell
python rules/transliterate.py $DATASET_ARGS \
    --no_diacritisation
```
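As a rough illustration of what a character-level mapping does, here is a toy mapper over a handful of letters. The Maltese-style letter choices below (e.g. ح→ħ, ش→x) are assumptions based on Maltese orthography; the actual mapping tables used by the system live in the repository and will differ from this sketch.

```python
# Toy character-level Arabic-to-Maltese-style mapper. The letter choices
# are illustrative assumptions, NOT the repository's actual tables.
CHAR_MAP = {
    "ك": "k", "ت": "t", "ب": "b", "م": "m", "ل": "l",
    "ح": "ħ",  # assumed Maltese-style mapping
    "ش": "x",  # assumed Maltese-style mapping
}

def char_transliterate(text: str) -> str:
    # Characters without a mapping pass through unchanged.
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)
```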
### Buckwalter

To run Buckwalter (lower-cased):

```shell
python baseline/transliterate.py $DATASET_ARGS \
    --scheme=buckwalter
```
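For reference, Buckwalter is a one-to-one ASCII transliteration scheme for Arabic. A minimal sketch with a partial mapping table follows, with lower-casing applied afterwards as the flag's description suggests; the full scheme covers all Arabic letters and diacritics.

```python
# Partial Buckwalter table for illustration only; the full scheme
# covers every Arabic letter and diacritic.
BUCKWALTER = {
    "ا": "A", "ب": "b", "ت": "t", "ح": "H", "س": "s",
    "ل": "l", "م": "m", "ن": "n",
}

def buckwalter_lower(text: str) -> str:
    # Map each character, then lower-case the result
    # (the "lower-cased" variant referred to above).
    return "".join(BUCKWALTER.get(ch, ch) for ch in text).lower()
```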
### Uroman

To run Uroman:

```shell
python baseline/transliterate.py $DATASET_ARGS \
    --scheme=uroman
```

This work was introduced in *Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data*. Cite as follows:
```bibtex
@inproceedings{micallef-etal-2025-maltification,
    title = "Data Augmentation for {M}altese {NLP} using Transliterated and Machine Translated {A}rabic Data",
    author = "Micallef, Kurt and
      Habash, Nizar and
      Borg, Claudia",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1177/",
    doi = "10.18653/v1/2025.findings-emnlp.1177",
    pages = "21580--21590",
    ISBN = "979-8-89176-335-7",
}
```