
Maltification of Arabic

This repository contains tools for processing Arabic data through transliteration with a Maltese-centric focus.

Arabic Data Processing

A small demo is provided that applies different transliteration methods to Arabic data.

Otherwise, to transliterate an entire dataset, follow these steps:

  1. If the source data is in a different format from the target data, reformat it to match the target format. The data is stored in some $INPUT_DIRECTORY.
  2. Clean the text using arclean. The transformation code also does this automatically, so this step is only necessary if you are not transforming the source Arabic data further in the following steps.
  3. Transform the cleaned data with the appropriate method. This step generates one file per dataset split in some $OUTPUT_PATH.
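The cleaning in step 2 can be illustrated with a minimal sketch. This is an assumed, simplified stand-in for what arclean does, not the tool's actual implementation:

```python
# Illustrative stand-in for the arclean cleaning step (step 2). The real
# tool performs fuller Arabic-specific cleaning; this only shows the
# flavour of normalisation involved.

def clean_arabic(text: str) -> str:
    """Drop tatweel (kashida) characters and collapse runs of whitespace."""
    text = text.replace("\u0640", "")  # U+0640 ARABIC TATWEEL
    return " ".join(text.split())

print(clean_arabic("مرحبــــا  بالعالم"))  # -> "مرحبا بالعالم"
```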

Arabic Data Transformation

Scripts are provided to process an entire dataset accordingly. The following arguments are shared across scripts:

  • The input data, which is compatible with Hugging Face datasets. When passing --train_file (& --validation_file/--test_file), these should be JSON/CSV files.
  • A --text_column argument, specifying the field to transform, ignoring the rest of the data.
  • An --output_path argument to specify in which directory the transformed data is persisted.
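The shared arguments above can be sketched with a small parser. The argument names match the README, but the parser itself and the example values are illustrative, not the repository's actual code:

```python
# Sketch of how the shared arguments fit together; illustrative only.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--train_file")
parser.add_argument("--validation_file")
parser.add_argument("--test_file")
parser.add_argument("--text_column", default="text",
                    help="field to transform; other fields are ignored")
parser.add_argument("--output_path", required=True,
                    help="directory where transformed data is persisted")

# Hypothetical invocation (paths are placeholders):
args = parser.parse_args([
    "--train_file=data/train.json",
    "--text_column=text",
    "--output_path=output/",
])
print(args.train_file, args.text_column, args.output_path)
```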

The following table provides a breakdown of the arguments to be specified for each dataset.

| Dataset            | $INPUT_DATA_ARGS                                                                                              | $TEXT_COLUMN |
|--------------------|---------------------------------------------------------------------------------------------------------------|--------------|
| Sentiment Analysis | --train_file="$DATA_PATH/sentiment_analysis/train.csv" --dataset_kwargs="{\"names\": [\"label\", \"text\"]}"   | "text"       |
| ANERCorp           | --train_file="$DATA_PATH/ANERcorp-CamelLabSplits/train.json"                                                    | "tokens"     |

These arguments are referred to as $DATASET_ARGS below, and have the following format: $INPUT_DATA_ARGS --text_column=$TEXT_COLUMN --output_path=$OUTPUT_PATH. Further details on how to transform the data with each method are given below:

MorphTx

To run the full system with diacritisation and morpheme-level mappings:

python rules/transliterate.py $DATASET_ARGS \
    --model_name="egy" --morphology_database="$CAMEL_TOOLS_PATH/data/morphology_db/calima-egy-c044.db"

CharTx

To run the simple system with only character-level mappings:

python rules/transliterate.py $DATASET_ARGS \
    --no_diacritisation
Buckwalter

To run Buckwalter (lower-cased):

python baseline/transliterate.py $DATASET_ARGS \
    --scheme=buckwalter
Uroman

To run Uroman:

python baseline/transliterate.py $DATASET_ARGS \
    --scheme=uroman
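To make the character-level idea concrete, here is a tiny sketch in the spirit of the CharTx system above. The mapping table is a made-up subset chosen for the example; the repository's actual rules are far more complete:

```python
# Illustrative character-level Arabic-to-Maltese-style mapping.
# The table below is a small, assumed subset, not the repository's rules.

CHAR_MAP = {
    "م": "m", "ر": "r", "ب": "b", "ت": "t",
    "ح": "ħ",  # Maltese ħ, illustrative choice for this character
    "ا": "a", "ل": "l", "س": "s",
}

def char_transliterate(text: str) -> str:
    """Map each character independently; unmapped characters pass through."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

print(char_transliterate("مرحبا"))  # -> "mrħba"
```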

Citation

This work was introduced in Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data. Cite as follows:

@inproceedings{micallef-etal-2025-maltification,
    title = "Data Augmentation for {M}altese {NLP} using Transliterated and Machine Translated {A}rabic Data",
    author = "Micallef, Kurt  and
      Habash, Nizar  and
      Borg, Claudia",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1177/",
    doi = "10.18653/v1/2025.findings-emnlp.1177",
    pages = "21580--21590",
    ISBN = "979-8-89176-335-7",
}
