MCIF is a comprehensive benchmark for evaluating multimodal, multilingual instruction-following systems, covering 3 modalities (text, speech, and video), 4 languages (English, German, Italian, and Chinese), and 13 tasks (organized into 4 macro-tasks).
A subset of MCIF has been used for the evaluation of the IWSLT 2025 Instruction-Following Shared Task.
- 2025.10.22: 🤗 MCIF test set is released on HuggingFace
- 2025.10.21: ⭐️ MCIF Evaluation first release
The evaluation is the core component of this repository. All other components (i.e., dataset construction and baseline inference) are included to ensure full reproducibility and transparency of the evaluation results.
For details on dataset generation or baseline models, please refer to the dedicated READMEs (baselines may require specific dependencies):
- 🧱 Dataset Construction — scripts and guidelines for creating test sets and references → dataset_build/README.md
- 🚀 Baselines — inference scripts and outputs for baseline systems → baselines/README.md
- 📊 Evaluation — scoring and comparison utilities for submitted outputs → README.md
The repository can be installed with `pip install -e .`.
For the evaluation, you can simply run:

```bash
mcif_eval -t {short/long} -l {en/de/it/zh} -s model_outputs.xml
```

where `model_outputs.xml` contains the outputs of your model for the selected track or context length (`short` or `long`) and target language among English (`en`), German (`de`), Italian (`it`), and Chinese (`zh`).
This will automatically download the references for the latest MCIF version from the Hugging Face repository. If you want to use a different version, specify it with `-v`.
To run the evaluation without internet access, first download the MCIF references and then provide them to `mcif_eval` with the `-r` parameter.
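As an illustration, the CLI can also be driven programmatically. The minimal Python sketch below scores the same system across all four target languages on the short track; it uses only the flags documented above, while the per-language output file names (`outputs_en.xml`, etc.) are assumptions made for this example.

```python
# Minimal sketch: batch-run mcif_eval over all target languages (short track).
# The output file naming scheme outputs_<lang>.xml is an assumption for this
# example, not an MCIF convention.
import subprocess

for lang in ["en", "de", "it", "zh"]:
    subprocess.run(
        ["mcif_eval", "-t", "short", "-l", lang, "-s", f"outputs_{lang}.xml"],
        check=True,
    )
```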
The file containing the model outputs to evaluate must be structured as follows:
```xml
<?xml version='1.0' encoding='utf-8'?>
<testset name="MCIF" type="output">
  <task track="{short/long}" text_lang="{en/de/it/zh}">
    <sample id="1">{SAMPLE1_CONTENT}</sample>
    <sample id="2">{SAMPLE2_CONTENT}</sample>
    ...
  </task>
</testset>
```

To ease usability, we provide a helper function that automatically formats model predictions into the XML structure required by the MCIF evaluation script. The method takes as input:
- `outputs`: a list of tuples `(sample_id, prediction)` containing the sample id and its related prediction;
- `lang`: the target language (`en`/`de`/`it`/`zh`);
- `track`: the context length or track (`short`/`long`);
- `output_file`: the path of the XML file to be created, containing all the system's outputs, ready for evaluation.
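If you prefer to build the file yourself, the sketch below shows how the same structure can be produced with Python's standard library. It is an illustration only: the function name `format_outputs_to_xml` is hypothetical and is not the helper shipped with the repository, although it takes the same inputs listed above.

```python
# Illustrative sketch only: a stand-in with the same inputs as the repository's
# helper, built on Python's standard library. Not the actual MCIF implementation.
import xml.etree.ElementTree as ET


def format_outputs_to_xml(outputs, lang, track, output_file):
    """Write (sample_id, prediction) pairs into the MCIF output XML layout."""
    testset = ET.Element("testset", name="MCIF", type="output")
    task = ET.SubElement(testset, "task", track=track, text_lang=lang)
    for sample_id, prediction in outputs:
        sample = ET.SubElement(task, "sample", id=str(sample_id))
        sample.text = prediction
    ET.ElementTree(testset).write(output_file, encoding="utf-8", xml_declaration=True)


# Example usage with dummy predictions:
format_outputs_to_xml(
    outputs=[(1, "First prediction."), (2, "Second prediction.")],
    lang="de",
    track="short",
    output_file="model_outputs.xml",
)
```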
MCIF is released under the Apache 2.0 License.
If you use MCIF in your research, please cite:
```bibtex
@misc{mcif,
      title={MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks},
      author={Sara Papi and Maike Züfle and Marco Gaido and Beatrice Savoldi and Danni Liu and Ioannis Douros and Luisa Bentivogli and Jan Niehues},
      year={2025},
      eprint={2507.19634},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.19634},
}
```