From ad1015f95b32834a7626ead97e400947fff7f984 Mon Sep 17 00:00:00 2001 From: weissenh <50957092+weissenh@users.noreply.github.com> Date: Tue, 2 Dec 2025 16:13:04 +0100 Subject: [PATCH 1/6] Fix instance of reordering bug: 2024.acl-long.796 Introduced by metadata corrections 2025-11-14 Note that the affiliations in XML don't always match with what is found on the PDF. --- data/xml/2024.acl.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index dbf7f28fe5..2ae41c8bb9 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -11132,9 +11132,9 @@ Rifki AfinaPutriKorea Advanced Institute of Science & Technology EmmanuelDaveBinus University JhonsonLeeTokopedia - NuurShadieqBinus University - WawanCenggoroInstitut Teknologi Bandung - Salsabil MaulanaAkbarUniversitas Telkom + NuurShadieqUniversitas Telkom + WawanCenggoroBinus University + Salsabil MaulanaAkbarInstitut Teknologi Bandung Muhammad IhzaMahendraUniversitas Telkom Dea AnnisayantiPutriUniversitas Indonesia BryanWilieHong Kong University of Science and Technology From 02f6a279b5232dd3692e9a1b2234fa90ea5c8bb9 Mon Sep 17 00:00:00 2001 From: weissenh <50957092+weissenh@users.noreply.github.com> Date: Tue, 2 Dec 2025 16:19:03 +0100 Subject: [PATCH 2/6] Fix instance of reordering bug: 2025.arabicnlp-main.26 Introduced by metadata corrections 2025-11-14 --- data/xml/2025.arabicnlp.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data/xml/2025.arabicnlp.xml b/data/xml/2025.arabicnlp.xml index a5a66272ff..972ac0b194 100644 --- a/data/xml/2025.arabicnlp.xml +++ b/data/xml/2025.arabicnlp.xml @@ -384,8 +384,8 @@ Mind the Gap: A Review of <fixed-case>A</fixed-case>rabic Post-Training Datasets and Their Limitations MohammedAlkhowaiterPrince Sattam bin Abdulaziz University - NorahAlshahraniUniversity of Bisha - SaiedAlshahraniASAS AI + NorahAlshahraniASAS AI + SaiedAlshahraniUniversity of Bisha Reem I.Masoud 
AlaaAlzahraniKing Salman Global Academy for Arabic
 DeemaAlnuhaitUniversity of Illinois at Urbana-Champaign
 From b9e2747113ff7a9e5d97106893e69f51dbe4fbb4 Mon Sep 17 00:00:00 2001 From: weissenh <50957092+weissenh@users.noreply.github.com> Date: Tue, 2 Dec 2025 16:30:11 +0100 Subject: [PATCH 3/6] Fix instance of reordering bug: 2025.arabicnlp-sharedtasks.133 Introduced by metadata corrections 2025-11-14 Pretty meaningless affiliation 'Institute', but fixed nonetheless. Noticed the last author's name was inconsistent between the PDF and metadata: fixed too --- data/xml/2025.arabicnlp.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/data/xml/2025.arabicnlp.xml b/data/xml/2025.arabicnlp.xml index 972ac0b194..848ffc87a4 100644 --- a/data/xml/2025.arabicnlp.xml +++ b/data/xml/2025.arabicnlp.xml @@ -2103,11 +2103,11 @@ Tokenizers United at <fixed-case>QIAS</fixed-case>-2025: <fixed-case>RAG</fixed-case>-Enhanced Question Answering for Islamic Studies by Integrating Semantic Retrieval with Generative Reasoning - MohamedSamyInstitute - MayarBoghdady + MohamedSamy + MayarBoghdadyInstitute MarwanEl Adawi MohamedNassar - Ensaf HusseinMohamed + EnsafHussein 960-965 2025.arabicnlp-sharedtasks.133 From 093594f32584968845f578fd58f43b5a2dd1fc95 Mon Sep 17 00:00:00 2001 From: weissenh <50957092+weissenh@users.noreply.github.com> Date: Tue, 2 Dec 2025 16:36:37 +0100 Subject: [PATCH 4/6] Fix instance of reordering bug: 2025.starsem-1.18 Introduced by metadata corrections 2025-11-14 Affiliation and ORCID put back to correct author --- data/xml/2025.starsem.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data/xml/2025.starsem.xml b/data/xml/2025.starsem.xml index 069f22eaac..efdc81ba36 100644 --- a/data/xml/2025.starsem.xml +++ b/data/xml/2025.starsem.xml @@ -222,10 +222,10 @@ Latent Traits and Cross-Task Transfer: Deconstructing Dataset Interactions in <fixed-case>LLM</fixed-case> Fine-tuning ShambhaviKrishnaUniversity of Massachusetts at Amherst - 
AtharvaNaikDepartment of Computer Science, University of Massachusetts at Amherst + AtharvaNaik ChaitaliAgarwal SudharshanGovindan - Haw-ShiuanChang + Haw-ShiuanChangDepartment of Computer Science, University of Massachusetts at Amherst TaesungLee 225-241 Large language models are increasingly deployed across diverse applications. This often includes tasks LLMs have not encountered during training.This implies that enumerating and obtaining the high-quality training data for all tasks is infeasible. Thus, we often need to rely on transfer learning using datasets with different characteristics, and anticipate out-of-distribution requests.Motivated by this practical need, we propose an analysis framework, building a transfer learning matrix and dimensionality reduction, to dissect these cross-task interactions.We train and analyze 10 models to identify latent abilities (e.g., Reasoning, Sentiment Classification, NLU, Arithmetic)and discover the side effects of the transfer learning.Our findings reveal that performance improvements often defy explanations based on surface-level dataset similarity or source data quality. Instead, hidden statistical factors of the source dataset, such as class distribution and generation length proclivities, alongside specific linguistic features, are actually more influential.This work offers insights into the complex dynamics of transfer learning, paving the way for more predictable and effective LLM adaptation. 
From 7e08d57854d96f39655a4ee63eb6d4b398cb092d Mon Sep 17 00:00:00 2001 From: weissenh <50957092+weissenh@users.noreply.github.com> Date: Wed, 3 Dec 2025 12:56:17 +0100 Subject: [PATCH 5/6] Fix instance of reordering bug: 2025.wmt-1.85 Introduced by metadata corrections 2025-11-14 Affiliations put back to correct author --- data/xml/2025.wmt.xml | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/data/xml/2025.wmt.xml b/data/xml/2025.wmt.xml index 2c57767ca0..b32ee6c58b 100644 --- a/data/xml/2025.wmt.xml +++ b/data/xml/2025.wmt.xml @@ -1177,20 +1177,20 @@ DineshTewariGoogle Baba MamadiDianeNKO USA INC DjibrilaDianeNKO USA INC - Solo FarabadoCisséStanford University - Koulako MoussaDoumbouyaNKO USA INC + Solo FarabadoCisséNKO USA INC + Koulako MoussaDoumbouyaStanford University EdoardoFerranteConseggio pe-o patrimonio linguistico ligure AlessandroGuasoniConseggio pe-o patrimonio linguistico ligure - ChristopherHomanPaair Institute - Mamadou K.KeitaNIT, Arunachal Pradesh - SudhamoyDebBarmatyvan.ru - AliKuzhugetStanford University - DavidAnugrahaUniversitas Indonesia - Muhammad RaviShulthan HabibiUniversity of Zurich - SinaAhmadiGoogle - AnthonyMunthaliGoogle - Jonathan MingfeiLiu - JonathanEng + ChristopherHoman + Mamadou K.KeitaPaair Institute + SudhamoyDebBarmaNIT, Arunachal Pradesh + AliKuzhugettyvan.ru + DavidAnugrahaStanford University + Muhammad RaviShulthan HabibiUniversitas Indonesia + SinaAhmadiUniversity of Zurich + AnthonyMunthali + Jonathan MingfeiLiuGoogle + JonathanEngGoogle 1103-1123 We open-source SMOL (Set of Maximal Over-all Leverage), a suite of training data to un-lock machine translation for low-resource languages (LRLs). SMOL has been translated into123 under-resourced languages (125 language pairs), including many for which there exist no previous public resources, for a total of 6.1M translated tokens. 
SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOLSENT, a set of sentences chosen for broad unique token coverage, and SMOLDOC, a document-level source focusing on a broad topic coverage. They join the already released GATITOS for a trifecta of paragraph, sentence, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust chrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOLDOC, yielding the first factuality datasets for most of these languages. 2025.wmt-1.85 From b1ce84b8e9ad7cb3be68edbffebed3f38f4c5be7 Mon Sep 17 00:00:00 2001 From: weissenh <50957092+weissenh@users.noreply.github.com> Date: Wed, 3 Dec 2025 13:04:29 +0100 Subject: [PATCH 6/6] Fix instance of reordering bug: 2025.emnlp-main.1435 Introduced by metadata corrections 2025-11-14 Affiliations and ORCID put back to correct author --- data/xml/2025.emnlp.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data/xml/2025.emnlp.xml b/data/xml/2025.emnlp.xml index 111a52800d..2292e504cc 100644 --- a/data/xml/2025.emnlp.xml +++ b/data/xml/2025.emnlp.xml @@ -21230,8 +21230,8 @@ AlokaFernandoUniversity of Moratuwa Nisansade SilvaUniversity of Moratuwa MenanVelayuthanUniversity of Moratuwa - CharithaRathnayakeMassey University - SurangikaRanathungaUniversity of Moratuwa + CharithaRathnayakeUniversity of Moratuwa + SurangikaRanathungaMassey University 28252-28269 Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs) is the most common PDC technique. 
However, previous research has shown that the choice of the multiPLM significantly impacts the quality of the filtered parallel corpus, and the Neural Machine Translation (NMT) models trained using such data show a disparity across multiPLMs. This paper shows that this disparity is due to different multiPLMs being biased towards certain types of sentence pairs, which are treated as noise from an NMT point of view. We show that such noisy parallel sentences can be removed to a certain extent by employing a series of heuristics. The NMT models, trained using the curated corpus, lead to producing better results while minimizing the disparity across multiPLMs. We publicly release the source code and the curated datasets 2025.emnlp-main.1435