Different scores from different COMET package versions 1.1.2 and 2.2.1 #203

Open
PinzhenChen opened this issue Feb 27, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@PinzhenChen

PinzhenChen commented Feb 27, 2024

🐛 Bug

When the same source, target, and reference files are evaluated with the same wmt22-comet-da checkpoint, unbabel-comet 2.2.1 under Python 3.9 and unbabel-comet 1.1.2 under Python 3.7 give dramatically different scores.

To Reproduce

In Python 3.7, pip install --upgrade unbabel-comet installs 1.1.2 as the latest version, while in Python 3.9 it installs 2.2.1.

Scoring the same source, target, and reference files in these two environments gives different scores: unbabel-comet 1.1.2 gives 0.86, while 2.2.1 gives 0.79. I used the WMT22-COMET-DA checkpoint downloaded from Hugging Face: https://huggingface.co/Unbabel/wmt22-comet-da.

I'm attaching the files that gave 0.79 and 0.86 below, but I think any file combination should reproduce this behaviour, since it appears to depend on the COMET package version.
target.en.txt
source.mt.txt
hypothesis.en.txt
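
To rule out the input side when comparing the two environments, the files can be loaded identically in both. The following is a minimal sketch, assuming the three files are line-aligned plain text; `load_samples` is a hypothetical helper and the paths passed to it are placeholders. The `src`/`mt`/`ref` keys follow the sample format shown in the COMET README. The scoring call itself is left out so the sketch runs without downloading the checkpoint.

```python
from pathlib import Path


def load_samples(src_path, mt_path, ref_path):
    """Read three line-aligned text files into a list of COMET-style samples.

    Hypothetical helper for illustration; not part of unbabel-comet.
    """
    columns = {}
    for key, path in (("src", src_path), ("mt", mt_path), ("ref", ref_path)):
        columns[key] = Path(path).read_text(encoding="utf-8").splitlines()

    # Refuse to continue if the files do not have the same number of lines,
    # since a silent misalignment would also change the score.
    lengths = {key: len(lines) for key, lines in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Files are not line-aligned: {lengths}")

    return [
        {"src": s, "mt": m, "ref": r}
        for s, m, r in zip(columns["src"], columns["mt"], columns["ref"])
    ]
```

Passing the same list to the model in each environment makes it easier to attribute any remaining difference to the package version rather than to file handling.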

Expected behaviour

I would expect different COMET package versions to give the same score if the same checkpoint and files are given.

Environment

Python 3.7 and Python 3.9 environments, both managed with conda.

Additional context

If there is indeed some mismatch between unbabel-comet 1.1.2 and 2.2.1, it might be difficult to go back and fix the problem. Users are probably unaware of this and will not update. Moreover, Python 3.7 only supports 1.1.2 as the latest version, even if users upgrade COMET in Python 3.7. Maybe this behaviour could be highlighted in the README to encourage users to use specific Python and unbabel-comet versions. On the other hand, this could imply that research papers should report the COMET package version in addition to the COMET version. Would it be possible to implement some kind of COMET signature, just like the one in sacrebleu?
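
As a rough illustration of what such a signature could record, here is a minimal sketch in the `key:value|key:value` style that sacrebleu uses. It is not an existing COMET feature: `installed_version`, `format_signature`, and `comet_signature` are hypothetical helpers, and the choice of fields (including `torch` and `transformers`) is an assumption about what might affect scores.

```python
import platform

try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    import importlib_metadata as metadata


def installed_version(package: str) -> str:
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not-installed"


def format_signature(fields: dict) -> str:
    """Join key:value pairs with '|', in the style of sacrebleu signatures."""
    return "|".join(f"{key}:{value}" for key, value in fields.items())


def comet_signature(checkpoint: str) -> str:
    """Hypothetical helper: describe the environment that produced a score."""
    fields = {
        "c": checkpoint,  # checkpoint name, supplied by the caller
        "version": installed_version("unbabel-comet"),
        "python": platform.python_version(),
        "torch": installed_version("torch"),
        "transformers": installed_version("transformers"),
    }
    return format_signature(fields)


if __name__ == "__main__":
    print(comet_signature("Unbabel/wmt22-comet-da"))
```

Printing this string next to a reported score would make the Python 3.7 / 1.1.2 versus Python 3.9 / 2.2.1 difference visible at a glance.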

@BramVanroy
Contributor

BramVanroy commented Mar 2, 2024

This confirms what we learnt for BLEU, too: one should ALWAYS report version numbers (signatures), also for COMET!

Side note: in my MATEO, I added a custom signature for neural metrics like bertscore, bleurt and comet, too. For COMET it looks like this (inspired by sacrebleu):

comet: nrefs:1|bs:1000|seed:12345|c:Unbabel/wmt22-comet-da|version:2.0.1|mateo:1.1.3

where c stands for the checkpoint used and version is self-explanatory. I wasn't sure how far to go with this, because differences in torch, CUDA, and transformers versions may or may not also lead to different results. Hell, even then, CUDA optimisations might lead to different results on different hardware.
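
A signature in this format is also easy to compare mechanically. The sketch below parses two signature strings and reports the fields that differ; `parse_signature` and `diff_signatures` are hypothetical helpers, not part of MATEO or COMET, and the second signature in the usage example is made up for illustration by swapping in the 1.1.2 version from this issue.

```python
def parse_signature(signature: str) -> dict:
    """Split 'name: k1:v1|k2:v2' into {'k1': 'v1', 'k2': 'v2'}."""
    # Drop the leading metric name ("comet: ") if it is present.
    _, sep, rest = signature.partition(": ")
    body = rest if sep else signature
    fields = {}
    for item in body.split("|"):
        # Split on the first ':' only, so values may contain '/' or '.'.
        key, _, value = item.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def diff_signatures(first: str, second: str) -> dict:
    """Return {field: (value_in_first, value_in_second)} for differing fields."""
    a, b = parse_signature(first), parse_signature(second)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    reported = "comet: nrefs:1|bs:1000|seed:12345|c:Unbabel/wmt22-comet-da|version:2.0.1|mateo:1.1.3"
    # Illustrative variant only: same run settings, different package version.
    other = "comet: nrefs:1|bs:1000|seed:12345|c:Unbabel/wmt22-comet-da|version:1.1.2|mateo:1.1.3"
    print(diff_signatures(reported, other))
```

Fields missing from one signature show up with `None` on that side, so extra entries such as torch or transformers versions could be added later without breaking the comparison.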

@PinzhenChen
Author

Admittedly, the README currently says it requires Python 3.8, so maybe I installed COMET in the stone age and pip install --upgrade unbabel-comet never warned me. Anyway, I think the score mismatch should not be expected.

Your signature is very thoughtful!
