Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if all docs have domain attribute #267

Merged
merged 5 commits into from
Jul 23, 2024

Conversation

jkawamoto
Copy link
Contributor

I looked through the newest CI errors and could reproduce them locally. The failure of test_dataset::test_process_to_text occurs when wmt22 is chosen (since this test randomly selects 10 datasets, the failure doesn’t always happen). The failure is because the domain files are shorter than the other files in some language pairs. For example, ru-en has this problem:

$ pwd
~/.sacrebleu/wmt22

$ wc -l wmt22.ru-en.domain wmt22.ru-en.src wmt22.ru-en.ALMAnaCH-Inria
    1512 wmt22.ru-en.domain
    2016 wmt22.ru-en.src
    2016 wmt22.ru-en.ALMAnaCH-Inria
    5544 total

Then, I checked the original file at https://github.com/wmt-conference/wmt22-news-systems/blob/main/xml/wmttest2022.ru-en.all.xml, and found that not all doc elements have the domain attribute.

_unwrap_wmt21_or_later should check if the length of domains matches the lengths of the other data.

@martinpopel
Copy link
Collaborator

Thanks for finding the source of the CI errors. I agree it is our priority to fix them, so we should perhaps merge this as soon as possible.
That said, I think I have fixed this already in March in a more general way.
Unfortunately, I didn't have time to polish it into a PR and since then @mjpost pushed another solution (not so general) and I have forgotten all the details.
I am still too busy to look at this, but if there is a volunteer, please try to merge (and review) https://github.com/martinpopel/sacreBLEU/tree/xml-domains

@martinpopel martinpopel requested a review from mjpost June 22, 2024 22:18
@jkawamoto
Copy link
Contributor Author

Thank you for the detailed explanation. I will review the branch you have been working on.

…e pairs

When omitting `-l`, `--list` will still print all the language pairs for that test set.

Motivation:
Originally, `--list` showed just the list of language pairs, so there was no reason
to call it with `-l`, but now it lists all the **fields** for a given language pair
and it is relatively slow (it has to parse the XML files), so it makes sense
to restrict the listing to a single language pair only.
@martinpopel
Copy link
Collaborator

Thank you for merging my commits. Please, fix also the failing test, i.e. add domain=ALL to the expected output.

@jkawamoto jkawamoto marked this pull request as draft July 17, 2024 17:27
@jkawamoto
Copy link
Contributor Author

I'm trying to fix the errors, but it might take some time.

@martinpopel
Copy link
Collaborator

I would hope it is a matter of adding domain=ALL to the two lines here and here (or alternatively prevent the printing). Unfortunately, I am not capable to try it now.

@jkawamoto
Copy link
Contributor Author

Thank you for your advice. I was able to pass the tests by adding domain=ALL to the four lines and removing a couple of lines. I’m not sure if removing those lines is acceptable. I’ll look for another way to solve the CI failures.

@jkawamoto jkawamoto marked this pull request as ready for review July 23, 2024 20:41
@jkawamoto
Copy link
Contributor Author

The CI errors are fixed. Could you please review this PR?

@martinpopel martinpopel merged commit dc301a5 into mjpost:master Jul 23, 2024
19 checks passed
@martinpopel
Copy link
Collaborator

Thank you very much, once again. The last commits were exactly the missing pieces which prevented me to finish the PR in March. I am happy I could merge it into the master now.

@jkawamoto jkawamoto deleted the fix-domain-issue branch July 23, 2024 21:16
@jkawamoto
Copy link
Contributor Author

My pleasure. I’m also glad that this PR was finally merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants