Check if all docs have domain attribute #267
Thanks for finding the source of the CI errors. I agree it is our priority to fix them, so we should perhaps merge this as soon as possible.
Thank you for the detailed explanation. I will review the branch you have been working on.
…e pairs
When omitting `-l`, `--list` will still print all the language pairs for that test set.
Motivation: Originally, `--list` showed just the list of language pairs, so there was no reason to call it with `-l`, but now it lists all the **fields** for a given language pair and it is relatively slow (it has to parse the XML files), so it makes sense to restrict the listing to a single language pair only.
7d1ca89 to 163b594
Thank you for merging my commits. Please also fix the failing test, i.e. add
I'm trying to fix the errors, but it might take some time.
Thank you for your advice. I was able to pass the tests by adding
e749ff8 to 98dbe42
98dbe42 to 2a4cdde
The CI errors are fixed. Could you please review this PR?
Thank you very much, once again. The last commits were exactly the missing pieces that prevented me from finishing the PR in March. I am happy I could merge it into master now.
My pleasure. I’m also glad that this PR was finally merged.
I looked through the newest CI errors and could reproduce them locally. The failure of `test_dataset::test_process_to_text` occurs when `wmt22` is chosen (since this test randomly selects 10 datasets, the failure doesn’t always happen). The failure is because the domain files are shorter than the other files in some language pairs. For example, `ru-en` has this problem:

Then, I checked the original file at https://github.com/wmt-conference/wmt22-news-systems/blob/main/xml/wmttest2022.ru-en.all.xml, and found that not all `doc` elements have the `domain` attribute. `_unwrap_wmt21_or_later` should check if the length of domains matches the lengths of the other data.
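
To illustrate the kind of check I have in mind, here is a minimal sketch. It is not the actual `_unwrap_wmt21_or_later` code: it uses the standard-library `xml.etree.ElementTree`, the helper names (`collect_segments_and_domains`, `docs_missing_domain`, `check_same_length`) and the tiny `SAMPLE_XML` are made up for the example, and it assumes segments live in `seg` elements under each `doc`. Using an empty string for a missing `domain` is also just one possible choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature test set: the second doc has no domain attribute.
SAMPLE_XML = """\
<dataset id="example">
  <doc id="doc1" origlang="ru" domain="news">
    <src lang="ru"><p><seg id="1">a</seg><seg id="2">b</seg></p></src>
  </doc>
  <doc id="doc2" origlang="ru">
    <src lang="ru"><p><seg id="1">c</seg></p></src>
  </doc>
</dataset>
"""


def collect_segments_and_domains(xml_text):
    """Return (segments, domains) with exactly one domain entry per segment."""
    root = ET.fromstring(xml_text)
    segments, domains = [], []
    for doc in root.iter("doc"):
        # Assumption: fall back to an empty string when the attribute is absent,
        # so the domain list never ends up shorter than the segment list.
        domain = doc.get("domain", "")
        for seg in doc.iter("seg"):
            segments.append(seg.text or "")
            domains.append(domain)
    return segments, domains


def docs_missing_domain(xml_text):
    """Return the ids of all doc elements that lack a domain attribute."""
    root = ET.fromstring(xml_text)
    return [doc.get("id") for doc in root.iter("doc") if "domain" not in doc.attrib]


def check_same_length(fields):
    """Raise ValueError if the named fields do not all have the same length."""
    lengths = {name: len(lines) for name, lines in fields.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Field lengths differ: {lengths}")
```

With something like `check_same_length({"src": segments, "domain": domains})` called after parsing, a mismatch would be reported when the files are built rather than showing up only when the random test happens to pick the affected test set.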