Commit 79e38a7

Add WMT 2024 test sets (#276)
1 parent c89d62e commit 79e38a7

3 files changed: +95 -63 lines changed

CHANGELOG.md

+74-62
@@ -1,35 +1,47 @@
 # Release Notes
 
-- 2.4.2 (2024-04-12)
+## 2.5.0 (2025-01-03)
+
+### Added
+
+- WMT24 test sets
+
+### Fixed
+
+- Convert Changelog to markdown format
+- Add optimization for compute_bleu precision initialization (#257)
+Thanks to Ernests Lavrinovics for this contribution.
+
+## 2.4.2 (2024-04-12)
 Added:
 - Add printing of domain if present (via --echo)
 
-- 2.4.1 (2024-03-12)
+## 2.4.1 (2024-03-12)
 Fixed:
 - Add exports to package __init__.py
 
-- 2.4.0 (2023-12-11)
+## 2.4.0 (2023-12-11)
 Added:
 - WMT23 test sets (test set `wmt23`)
 
-- 2.3.3 (2023-11-28)
+## 2.3.3 (2023-11-28)
 Fixed:
 - Typing issues (#249, #250)
 - Improved builds (#252)
 
-- 2.3.2 (2023-11-06)
+## 2.3.2 (2023-11-06)
 Fixed:
 - Special treatment of empty references in TER (#232)
 - Bump in mecab version for JA (#234)
 
 Added:
 - Warning if `-tok spm` is used (use explicit `flores101` instead) (#238)
 
-- 2.3.1 (2022-10-18)
+## 2.3.1 (2022-10-18)
 Bugfix:
 - Set lru_cache to 2^16 for SPM tokenizer (was set to infinite)
 
-- 2.3.0 (2022-10-18)
+## 2.3.0 (2022-10-18)
 Features:
 - (#203) Added `-tok flores101` and `-tok flores200`, a.k.a. `spbleu`.
 These are multilingual tokenizations that make use of the
@@ -44,10 +56,10 @@
 - System outputs: include with wmt22. Also added wmt21/systems which will produce WMT21 submitted systems.
 To see available systems, give a dummy system to `--echo`, e.g., `sacrebleu -t wmt22 -l en-de --echo ?`
 
-- 2.2.1 (2022-09-13)
+## 2.2.1 (2022-09-13)
 Bugfix: Standard usage was returning (and using) each reference twice.
 
-- 2.2.0 (2022-07-25)
+## 2.2.0 (2022-07-25)
 Features:
 - Added WMT21 datasets (thanks to @BrighXiaoHan)
 - `--echo` now exposes document metadata where available (e.g., docid, genre, origlang)
@@ -65,7 +77,7 @@
 Many thanks to @BrightXiaoHan (https://github.com/BrightXiaoHan) for the bulk of
 the code contributions in this release.
 
-- 2.1.0 (2022-05-19)
+## 2.1.0 (2022-05-19)
 Features:
 - Added `-tok spm` for multilingual SPM tokenization (#168)
 (thanks to Naman Goyal and James Cross at Facebook)
@@ -75,7 +87,7 @@
 - Bugfix: BLEU.corpus_score() now using max_ngram_order (#173)
 - Upgraded ja-mecab to 1.0.5 (#196)
 
-- 2.0.0 (2021-07-18)
+## 2.0.0 (2021-07-18)
 - Build: Add Windows and OS X testing to Travis CI.
 - Improve documentation and type annotations.
 - Drop `Python < 3.6` support and migrate to f-strings.
@@ -137,11 +149,11 @@
 as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
 randomization tests (`--paired-ar`) when evaluating multiple systems (#40 and #78).
 
-- 1.5.1 (2021-03-05)
+## 1.5.1 (2021-03-05)
 - Fix extraction error for WMT18 extra test sets (test-ts) (#142)
 - Validation and test datasets are added for multilingual TEDx
 
-- 1.5.0 (2021-01-15)
+## 1.5.0 (2021-01-15)
 - Fix an assertion error in chrF (#121)
 - Add missing `__repr__()` methods for BLEU and TER
 - TER: Fix exception when `--short` is used (#131)
@@ -155,7 +167,7 @@
 - Allow variable number of references for BLEU (only via API) (#130).
 Thanks to Ondrej Dusek (@tuetschek)
 
-- 1.4.14 (2020-09-13)
+## 1.4.14 (2020-09-13)
 - Added character-based tokenization (`-tok char`).
 Thanks to Christian Federmann.
 - Added TER (`-m ter`). Thanks to Ales Tamchyna! (fixes #90)
@@ -166,18 +178,18 @@
 - wmt20/robust/set2 (en-ja, ja-en)
 - wmt20/robust/set3 (de-en)
 
-- 1.4.13 (2020-07-30)
+## 1.4.13 (2020-07-30)
 - Added WMT20 newstest test sets (#103)
 - Make mecab3-python an extra dependency, adapt code to new mecab3-python
 This fixes the recent Windows installation issues as well (#104)
 Japanese support should now be explicitly installed through sacrebleu[ja] package.
 - Fix return type annotation of corpus_bleu()
 - Improve sentence_score's documentation, do not allow single ref string (#98)
 
-- 1.4.12 (2020-07-03)
+## 1.4.12 (2020-07-03)
 - Fix a deployment bug (#96)
 
-- 1.4.11 (2020-07-03)
+## 1.4.11 (2020-07-03)
 - Added Multi30k multimodal MT test set metadata
 - Refactored all tokenizers into respective classes (fixes #85)
 - Refactored all metrics into respective classes
@@ -193,21 +205,21 @@
 - Added score regression tests for chrF using reference chrF++ implementation
 - Added multi-reference & tokenizer & signature tests
 
-- 1.4.10 (2020-05-30)
+## 1.4.10 (2020-05-30)
 - Fixed bug in signature with mecab tokenizer
 - Cleaned up deprecation warnings (thanks to Karthikeyan Singaravelan @tirkarthi)
 - Now only lists the external [typing](https://pypi.org/project/typing/)
 module as a dependency for Python `<= 3.4`, as it was integrated in the standard
 library in Python 3.5 (thanks to Erwan de Lépinau @ErwanDL).
 - Added LICENSE to pypi (thanks to Mark Harfouche @hmaarrfk)
 
-- 1.4.9 (2020-04-30)
+## 1.4.9 (2020-04-30)
 - Changed `get_available_testsets()` to return a list
 - Remove Japanese MeCab tokenizer from requirements.
 (Must be installed manually to avoid Windows incompatibility).
 Many thanks to Makoto Morishita (@MorinoseiMorizo).
 
-- 1.4.8 (2020-04-26)
+## 1.4.8 (2020-04-26)
 - Added to API:
 - get_source_file()
 - get_reference_files()
@@ -217,21 +229,21 @@
 - Fixed descriptions of some WMT19/google test sets
 - Added API test case (test/test_apy.py)
 
-- 1.4.7 (2020-04-19)
+## 1.4.7 (2020-04-19)
 - Added Google's extra wmt19/en-de refs (-t wmt19/google/{ar,arp,hqall,hqp,hqr,wmtp})
 (Freitag, Grangier, & Caswell
 BLEU might be Guilty but References are not Innocent
 https://arxiv.org/abs/2004.06063)
 - Restored SACREBLEU_DIR and smart_open to exports (thanks to Thomas Liao @tholiao)
 
-- 1.4.6 (2020-03-28)
+## 1.4.6 (2020-03-28)
 - Large internal reorganization as a module (thanks to Thamme Gowda @thammegowda)
 
-- 1.4.5 (2020-03-28)
+## 1.4.5 (2020-03-28)
 - Added Japanese MeCab tokenizer (`-tok ja-mecab`) (thanks to Makoto Morishita @MorinoseiMorizo)
 - Added wmt20/dev test sets (thanks to Martin Popel @martinpopel)
 
-- 1.4.4 (2020-03-10)
+## 1.4.4 (2020-03-10)
 - Smoothing changes (Sebastian Nickels @sn1c)
 - Fixed bug that only applied smoothing to n-grams for n > 2
 - Added default smoothing values for methods "floor" (0) and "add-k" (1)
@@ -240,20 +252,20 @@
 - added missing languages for IWSLT17
 - Minor code improvements (Thomas Liao @tholiao)
 
-- 1.4.3 (2019-12-02)
+## 1.4.3 (2019-12-02)
 - Bugfix: handling of result object for CHRF
 - Improved API example
 
-- 1.4.2 (2019-10-11)
+## 1.4.2 (2019-10-11)
 - Tokenization variant omitted from the chrF signature; it is relevant only for BLEU (thanks to Martin Popel)
 - Bugfix: call to sentence_bleu (thanks to Rachel Bawden)
 - Documentation example for Python API (thanks to Vlad Lyalin)
 - Calls to corpus_chrf and sentence_chrf now return a an object instead of a float (use result.score)
 
-- 1.4.1 (2019-09-11)
+## 1.4.1 (2019-09-11)
 - Added sentence-level scoring via -sl (--sentence-level)
 
-- 1.4.0 (2019-09-10)
+## 1.4.0 (2019-09-10)
 - Many thanks to Martin Popel for all the changes below!
 - Added evaluation on concatenated test sets (e.g., `-t wmt17,wmt18`).
 Works as long as they all have the same language pair.
@@ -269,102 +281,102 @@
 - Documentation and tests updates
 - Fixed a race condition bug (`os.makedirs(outdir, exist_ok=True)` instead of `if os.path.exists`)
 
-- 1.3.7 (2019-07-12)
+## 1.3.7 (2019-07-12)
 - Lazy loading of regexes cuts import time from ~1s to nearly nothing (thanks, @louismartin!)
 - Added a simple (non-atomic) lock on downloading
 - Can now read multiple refs from a single tab-delimited file.
 You need to pass `--num-refs N` to tell it to run the split.
 Only works with a single reference file passed from the command line.
 
-- 1.3.6 (2019-06-10)
+## 1.3.6 (2019-06-10)
 - Removed another f-string for Python 3.5 compatibility
 
-- 1.3.5 (2019-06-07)
+## 1.3.5 (2019-06-07)
 - Restored Python 3.5 compatibility
 
-- 1.3.4 (2019-05-28)
+## 1.3.4 (2019-05-28)
 - Added MTNT 2019 test sets
 - Added a BLEU object
 
-- 1.3.3 (2019-05-08)
+## 1.3.3 (2019-05-08)
 - Added WMT'19 test sets
 
-- 1.3.2 (2018-04-24)
+## 1.3.2 (2018-04-24)
 - Bugfix in test case (thanks to Adam Roberts, @adarob)
 - Passing smoothing method through `sentence_bleu`
 
-- 1.3.1 (2019-03-20)
+## 1.3.1 (2019-03-20)
 - Added another smoothing approach (add-k) and a command-line option for choosing the smoothing method
 (`--smooth exp|floor|add-n|none`) and the associated value (`--smooth-value`), when relevant.
 - Changed interface to some functions (backwards incompatible)
 - 'smooth' is now 'smooth_method'
 - 'smooth_floor' is now 'smooth_value'
 
-- 1.2.21 (19 March 2019)
+## 1.2.21 (19 March 2019)
 - Ctrl-M characters are now treated as normal characters, previously treated as newline.
 
-- 1.2.20 (28 February 2018)
+## 1.2.20 (28 February 2018)
 - Tokenization now defaults to "zh" when language pair is known
 
-- 1.2.19 (19 February 2019)
+## 1.2.19 (19 February 2019)
 - Updated checksum for wmt19/dev (seems to have changed)
 
-- 1.2.18 (19 February 2019)
+## 1.2.18 (19 February 2019)
 - Fixed checksum for wmt17/dev (copy-paste error)
 
-- 1.2.17 (6 February 2019)
+## 1.2.17 (6 February 2019)
 - Added kk-en and en-kk to wmt19/dev
 
-- 1.2.16 (4 February 2019)
+## 1.2.16 (4 February 2019)
 - Added gu-en and en-gu to wmt19/dev
 
-- 1.2.15 (30 January 2019)
+## 1.2.15 (30 January 2019)
 - Added MD5 checksumming of downloaded files for all datasets.
 
-- 1.2.14 (22 January 2019)
+## 1.2.14 (22 January 2019)
 - Added mtnt1.1/train mtnt1.1/valid mtnt1.1/test data from [MTNT](http://www.cs.cmu.edu/~pmichel1/mtnt/)
 
-- 1.2.13 (22 January 2019)
+## 1.2.13 (22 January 2019)
 - Added 'wmt19/dev' task for 'lt-en' and 'en-lt' (development data for new tasks).
 - Added MD5 checksum for downloaded tarballs.
 
-- 1.2.12 (8 November 2018)
+## 1.2.12 (8 November 2018)
 - Now outputs only only digit after the decimal
 
-- 1.2.11 (29 August 2018)
+## 1.2.11 (29 August 2018)
 - Added a function for sentence-level, smoothed BLEU
 
-- 1.2.10 (23 May 2018)
+## 1.2.10 (23 May 2018)
 - Added wmt18 test set (with references)
 
-- 1.2.9 (15 May 2018)
+## 1.2.9 (15 May 2018)
 - Added zh-en, en-zh, tr-en, and en-tr datasets for wmt18/test-ts
 
-- 1.2.8 (14 May 2018)
+## 1.2.8 (14 May 2018)
 - Added wmt18/test-ts, the test sources (only) for [WMT18](http://statmt.org/wmt18/translation-task.html)
 - Moved README out of `sacrebleu.py` and the CHANGELOG into a separate file
 
-- 1.2.7 (10 April 2018)
+## 1.2.7 (10 April 2018)
 - fixed another locale issue (with --echo)
 - grudgingly enabled `-tok none` from the command line
 
-- 1.2.6 (22 March 2018)
+## 1.2.6 (22 March 2018)
 - added wmt17/ms (Microsoft's [additional ZH-EN references](https://github.com/MicrosoftTranslator/Translator-HumanParityData)).
 Try `sacrebleu -t wmt17/ms --cite`.
 - `--echo ref` now pastes together all references, if there is more than one
 
-- 1.2.5 (13 March 2018)
+## 1.2.5 (13 March 2018)
 - added wmt18/dev datasets (en-et and et-en)
 - fixed logic with --force
 - locale-independent installation
 - added "--echo both" (tab-delimited)
 
-- 1.2.3 (28 January 2018)
+## 1.2.3 (28 January 2018)
 - metrics (`-m`) are now printed in the order requested
 - chrF now prints a version string (including the beta parameter, importantly)
 - attempt to remove dependence on locale setting
 
-- 1.2 (17 January 2018)
+## 1.2 (17 January 2018)
 - added the chrF metric (`-m chrf` or `-m bleu chrf` for both)
 See 'CHRF: character n-gram F-score for automatic MT evaluation' by Maja Popovic (WMT 2015)
 [http://www.statmt.org/wmt15/pdf/WMT49.pdf]
@@ -374,26 +386,26 @@
 - added `--input` (`-i`) to set input to a file instead of STDIN
 - removed accent mark after objection from UN official
 
-- 1.1.7 (27 November 2017)
+## 1.1.7 (27 November 2017)
 - corpus_bleu() now raises an exception if input streams are different lengths
 - thanks to Martin Popel for:
 - small bugfix in tokenization_13a (not affecting WMT references)
 - adding `--tok intl` (international tokenization)
 - added wmt17/dev and wmt17/dev sets (for languages intro'd those years)
 
-- 1.1.6 (15 November 2017)
+## 1.1.6 (15 November 2017)
 - bugfix for tokenization warning
 
-- 1.1.5 (12 November 2017)
+## 1.1.5 (12 November 2017)
 - added -b option (only output the BLEU score)
 - removed fi-en from list of WMT16/17 systems with more than one reference
 - added WMT16/tworefs and WMT17/tworefs for scoring with both en-fi references
 
-- 1.1.4 (10 November 2017)
+## 1.1.4 (10 November 2017)
 - added effective order for sentence-level BLEU computation
 - added unit tests from sockeye
 
-- 1.1.3 (8 November 2017).
+## 1.1.3 (8 November 2017).
 - Factored code a bit to facilitate API:
 - compute_bleu: works from raw stats
 - corpus_bleu for use from the command line
@@ -402,17 +414,17 @@
 - Added 'floor' smoothing (adds 0.01 to 0 counts, more versatile via API), 'none' smoothing (via API)
 - Small bugfixes, windows compatibility (H/T Christian Federmann)
 
-- 1.0.3 (4 November 2017).
+## 1.0.3 (4 November 2017).
 - Contributions from Christian Federmann:
 - Added explicit support for encoding
 - Fixed Windows support
 - Bugfix in handling reference length with multiple refs
 
-- version 1.0.1 (1 November 2017).
+## version 1.0.1 (1 November 2017).
 - Small bugfix affecting some versions of Python.
 - Code reformatting due to Ozan Çağlayan.
 
-- version 1.0 (23 October 2017).
+## version 1.0 (23 October 2017).
 - Support for WMT 2008--2017.
 - Single tokenization (v13a) with lowercase fix (proper lower() instead of just A-Z).
 - Chinese tokenization.
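
Nearly every changed line in this file is the same mechanical edit: a release bullet such as `- 2.4.2 (2024-04-12)` becomes the heading `## 2.4.2 (2024-04-12)`. The commit does not say how the conversion was done, so the following is only a minimal sketch of one way to reproduce it. The function name `convert_headings`, the regular expression, and the assumption that release bullets start at column 0 with a version number are all illustrative and are not taken from the repository.

```python
import re

# Old-style release headings look like "- 2.4.2 (2024-04-12)" or
# "- version 1.0 (23 October 2017)."; ordinary bullets such as
# "- (#203) Added ..." do not start with a version number.
# ASSUMPTION: release bullets start at column 0 (hypothetical layout).
OLD_HEADING = re.compile(r"^- (?=(?:version )?\d+(?:\.\d+)* \()")


def convert_headings(text: str) -> str:
    """Turn column-0 release bullets into level-2 Markdown headings."""
    return "\n".join(OLD_HEADING.sub("## ", line) for line in text.split("\n"))


sample = "\n".join([
    "- 2.4.2 (2024-04-12)",
    "  Added:",
    "  - Add printing of domain if present (via --echo)",
    "",
    "- version 1.0 (23 October 2017).",
    "  - Chinese tokenization.",
])
print(convert_headings(sample))
```

Only the leading `- ` is rewritten, so trailing periods and the `version` prefix on the oldest entries survive unchanged, which matches what the diff shows.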

mypy.ini

+1-1
@@ -1,5 +1,5 @@
 [mypy]
-python_version = 3.6
+python_version = 3.12
 
 [mypy-portalocker.*]
 ignore_missing_imports = True
