This repository was archived by the owner on Feb 17, 2024. It is now read-only.

How much data was used to train the mT5 tokenizer? #117

@kellymarchisio

Though not explicitly stated in the paper, I understand that mT5 uses a SentencePiece Unigram tokenizer (please correct me if I am wrong). I cannot seem to find how much data this tokenizer was trained on.

The mT5 paper says, "As in T5, we use SentencePiece (Kudo and Richardson, 2018; Kudo, 2018) models trained with the language sampling rates used during pre-training." The T5 paper says, "Then, we trained our SentencePiece model on a mixture of 10 parts of English C4 data with 1 part each of data classified as German, French or Romanian.", but I do not see the raw GB and/or token counts for the tokenizer's training data.

How much data was the tokenizer trained on? (And, if you recall, approximately how long did it take to train, and how much RAM was required?)
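
For context, my mental model of the setup is something like the sketch below using the sentencepiece library, where the corpus file, character coverage, and sentence-subsampling settings are illustrative assumptions on my part (only the 250,000 vocabulary size comes from the mT5 paper). I am essentially asking how much data fed into a run like this.

```python
import sentencepiece as spm

# Minimal sketch: train a unigram SentencePiece model on a pre-mixed corpus.
# "mixed_corpus.txt" and the coverage/subsampling settings below are assumptions
# for illustration; only the 250,000 vocabulary size is taken from the mT5 paper.
spm.SentencePieceTrainer.train(
    input="mixed_corpus.txt",        # one sentence per line, languages pre-sampled
    model_prefix="mt5_like",
    model_type="unigram",
    vocab_size=250000,
    character_coverage=0.99995,      # high coverage for multilingual text (assumed)
    input_sentence_size=10_000_000,  # subsample sentences to bound RAM (assumed)
    shuffle_input_sentence=True,
)
```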
