Though not explicitly stated in the paper, I understand that mT5 uses a SentencePiece Unigram tokenizer (please correct me if I am wrong). I cannot seem to find how much data this tokenizer was trained on.
The mT5 paper says, "As in T5, we use SentencePiece (Kudo and Richardson, 2018; Kudo, 2018) models trained with the language sampling rates used during pre-training." The T5 paper says, "Then, we trained our SentencePiece model on a mixture of 10 parts of English C4 data with 1 part each of data classified as German, French or Romanian." However, I cannot find the raw size (GB) or token count of the data used to train the tokenizer.
How much data was the tokenizer trained on? (And, if you recall, approximately how long did it take to train, and how much RAM was required?)
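For context on why I am asking about data size: my (possibly wrong) mental model of how such a tokenizer is built is the SentencePiece Unigram training call sketched below, where the amount of data the trainer actually sees is governed by the size of the input file and by `input_sentence_size`. The file name, vocab size, and sampling settings here are my own placeholders, not values from the paper or this repo.

```python
# Rough sketch of a SentencePiece Unigram training run (my assumption of the
# general procedure, not the official mT5 recipe). Input file, vocab size, and
# sampling settings are illustrative placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="sampled_mc4_text.txt",       # hypothetical file of text pre-sampled with the language rates
    model_prefix="mt5_like_unigram",    # writes mt5_like_unigram.model and .vocab
    model_type="unigram",               # Unigram LM, the model type used by T5/mT5 SentencePiece models
    vocab_size=250000,                  # placeholder; mT5's vocabulary is roughly 250k wordpieces
    character_coverage=0.99995,         # high coverage is typical for multilingual corpora
    input_sentence_size=10_000_000,     # trainer subsamples this many sentences (0 = use everything)
    shuffle_input_sentence=True,        # shuffle before subsampling
    train_extremely_large_corpus=True,  # flag intended for very large training corpora
)
```

Knowing the raw GB / sentence count that went into the equivalent of `input` above, and whether sentence subsampling was used, would answer my question.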