This repository was archived by the owner on Feb 17, 2024. It is now read-only.

How much data was used to train the mT5 tokenizer? #117

@kellymarchisio

Though not explicitly stated in the paper, I understand that mT5 uses a SentencePiece Unigram tokenizer (please correct me if I am wrong). I cannot seem to find how much data this tokenizer was trained on.

The mT5 paper says, "As in T5, we use SentencePiece (Kudo and Richardson, 2018; Kudo, 2018) models trained with the language sampling rates used during pre-training." The T5 paper says, "Then, we trained our SentencePiece model on a mixture of 10 parts of English C4 data with 1 part each of data classified as German, French or Romanian.", but I do not see the raw GB and/or token counts for the tokenizer's training data.

How much data was the tokenizer trained on? (And, if you recall, approximately how long did it take to train, and how much RAM was required?)
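
For context, my mental model of the setup is something like the sketch below using the sentencepiece library, where the corpus file, character coverage, and sentence-subsampling settings are illustrative assumptions on my part (only the 250,000 vocabulary size comes from the mT5 paper). I am essentially asking how much data fed into a run like this.

```python
import sentencepiece as spm

# Minimal sketch: train a unigram SentencePiece model on a pre-mixed corpus.
# "mixed_corpus.txt" and the coverage/subsampling settings below are assumptions
# for illustration; only the 250,000 vocabulary size is taken from the mT5 paper.
spm.SentencePieceTrainer.train(
    input="mixed_corpus.txt",        # one sentence per line, languages pre-sampled
    model_prefix="mt5_like",
    model_type="unigram",
    vocab_size=250000,
    character_coverage=0.99995,      # high coverage for multilingual text (assumed)
    input_sentence_size=10_000_000,  # subsample sentences to bound RAM (assumed)
    shuffle_input_sentence=True,
)
```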
