Fix llama 3 data loader #736
Conversation
dev/data/fineweb.py
Outdated
eot = enc._special_tokens['<|endoftext|>'] # end of text token
tokens = [eot] # the special <|endoftext|> token delimits all documents
elif model == "llama":
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
Should we not use tiktoken in the exact same way as the official Meta release shows? I like that it's more explicit. AutoTokenizer is a black box.
AutoTokenizer is a black box, I agree, but I believe we can be confident that, at least for an architecture as popular as LLaMA 3, HuggingFace is battle-tested!
I prefer that over downloading a tokenizer and passing in a path.
What do you think? (Also: we already depend on HuggingFace either way.)
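For reference, a minimal sketch of the AutoTokenizer route under discussion (assumes access to the gated meta-llama repo; the <|end_of_text|> lookup is how the delimiter id could be fetched, not necessarily what this PR does):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
# Llama 3's document delimiter is <|end_of_text|>
eot = tokenizer.convert_tokens_to_ids("<|end_of_text|>")

def encode(text):
    # add_special_tokens=False so <|begin_of_text|> isn't silently prepended
    return tokenizer(text, add_special_tokens=False).input_ids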
dev/data/fineweb.py
Outdated
# tokenizes a single document and returns a numpy array of uint16 tokens
tokens = [eot] # the special <|endoftext|> token delimits all documents
tokens.extend(enc.encode_ordinary(doc["text"]))
text = doc["text"]
This order was intentional: the delimiter should be prepended to docs, so you can run inference starting from just that single token.
I didn't change it, if you look above? We still have eot followed by the encode call (if I understood you correctly).
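To spell out the ordering being discussed (an illustration using the GPT-2 path's names, not new code):

tokens = [eot]                                   # delimiter comes first
tokens.extend(enc.encode_ordinary(doc["text"]))  # then the document body
# at inference time you can then prompt with just the one-token context [eot]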
dev/data/fineweb.py
Outdated
tokens_np_uint16 = tokens_np.astype(np.uint16)
return tokens_np_uint16

if model == "gpt-2":
Refactor to delete the if and the copy-pasted code; use a ternary operator to set the upper_bound.
and a ternary in the assert statements as well?
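A hedged sketch of the refactor being proposed (variable names assumed from the surrounding diff, not final code):

def tokenize(doc):
    tokens = [eot]
    tokens.extend(encode(doc["text"]))
    tokens_np = np.array(tokens)
    # ternaries pick the bound and dtype instead of duplicating the branch
    upper_bound = 2**16 if args.model == "gpt-2" else 2**32
    assert (0 <= tokens_np).all() and (tokens_np < upper_bound).all(), \
        f"token ids out of range for bound {upper_bound}"
    return tokens_np.astype(np.uint16 if args.model == "gpt-2" else np.uint32)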
@@ -99,7 +120,7 @@ def tokenize(doc):
     remainder = args.shard_size - token_count
     progress_bar.update(remainder)
     all_tokens_np[token_count:token_count+remainder] = tokens[:remainder]
-    write_datafile(filename, all_tokens_np)
+    write_datafile(filename, list(all_tokens_np), args.model)
Why convert to a list?
I simplified write_datafile so that it doesn't have to handle both numpy arrays and lists; it's cleaner, I think (?)
There could be many tokens, so creating a Python list could be very wasteful
>>> a = np.random.randn(10)
>>> a
array([-1.39200423, 0.91909499, 0.49247546, 0.73578011, -0.46485352,
0.06844696, 1.21521025, 0.18951044, -0.33376094, 1.03115886])
>>> list(a)
[-1.3920042324598616, 0.9190949922347375, 0.49247545796208686, 0.7357801064341112, -0.4648535191489631, 0.06844695804812885, 1.2152102515229188, 0.18951044050354424, -0.33376094056177236, 1.0311588596558752]
>>> z = list(a)
>>> z[0]
-1.3920042324598616
>>> type(z[0])
<class 'numpy.float64'>
vs .tolist(), which yields native Python floats:
>>> a = np.random.randn(10)
>>> a
array([-0.28416783, 3.61778557, 0.45557321, 0.6585392 , -0.54974637,
-0.50662981, 0.36080734, 0.76378507, -1.60443242, 0.41719901])
>>> a.tolist()
[-0.2841678282966355, 3.6177855666263548, 0.45557321422210056, 0.6585391952854299, -0.5497463693208792, -0.5066298099246493, 0.36080734397795633, 0.7637850737170351, -1.6044324246329673, 0.4171990143035489]
>>> z = a.tolist()
>>> z[0]
-0.2841678282966355
>>> type(z[0])
<class 'float'>
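If the list conversion exists only for write_datafile's benefit, one alternative sketch is to keep the numpy array end-to-end and write its buffer directly (the header layout below is illustrative, not the repo's actual shard format):

def write_datafile(filename, tokens_np, model):
    # illustrative header; real llm.c shards use a specific magic/version
    header = np.zeros(256, dtype=np.int32)
    header[2] = len(tokens_np)
    with open(filename, "wb") as f:
        f.write(header.tobytes())
        f.write(tokens_np.tobytes())  # no Python list is ever materialized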
dev/data/fineweb.py
Outdated
enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens['<|endoftext|>'] # end of text token
-def tokenize(doc):
+def tokenize(doc, model):
Instead of taking model and having a big if inside the def, we can have two defs for the two options and dispatch accordingly.
Sure; at this point I don't have a strong preference, as we only support 2 models.
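A sketch of the two-defs-plus-dispatch shape being suggested (the helper names here are assumptions, not code from the PR):

def tokenize_gpt2(doc):
    tokens = [gpt2_eot]                              # assumed name
    tokens.extend(gpt2_enc.encode_ordinary(doc["text"]))
    return np.array(tokens).astype(np.uint16)

def tokenize_llama3(doc):
    tokens = [llama_eot]                             # assumed name
    tokens.extend(llama_encode(doc["text"]))
    return np.array(tokens).astype(np.uint32)

tokenize = tokenize_gpt2 if args.model == "gpt-2" else tokenize_llama3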
elif model == "llama":
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
    def encode(x):
        return tokenizer(x).input_ids
Pretty sure this now creates a bug in this code, because <|endoftext|> (below) doesn't tokenize properly.
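To make the concern concrete (hedged, since the exact code path isn't shown here): Llama 3 has no '<|endoftext|>' token at all; its delimiter is '<|end_of_text|>', so passing the literal string through the tokenizer splits it into several ordinary tokens rather than yielding one special id:

# the literal string round-trips as multiple ordinary tokens
ids = tokenizer("<|endoftext|>", add_special_tokens=False).input_ids
assert len(ids) > 1  # not the single delimiter id we wanted
# the robust lookup instead:
eot = tokenizer.convert_tokens_to_ids("<|end_of_text|>")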
@@ -80,6 +96,15 @@ def tokenize(doc):
     all_tokens_np = np.empty((args.shard_size,), dtype=np.uint16)
     token_count = 0
     progress_bar = None
Doesn't def tokenize break, because:
all_tokens_np = np.empty((args.shard_size,), dtype=np.uint16)
i.e. the init is using uint16, while LLaMA 3 token ids run up to 128,255 and overflow it?
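Concretely, a tiny repro of the dtype concern:

import numpy as np

tok = np.array([128000])      # e.g. <|begin_of_text|> in Llama 3
print(tok.astype(np.uint16))  # [62464] -- silently wraps, data corrupted
print(tok.astype(np.uint32))  # [128000], as intended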
Add LLaMA 3 tokenization support for all our datasets: