
An LLM tokenizer implemented as a streamly application #78


Open · wants to merge 10 commits into master

Conversation


@twitu twitu commented Feb 17, 2025

A greedy tokenizer breaks text into tokens based on data-driven rules it has learned. The learning phase repeatedly finds the most common pair of tokens in the data and merges them into a new token.

This is a pure text-processing application that can be re-imagined as a streaming application: a study of all three fundamental constructs of streaming (Streams, Folds, and Pipes) and a demonstration of the streamly framework.
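For readers unfamiliar with the scheme, the core of the learning step (byte-pair-encoding style) can be sketched in plain Haskell with Data.Map. The names countPairs, mostFrequent, and mergePair are illustrative, not the PR's actual API:

```haskell
import qualified Data.Map.Strict as M
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Count adjacent token pairs in a token sequence.
countPairs :: [Int] -> M.Map (Int, Int) Int
countPairs ts = M.fromListWith (+) [(p, 1) | p <- zip ts (drop 1 ts)]

-- The most frequent pair (assumes a non-empty map).
mostFrequent :: M.Map (Int, Int) Int -> (Int, Int)
mostFrequent = fst . maximumBy (comparing snd) . M.toList

-- Replace every occurrence of the pair with a freshly allocated token id.
mergePair :: (Int, Int) -> Int -> [Int] -> [Int]
mergePair (a, b) new (x:y:rest)
  | x == a && y == b = new : mergePair (a, b) new rest
mergePair p new (x:rest) = x : mergePair p new rest
mergePair _ _ [] = []
```

Training iterates this until the desired vocabulary size is reached, e.g. `mergePair (1,2) 9 [1,2,1,2,3]` yields `[9,9,3]`.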

A review is welcome.

@harendra-kumar (Member)

I have not understood the program yet; I just had a cursory look at it:

  • Can you use Streamly Array/MutArray instead of vector? If not what is lacking in arrays that stops you from doing that?
  • The Pipe API is experimental; we have not finished it yet, and though it works, it may have bugs. It would be better if the job could be done with the Scanl module instead.
  • Do you have some test data to measure performance? How can we access that?

@twitu (Author) commented Feb 17, 2025

Yeah, I will change the example to use Streamly Array.

There are two pipes; one of them can be written as a scanl, but the other one might be quite difficult. 🤔

I was hoping to make two examples for this application: one that's easy to read and one that's performance-oriented. There are even some parts of the logic that can benefit from parallelism. What do you think?

@adithyaov (Member)

I was hoping to make two examples for this application one that's easy to read and one that's performance oriented. There's even some parts of the logic that can benefit from parallelism. What do you think?

Perhaps you can make the performance-oriented version easy to read :-)


-- Stores byte-sequence-to-index mapping and index-to-text mapping
data ByteMappings = ByteMappings
{ byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices

This Map Word8 Int can just be a mutable array and can benefit from O(1) peeking and poking.
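As a sketch of this suggestion, a 256-slot mutable table gives the O(1) peeking and poking described; Data.Array.IO's IOUArray stands in here for streamly's MutArray, and the helper names are hypothetical:

```haskell
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)
import Data.Word (Word8)

-- A 256-slot mutable table replacing M.Map Word8 Int: each byte indexes
-- its own slot, so lookups and updates are O(1) reads and writes.
-- (-1) marks a byte with no index assigned yet.
newByteTable :: IO (IOUArray Word8 Int)
newByteTable = newArray (minBound, maxBound) (-1)

assignIndex :: IOUArray Word8 Int -> Word8 -> Int -> IO ()
assignIndex = writeArray

lookupIndex :: IOUArray Word8 Int -> Word8 -> IO Int
lookupIndex = readArray
```

Because the key space is exactly 0..255, the table never needs resizing, and there is no tree traversal or rebalancing as with Data.Map.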

Comment on lines +85 to +88
addPair acc chunk =
case Array.toList chunk of
[b1, b2] -> M.insertWith (+) (b1, b2) 1 acc
_ -> acc

You can index into the Array directly.

updateMappings (ByteMappings b2i s2i i2t nidx) (i1, i2) =
let text1 = M.findWithDefault "?" i1 i2t
text2 = M.findWithDefault "?" i2 i2t
newToken = text1 ++ text2

You should use Text or a UTF-8 encoded Array Word8 instead of String.
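A minimal sketch of the Text-based alternative for the concatenation in updateMappings; mergeTokenText is a hypothetical helper, not code from the PR:

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Text as T

-- Look up both token texts and append them as Text. Text stores tokens
-- as packed buffers, so this avoids String's per-character cons cells.
mergeTokenText :: M.Map Int T.Text -> Int -> Int -> T.Text
mergeTokenText i2t i1 i2 =
  M.findWithDefault (T.pack "?") i1 i2t
    <> M.findWithDefault (T.pack "?") i2 i2t
```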

(nidx + 1)

{-# INLINE replaceMostFrequentPair #-}
replaceMostFrequentPair :: (Monad m) => (Int, Int) -> Int -> Pipe m Int Int
@adithyaov (Member) Feb 19, 2025

Could you describe what this function does? Some examples would help.

-- Stores byte-sequence-to-index mapping and index-to-text mapping
data ByteMappings = ByteMappings
{ byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices
seqToIndex :: !(M.Map (V.Vector Word8) Int), -- Maps sequences of bytes to unique indices

I'm curious to see how a cuckoo hash table might behave in this case.
We can try using it and checking performance.


You can use an unboxed Array instead of Vector Word8; Array Word8, basically.

data ByteMappings = ByteMappings
{ byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices
seqToIndex :: !(M.Map (V.Vector Word8) Int), -- Maps sequences of bytes to unique indices
indexToText :: !(M.Map Int String), -- Maps indices to text representation

Maybe you can use Text or a UTF-8 encoded Array Word8 for this?

let text1 = M.findWithDefault "?" i1 i2t
text2 = M.findWithDefault "?" i2 i2t
newToken = text1 ++ text2
bytes = V.fromList $ map charToWord8 newToken
@adithyaov (Member) Feb 23, 2025

This looks incorrect. A Char is a full Unicode code point (up to 4 bytes in UTF-8), so you are losing information here. The conversion should be Char -> [Word8]. Unless you're strictly using ASCII; in that case, you needn't use Char at all.
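A correct Char -> [Word8] conversion can be sketched by routing through UTF-8 encoding; charToBytes is an illustrative name, not the PR's function:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Data.Word (Word8)

-- A Char is a full Unicode code point, which UTF-8 encodes as 1 to 4
-- bytes; mapping it to a single Word8 truncates anything beyond ASCII.
-- Encoding through Text yields the complete byte sequence.
charToBytes :: Char -> [Word8]
charToBytes = B.unpack . encodeUtf8 . T.singleton
```

For ASCII input this produces exactly one byte per character, so the two approaches only diverge on non-ASCII text.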

-- reset the state (starting with the current byte), and continue.
{-# INLINE greedyTokenizer #-}
greedyTokenizer :: (Monad m) => ByteMappings -> Pipe m Word8 String
greedyTokenizer mapping = Pipe consume produce (V.empty, "", 0)

You can just write this as a Stream.


3 participants