
An LLM tokenizer implemented as a streamly application #78


Open · wants to merge 10 commits into master

Conversation


@twitu twitu commented Feb 17, 2025

A greedy tokenizer breaks text into tokens based on data-driven rules it has learned. The learning phase repeatedly finds the most common pair of tokens in the data and merges them into a new token.

This is a pure text-processing application that can be re-imagined as a streaming application: a study of all three fundamental constructs of streaming (Streams, Folds, and Pipes) and a demonstration of the streamly framework.
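For readers unfamiliar with the scheme, the core of the learning step (byte-pair-encoding style) can be sketched in plain Haskell with Data.Map. The names countPairs, mostFrequent, and mergePair are illustrative, not the PR's actual API:

```haskell
import qualified Data.Map.Strict as M
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Count adjacent token pairs in a token sequence.
countPairs :: [Int] -> M.Map (Int, Int) Int
countPairs ts = M.fromListWith (+) [(p, 1) | p <- zip ts (drop 1 ts)]

-- The most frequent pair (assumes a non-empty map).
mostFrequent :: M.Map (Int, Int) Int -> (Int, Int)
mostFrequent = fst . maximumBy (comparing snd) . M.toList

-- Replace every occurrence of the pair with a freshly allocated token id.
mergePair :: (Int, Int) -> Int -> [Int] -> [Int]
mergePair (a, b) new (x:y:rest)
  | x == a && y == b = new : mergePair (a, b) new rest
mergePair p new (x:rest) = x : mergePair p new rest
mergePair _ _ [] = []
```

Training iterates this until the desired vocabulary size is reached, e.g. `mergePair (1,2) 9 [1,2,1,2,3]` yields `[9,9,3]`.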

A review is welcome.

@harendra-kumar (Member)

I have not understood the program yet; I just had a cursory look at it:

  • Can you use Streamly Array/MutArray instead of vector? If not what is lacking in arrays that stops you from doing that?
  • The Pipe API is experimental; we have not finished it yet, and though it works, it may have bugs. It would be better if the job could be done with the Scanl module instead.
  • Do you have some test data to measure performance? How can we access that?

@twitu (Author) commented Feb 17, 2025

Yeah, I will change the example to use Streamly Array.

There are two pipes; one of them can be written as a scanl, but the other one might be quite difficult. 🤔

I was hoping to make two examples for this application: one that's easy to read and one that's performance-oriented. There are even some parts of the logic that can benefit from parallelism. What do you think?

@adithyaov (Member)

I was hoping to make two examples for this application one that's easy to read and one that's performance oriented. There's even some parts of the logic that can benefit from parallelism. What do you think?

Perhaps you can make the performance-oriented version easy to read :-)


-- Stores byte-sequence-to-index mapping and index-to-text mapping
data ByteMappings = ByteMappings
{ byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices

This Map Word8 Int can just be a mutable array and can benefit from O(1) peeking and poking.
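As a sketch of this suggestion, a 256-slot mutable table gives the O(1) peeking and poking described; Data.Array.IO's IOUArray stands in here for streamly's MutArray, and the helper names are hypothetical:

```haskell
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)
import Data.Word (Word8)

-- A 256-slot mutable table replacing M.Map Word8 Int: each byte indexes
-- its own slot, so lookups and updates are O(1) reads and writes.
-- (-1) marks a byte with no index assigned yet.
newByteTable :: IO (IOUArray Word8 Int)
newByteTable = newArray (minBound, maxBound) (-1)

assignIndex :: IOUArray Word8 Int -> Word8 -> Int -> IO ()
assignIndex = writeArray

lookupIndex :: IOUArray Word8 Int -> Word8 -> IO Int
lookupIndex = readArray
```

Because the key space is exactly 0..255, the table never needs resizing, and there is no tree traversal or rebalancing as with Data.Map.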

Comment on lines +85 to +88
addPair acc chunk =
case Array.toList chunk of
[b1, b2] -> M.insertWith (+) (b1, b2) 1 acc
_ -> acc

You can index into the Array directly.

updateMappings (ByteMappings b2i s2i i2t nidx) (i1, i2) =
let text1 = M.findWithDefault "?" i1 i2t
text2 = M.findWithDefault "?" i2 i2t
newToken = text1 ++ text2

You should use Text or a UTF-8 encoded Array Word8 instead of String.
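A minimal sketch of the Text-based alternative for the concatenation in updateMappings; mergeTokenText is a hypothetical helper, not code from the PR:

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Text as T

-- Look up both token texts and append them as Text. Text stores tokens
-- as packed buffers, so this avoids String's per-character cons cells.
mergeTokenText :: M.Map Int T.Text -> Int -> Int -> T.Text
mergeTokenText i2t i1 i2 =
  M.findWithDefault (T.pack "?") i1 i2t
    <> M.findWithDefault (T.pack "?") i2 i2t
```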

(nidx + 1)

{-# INLINE replaceMostFrequentPair #-}
replaceMostFrequentPair :: (Monad m) => (Int, Int) -> Int -> Pipe m Int Int
@adithyaov (Member) Feb 19, 2025

Could you describe what this function does? Some examples would help.

-- Stores byte-sequence-to-index mapping and index-to-text mapping
data ByteMappings = ByteMappings
{ byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices
seqToIndex :: !(M.Map (V.Vector Word8) Int), -- Maps sequences of bytes to unique indices

I'm curious to see how a cuckoo hash table might behave in this case.
We can try using it and checking performance.


You can use an unboxed Array instead of Vector Word8; Array Word8, basically.

data ByteMappings = ByteMappings
{ byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices
seqToIndex :: !(M.Map (V.Vector Word8) Int), -- Maps sequences of bytes to unique indices
indexToText :: !(M.Map Int String), -- Maps indices to text representation

Maybe you can use Text or a UTF-8 encoded Array Word8 for this?

let text1 = M.findWithDefault "?" i1 i2t
text2 = M.findWithDefault "?" i2 i2t
newToken = text1 ++ text2
bytes = V.fromList $ map charToWord8 newToken
@adithyaov (Member) Feb 23, 2025

This looks incorrect. A Char is a full Unicode code point (up to 4 bytes in UTF-8), so you are losing information here. The conversion should be Char -> [Word8]. Unless you're strictly using ASCII; in that case, you needn't use Char at all.
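A correct Char -> [Word8] conversion can be sketched by routing through UTF-8 encoding; charToBytes is an illustrative name, not the PR's function:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Data.Word (Word8)

-- A Char is a full Unicode code point, which UTF-8 encodes as 1 to 4
-- bytes; mapping it to a single Word8 truncates anything beyond ASCII.
-- Encoding through Text yields the complete byte sequence.
charToBytes :: Char -> [Word8]
charToBytes = B.unpack . encodeUtf8 . T.singleton
```

For ASCII input this produces exactly one byte per character, so the two approaches only diverge on non-ASCII text.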

-- reset the state (starting with the current byte), and continue.
{-# INLINE greedyTokenizer #-}
greedyTokenizer :: (Monad m) => ByteMappings -> Pipe m Word8 String
greedyTokenizer mapping = Pipe consume produce (V.empty, "", 0)

You can just write this as a Stream.


3 participants