An LLM tokenizer implemented as a streamly application #78
Map stream of bytes to index values
I have not understood the program, but just had a cursory look at it:
Yeah, I will change the example to use Streamly Array. There are two pipes: one of them can be written as a scanl, but the other one might be quite difficult. 🤔 I was hoping to make two examples for this application, one that's easy to read and one that's performance-oriented. There are even some parts of the logic that can benefit from parallelism. What do you think?
Perhaps you can make the performance-oriented version easy to read :-)
-- Stores byte-sequence-to-index mapping and index-to-text mapping
data ByteMappings = ByteMappings
  { byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices
This `Map Word8 Int` can just be a mutable array and can benefit from O(1) peeking and poking.
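A minimal sketch of the idea, assuming one slot per possible byte value and -1 as the "unassigned" marker. It uses `STUArray` from the `array` package rather than Streamly's mutable array, and `buildByteTable` is a hypothetical name, not something from the PR:

```haskell
import Data.Array.ST (newArray, readArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, (!))
import Data.Word (Word8)

-- Assign each distinct byte the next free index, in order of first appearance.
-- The table has 256 slots (one per Word8 value); -1 means "not seen yet".
buildByteTable :: [Word8] -> UArray Word8 Int
buildByteTable bytes = runSTUArray $ do
    table <- newArray (minBound, maxBound) (-1)
    let go _ [] = return ()
        go next (b : bs) = do
            cur <- readArray table b -- O(1) peek
            if cur < 0
                then writeArray table b next >> go (next + 1) bs -- O(1) poke
                else go next bs
    go 0 bytes
    return table
```

Lookups on the frozen table are then plain indexing, for example `buildByteTable input ! byte`.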
addPair acc chunk =
  case Array.toList chunk of
    [b1, b2] -> M.insertWith (+) (b1, b2) 1 acc
    _ -> acc
You can index into the Array directly.
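A rough sketch of what indexing directly could look like. To keep it dependent only on GHC's bundled libraries, it uses `UArray` from the `array` package as a stand-in for Streamly's `Array`; with Streamly the same shape would use that library's own length and index-lookup functions, which should be checked against the version in use:

```haskell
import Data.Array.Unboxed (UArray, bounds, listArray, (!))
import qualified Data.Map.Strict as M
import Data.Word (Word8)

-- Count a two-element chunk as a pair by indexing into it,
-- without converting the chunk to an intermediate list.
-- Chunks of any other size are ignored, as in the reviewed code.
addPair :: M.Map (Word8, Word8) Int -> UArray Int Word8 -> M.Map (Word8, Word8) Int
addPair acc chunk
    | bounds chunk == (0, 1) = M.insertWith (+) (chunk ! 0, chunk ! 1) 1 acc
    | otherwise = acc
```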
updateMappings (ByteMappings b2i s2i i2t nidx) (i1, i2) =
  let text1 = M.findWithDefault "?" i1 i2t
      text2 = M.findWithDefault "?" i2 i2t
      newToken = text1 ++ text2
You should use `Text` or a UTF-8 encoded `Array Word8` instead of `String`.
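For illustration, a small sketch of the `Text` variant, assuming the index-to-text map becomes `Map Int Text`. The names `mergedTokenText` and `mergedTokenBytes` are hypothetical; the `"?"` default mirrors the snippet above:

```haskell
import qualified Data.ByteString as BS
import qualified Data.Map.Strict as M
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Text of the token formed by merging indices i1 and i2.
-- Unknown indices fall back to "?", as in the reviewed code.
mergedTokenText :: M.Map Int T.Text -> (Int, Int) -> T.Text
mergedTokenText i2t (i1, i2) = look i1 <> look i2
  where
    look i = M.findWithDefault (T.pack "?") i i2t

-- UTF-8 bytes of a token; multi-byte characters are preserved.
mergedTokenBytes :: T.Text -> [Word8]
mergedTokenBytes = BS.unpack . TE.encodeUtf8
```

Appending `Text` values avoids the linked-list cost of `String` concatenation, and `encodeUtf8` gives the byte sequence without a per-character conversion.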
(nidx + 1)

{-# INLINE replaceMostFrequentPair #-}
replaceMostFrequentPair :: (Monad m) => (Int, Int) -> Int -> Pipe m Int Int
Could you describe what this function does? Some examples would help.
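As a starting point for such a description, here is a list-based sketch of what the signature suggests: every adjacent occurrence of the given pair is replaced by the new index. This is only a reading of the type, not the author's `Pipe` implementation, and `replacePair` is a hypothetical name:

```haskell
-- Replace each non-overlapping occurrence of the pair (a, b),
-- scanning left to right, with the single index `new`.
replacePair :: (Int, Int) -> Int -> [Int] -> [Int]
replacePair (a, b) new = go
  where
    go (x : y : rest)
        | x == a && y == b = new : go rest
    go (x : rest) = x : go rest
    go [] = []
```

Under this reading, `replacePair (1, 2) 9 [1, 2, 3, 1, 2]` gives `[9, 3, 9]`, and `replacePair (1, 1) 9 [1, 1, 1]` gives `[9, 1]` because matches do not overlap.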
-- Stores byte-sequence-to-index mapping and index-to-text mapping
data ByteMappings = ByteMappings
  { byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices
    seqToIndex :: !(M.Map (V.Vector Word8) Int), -- Maps sequences of bytes to unique indices
I'm curious to see how a cuckoo hash table might behave in this case.
We can try using it and checking performance.
You can use an unboxed `Array` instead of a `Vector Word8`: `Array Word8`, basically.
data ByteMappings = ByteMappings
  { byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices
    seqToIndex :: !(M.Map (V.Vector Word8) Int), -- Maps sequences of bytes to unique indices
    indexToText :: !(M.Map Int String), -- Maps indices to text representation
You can maybe use `Text` or a UTF-8 encoded `Array Word8` for this?
  let text1 = M.findWithDefault "?" i1 i2t
      text2 = M.findWithDefault "?" i2 i2t
      newToken = text1 ++ text2
      bytes = V.fromList $ map charToWord8 newToken
This looks incorrect. A `Char` is essentially 4 bytes, so you are losing information here. The conversion should be `Char -> [Word8]`, unless you're strictly using ASCII. In that case, you needn't use `Char`.
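To make the point concrete, here is a sketch of a lossless `Char -> [Word8]` conversion using standard UTF-8 encoding. `encodeCharUtf8` is a hypothetical name; in practice a library encoder would normally be preferable, and this version does not reject surrogate code points:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one character as 1 to 4 UTF-8 bytes, depending on its code point.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
    | n < 0x80 = [fromIntegral n]
    | n < 0x800 = [0xC0 .|. lead 6, cont 0]
    | n < 0x10000 = [0xE0 .|. lead 12, cont 6, cont 0]
    | otherwise = [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload bits of the leading byte (the guard ensures they fit).
    lead s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)
```

Only code points below 0x80 fit in one byte, which is why a `Char -> Word8` mapping is safe for ASCII input and lossy for everything else.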
-- reset the state (starting with the current byte), and continue.
{-# INLINE greedyTokenizer #-}
greedyTokenizer :: (Monad m) => ByteMappings -> Pipe m Word8 String
greedyTokenizer mapping = Pipe consume produce (V.empty, "", 0)
You can just write this as a `Stream`.
A greedy tokenizer breaks text into words based on data-driven rules it has learnt. The learning phase finds the most common pair of tokens in the data and merges them into a new token.
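The "most common pair" part of the learning phase can be sketched over a plain list of token indices. This is a simplified, non-streaming illustration, not the PR's code; `countPairs` and `mostFrequentPair` are hypothetical names:

```haskell
import Data.List (foldl', maximumBy)
import qualified Data.Map.Strict as M
import Data.Ord (comparing)

-- Count every adjacent pair of tokens in the input.
countPairs :: [Int] -> M.Map (Int, Int) Int
countPairs xs =
    foldl' (\acc p -> M.insertWith (+) p 1 acc) M.empty (zip xs (drop 1 xs))

-- The pair with the highest count, or Nothing if there are no pairs.
mostFrequentPair :: [Int] -> Maybe (Int, Int)
mostFrequentPair xs
    | M.null counts = Nothing
    | otherwise = Just (fst (maximumBy (comparing snd) (M.toList counts)))
  where
    counts = countPairs xs
```

For `[1, 2, 3, 1, 2]` the pair `(1, 2)` occurs twice and is selected; that pair would then be replaced throughout the data by a newly allocated token index.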
This is a pure text-processing application that can be re-imagined as a streaming application. It is a study of all three fundamental constructs of streaming (Streams, Folds and Pipes) and a demonstration of the streamly framework.
A review is welcome.