@chanr0 chanr0 commented Jun 6, 2025

Add EOS handling.


# construct mappings from byte vocab to indices in weight array
self.trie_decode = (
list(range(256)) + atomic_tokens + self.eos_tokens + [self.eot_token]
Author

We may not need all 256 byte values here.

Author

Think about the limited-vocabulary case.

)
self.trie_encode = {
k: v for k, v in zip(self.trie_decode, list(range(len(self.trie_decode))))
}
Author

Add a check that `trie_decode` contains no duplicate values before building `trie_encode`. With duplicates, the dict comprehension silently keeps only the last index for a repeated key.
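The duplicate check requested above could look like the following minimal sketch. `build_trie_maps` and its `byte_values` parameter are hypothetical names, not part of this PR; `byte_values` only illustrates how the byte alphabet might be restricted for a limited vocabulary, which is an assumption about what that comment intends.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def build_trie_maps(
    atomic_tokens: list[bytes],
    eos_tokens: list[bytes],
    eot_token: Hashable,
    byte_values: Optional[Iterable[int]] = None,
) -> tuple[list, dict]:
    """Build (decode, encode) and fail loudly on duplicate symbols."""
    # Hypothetical knob: restrict the byte alphabet for a limited vocabulary.
    byte_part = list(range(256)) if byte_values is None else sorted(set(byte_values))
    decode = byte_part + list(atomic_tokens) + list(eos_tokens) + [eot_token]

    # A dict comprehension would silently keep only the last index for a
    # repeated symbol, so detect repeats before building the encode map.
    duplicates = [sym for sym, count in Counter(decode).items() if count > 1]
    if duplicates:
        raise ValueError(f"duplicate symbols in trie_decode: {duplicates!r}")

    encode = {sym: idx for idx, sym in enumerate(decode)}
    return decode, encode
```

A likely source of duplicates is an EOS token that is also listed as an atomic token, which this check would surface at construction time.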

return self.materialize().map_keys(
- lambda x: bytes([x]) if x is not None else "EOT"
+ lambda x: bytes([x]) if x in range(256) else ("EOT" if x is None else x)
)
Author

Handle the case where EOT is something other than None, e.g., decode[-1].
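One way to avoid hard-coding `None` as the EOT marker is to pass the sentinel in explicitly. This is only a sketch: `display_key` is a hypothetical helper, and it assumes the EOT symbol can be compared with `==` against trie symbols.

```python
from typing import Hashable


def display_key(x: Hashable, eot_token: Hashable = None):
    """Map an internal trie symbol to a display key."""
    # Check the sentinel first so an integer EOT id is never treated as a byte.
    if x == eot_token:
        return "EOT"
    # Single byte values become one-byte `bytes` objects.
    if isinstance(x, int) and 0 <= x < 256:
        return bytes([x])
    # Multi-byte atomic or EOS tokens pass through unchanged.
    return x
```

Such a helper could then be wrapped in the lambda given to `map_keys`, with the sentinel taken from wherever the trie stores its EOT symbol.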

return y.replace(" ", "␣")


def split_with_atomic_tokens(data: bytes, atomic_tokens: list[bytes]) -> list[Union[int, bytes]]:
Author

@benlebrun We need to check with Tim what the best handling is here.
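For discussion, here is one possible behavior for a function with this signature: greedy longest-match splitting. This is an assumption, not the implementation in this PR, and the function name below is deliberately different to make that clear.

```python
from typing import Union


def split_with_atomic_tokens_sketch(
    data: bytes, atomic_tokens: list[bytes]
) -> list[Union[int, bytes]]:
    """Split data into atomic tokens (bytes) and single byte values (int)."""
    # Try longer atomic tokens first so b"abc" wins over b"ab".
    ordered = sorted((t for t in set(atomic_tokens) if t), key=len, reverse=True)
    out: list[Union[int, bytes]] = []
    i = 0
    while i < len(data):
        match = next((t for t in ordered if data.startswith(t, i)), None)
        if match is not None:
            out.append(match)
            i += len(match)
        else:
            out.append(data[i])  # indexing bytes yields an int
            i += 1
    return out
```

Greedy matching is simple but not the only option; overlapping atomic tokens may call for a different policy, which is part of what still needs to be decided.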

"""
if node := self.children[self.node].get(b):
mass = self.mass
if b in self.trie.trie.eos_tokens:
Author

This is not the right place for this check; handle it during trie construction instead. For instance, assume your EOS tokens are 'ab' and 'cd':

  • during prefill, you want the byte-level nodes, i.e., a-b and c-d
  • during sampling, you want both grouped into a single EOS node
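The two views described above can be sketched with a toy trie builder. Everything here is hypothetical (`build_children`, `leaf_mass`, and the `"EOS"` label are illustrative names), and the sketch omits the end-of-token edge that a real trie would need.

```python
EOS = "EOS"  # hypothetical label for the grouped EOS node


def build_children(tokens, eos_tokens, group_eos):
    """Return (children, leaf_of); children[node] maps an edge label to a node id."""
    children = [{}]  # node 0 is the root
    leaf_of = {}
    eos_set = set(eos_tokens)
    for tok in tokens:
        if group_eos and tok in eos_set:
            path = [EOS]  # sampling view: every EOS token shares one edge
        else:
            path = list(tok)  # prefill view: iterating bytes yields ints
        node = 0
        for label in path:
            if label not in children[node]:
                children.append({})
                children[node][label] = len(children) - 1
            node = children[node][label]
        leaf_of[tok] = node
    return children, leaf_of


def leaf_mass(leaf_of, probs):
    """Sum token probabilities per leaf node."""
    mass = {}
    for tok, node in leaf_of.items():
        mass[node] = mass.get(node, 0.0) + probs[tok]
    return mass


tokens = [b"ab", b"cd", b"ax"]
eos_tokens = [b"ab", b"cd"]
prefill_children, prefill_leaf = build_children(tokens, eos_tokens, group_eos=False)
sample_children, sample_leaf = build_children(tokens, eos_tokens, group_eos=True)
```

With `group_eos=False` the root has byte edges for `a` and `c`; with `group_eos=True` both EOS tokens land on the same node, so their probability mass is summed there.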

2 participants