RobinPicard
Contributor

Closes #224

The problem described in the issue above is caused by an incompatibility between the vocabulary provided by the user and the regex used to create the DFA. It typically arises when the user encodes the vocabulary incorrectly (special tokens from their tokenizer are included).

This PR proposes to do 2 things about it:

  • Update the definition of the Index object to raise an error at initialization if such an incompatibility exists (instead of letting the error surface later, during inference)
  • Update the README to give more information on how to create a Vocabulary and warn users about this problem
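To illustrate the first point, here is a minimal, self-contained sketch of failing fast at construction time. It does not use the library's actual `Index` or `Vocabulary` classes: `ToyIndex`, `walk`, the hand-written DFA for the regex `ab`, and the dead-end condition are all hypothetical stand-ins, and the exact check implemented in this PR may differ.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical DFA encoding: (state, character) -> next state.
Dfa = Dict[Tuple[int, str], int]


def walk(dfa: Dfa, state: int, token: str) -> Optional[int]:
    """Feed a token through the DFA one character at a time.

    Returns the resulting state, or None if any character has no transition.
    """
    for ch in token:
        nxt = dfa.get((state, ch))
        if nxt is None:
            return None
        state = nxt
    return state


class ToyIndex:
    """Toy stand-in for an index mapping DFA states to allowed tokens."""

    def __init__(self, dfa: Dfa, start: int, finals: Set[int],
                 vocabulary: Iterable[str]) -> None:
        tokens = [t for t in vocabulary if t]  # ignore empty tokens
        self.transitions: Dict[int, Dict[str, int]] = {}
        seen = {start}
        frontier = [start]
        # Explore every state reachable through whole tokens.
        while frontier:
            state = frontier.pop()
            for token in tokens:
                nxt = walk(dfa, state, token)
                if nxt is None:
                    continue
                self.transitions.setdefault(state, {})[token] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        # Assumed incompatibility condition: a reachable, non-final state
        # that no token can leave. Generation would get stuck there.
        dead_ends = sorted(
            s for s in seen if s not in finals and s not in self.transitions
        )
        if dead_ends:
            raise ValueError(
                f"Vocabulary is incompatible with the regex: "
                f"no token leaves state(s) {dead_ends}"
            )


# DFA for the regex `ab`: 0 --a--> 1 --b--> 2 (final).
AB_DFA: Dfa = {(0, "a"): 1, (1, "b"): 2}

# A clean vocabulary builds without error.
ok_index = ToyIndex(AB_DFA, start=0, finals={2}, vocabulary=["a", "b", "ab"])

# A vocabulary that still carries a tokenizer-specific prefix ("Ġ") on "b"
# leaves state 1 with no way forward, so construction fails immediately.
try:
    ToyIndex(AB_DFA, start=0, finals={2}, vocabulary=["a", "Ġb"])
    raised = False
except ValueError:
    raised = True
```

In this sketch the incompatible vocabulary is rejected when the index is built, rather than when generation reaches state 1 and finds no next state.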

@RobinPicard RobinPicard requested a review from rlouf August 4, 2025 14:37
rlouf
rlouf previously approved these changes Aug 4, 2025
@RobinPicard RobinPicard force-pushed the raise_error_incompatible_vocab branch from aba7578 to b6a5133 Compare August 4, 2025 20:11
@RobinPicard RobinPicard enabled auto-merge (rebase) August 4, 2025 20:27
@RobinPicard RobinPicard requested a review from rlouf August 4, 2025 20:31
@RobinPicard RobinPicard self-assigned this Aug 4, 2025
@RobinPicard RobinPicard force-pushed the raise_error_incompatible_vocab branch from 9a3aae1 to b6a5133 Compare August 5, 2025 08:03

Development

Successfully merging this pull request may close these issues.

"No next state found for the current state" error
