Batch Tokenization Support #7371

Open · Tracked by #7383
tjwald opened this issue Jan 23, 2025 · 1 comment
tjwald commented Jan 23, 2025

Is your feature request related to a problem? Please describe.
Most AI systems batch inputs for performance, which requires all tokenized sequences in a batch to be padded to the same length, together with a mask indicating which positions are padding.
In my project I had to implement this myself; the main issues are performance and API compatibility with the rest of the ecosystem.
With my solution there are megabytes of allocations:

[Image: memory profiler screenshot showing the allocation breakdown]

The Int64[] allocations are due to the widening required because the ONNX model takes a Tensor<long> as input.
The Int32[] allocations are the actual tokens.
The string allocations are token strings that are never used and are thrown away.
The remaining allocations are internal, and I don't know what they are.
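For context, a minimal sketch of the widening step that produces the Int64[] allocations (WidenForOnnx is a hypothetical helper name; EncodeToIds is the existing Microsoft.ML.Tokenizers API):

using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Illustration of the per-call allocations described above: the tokenizer
// produces Int32 token IDs, but the ONNX input tensor is long-typed.
static long[] WidenForOnnx(Tokenizer tokenizer, string text, int maxTokenCount)
{
    IReadOnlyList<int> tokenIds = tokenizer.EncodeToIds(text); // Int32 IDs (allocates)
    var widened = new long[maxTokenCount];                     // Int64[] allocation per call
    for (int i = 0; i < tokenIds.Count && i < maxTokenCount; i++)
    {
        widened[i] = tokenIds[i]; // int -> long widening copy
    }
    return widened;
}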

Describe the solution you'd like
Enable a zero-allocation solution via an API like the following:

class Tokenizer
{
     ...
     public abstract void BatchTokenize<T>(ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> inputIds, Tensor<T> inputMask)
              where T : INumber<T>;

     public abstract void BatchTokenize<T>(ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> inputIds, Tensor<T> inputMask, Tensor<T> tokenTypeIds)
              where T : INumber<T>;
}

Maybe instead of Tensor<T> you would want to use TensorSpan<T>?

The string allocations would be removed when not needed, and the other internal allocations optimized.
This API would let me pool Tensors and avoid the int-to-long widening for my models.
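For illustration, a sketch of how a caller could use the proposed API with pooled tensors, assuming the signature above and the experimental System.Numerics.Tensors Tensor<T> (GetNextBatch is a hypothetical helper):

using System.Numerics.Tensors;

// The tensors are created (or rented from a pool) once and reused across
// batches, so the tokenization hot path allocates nothing.
Tensor<long> inputIds = Tensor.Create<long>([batchSize, maxTokenCount]);
Tensor<long> inputMask = Tensor.Create<long>([batchSize, maxTokenCount]);

string[] batch = GetNextBatch(); // hypothetical source of batchSize texts
tokenizer.BatchTokenize<long>(batch, maxTokenCount, inputIds, inputMask);
// inputIds / inputMask can now be fed to the ONNX session directly,
// with no int -> long widening copy.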

Describe alternatives you've considered
I have implemented my own batch tokenizer: https://github.com/tjwald/high-perf-ML/blob/develop/ML.Infra/Tokenization/PretrainedTokenizer.cs.

Additional context
Continuing the tokenization part of microsoft/semantic-kernel#9793.

@tjwald tjwald added the enhancement New feature or request label Jan 23, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged label Jan 23, 2025
@tarekgh tarekgh added this to the ML.NET 5.0 milestone Mar 3, 2025
@tarekgh tarekgh added Tokenizers and removed untriaged New issue has not been triaged labels Mar 3, 2025
@tarekgh tarekgh self-assigned this Mar 3, 2025
Labels
enhancement New feature or request Tokenizers
Projects
None yet
Development

No branches or pull requests

3 participants