Is your feature request related to a problem? Please describe.
Most AI systems use batching for performance reasons, which requires all tokenized sentences in a batch to be padded to the same length, along with a mask indicating which positions are padding.
In my project I had to implement this myself. The main issues are performance and API compatibility with the rest of the ecosystem.
With my solution there are megabytes of allocations:
- The Int64[] allocations come from the widening that has to be done because the ONNX model needs a Tensor<long> as input.
- The Int32[] allocations are the actual token IDs.
- The string allocations are token strings that are never used and are thrown away.
- The other allocations are internal, and I don't know what they are.
Describe the solution you'd like
Enable a zero-allocation solution via an API like the following:
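Something along these lines (a rough sketch only: the EncodeBatch name, its signature, and the padTokenId parameter are illustrative, and the body just spells out the intended semantics on top of the existing EncodeToIds, which still allocates today):

```csharp
using System;
using Microsoft.ML.Tokenizers;

public static class BatchTokenizerExtensions
{
    // Sketch of the proposed API: encode a batch of sentences directly into
    // caller-provided buffers. 'tokenIds' and 'attentionMask' are flattened
    // [inputs.Length x maxTokenCount] buffers; positions past each sentence's
    // length get the pad token and a 0 in the mask.
    public static void EncodeBatch(
        this Tokenizer tokenizer,
        ReadOnlySpan<string> inputs,
        Span<long> tokenIds,       // long, to match the ONNX model's input type
        Span<long> attentionMask,  // 1 = real token, 0 = padding
        int maxTokenCount,
        long padTokenId = 0)
    {
        for (int row = 0; row < inputs.Length; row++)
        {
            // Reference semantics only: a real implementation would write the
            // token IDs straight into the span instead of allocating a list.
            var ids = tokenizer.EncodeToIds(inputs[row]);
            Span<long> tokenRow = tokenIds.Slice(row * maxTokenCount, maxTokenCount);
            Span<long> maskRow = attentionMask.Slice(row * maxTokenCount, maxTokenCount);
            for (int i = 0; i < maxTokenCount; i++)
            {
                bool isToken = i < ids.Count;
                // int -> long widening happens per element, not via a new array
                tokenRow[i] = isToken ? ids[i] : padTokenId;
                maskRow[i] = isToken ? 1 : 0;
            }
        }
    }
}
```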
Maybe instead of Tensor<T> you want to use TensorSpan<T>?
With this API the string allocations are removed when they're not needed, and the other internal allocations can be optimized.
This API would enable me to pool tensors and remove the int-to-long casting for my models.
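For example, with pooled buffers the steady-state tokenization path would not touch the managed heap at all. A sketch of what my calling code could look like (assuming the hypothetical EncodeBatch above):

```csharp
using System;
using System.Buffers;
using Microsoft.ML.Tokenizers;

public static class Example
{
    public static void TokenizeBatchPooled(Tokenizer tokenizer, string[] batch)
    {
        const int MaxTokens = 128;
        int length = batch.Length * MaxTokens;

        // Rent the long buffers once per call (or keep them per worker) instead
        // of allocating fresh Int64[] arrays for every batch.
        long[] tokenBuffer = ArrayPool<long>.Shared.Rent(length);
        long[] maskBuffer = ArrayPool<long>.Shared.Rent(length);
        try
        {
            tokenizer.EncodeBatch(batch,
                tokenBuffer.AsSpan(0, length),
                maskBuffer.AsSpan(0, length),
                MaxTokens);

            // Hand the buffers to the ONNX session as [batch.Length, MaxTokens]
            // int64 input tensors: no widening pass, no throwaway arrays.
        }
        finally
        {
            ArrayPool<long>.Shared.Return(tokenBuffer);
            ArrayPool<long>.Shared.Return(maskBuffer);
        }
    }
}
```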
Describe alternatives you've considered
I have implemented my own batch tokenizer: https://github.com/tjwald/high-perf-ML/blob/develop/ML.Infra/Tokenization/PretrainedTokenizer.cs.
Additional context
This continues the tokenization part of microsoft/semantic-kernel#9793.