Performance: Accelerate BERTweet CPU Inference via ONNX Runtime & Hugging Face Optimum #18
Description 📝
While analyzing the sentiment pipeline for the upcoming GSoC cycle, I profiled the current execution of finiteautomata/bertweet-base-sentiment-analysis. Since config.yaml defaults to CPU execution, running the model through native PyTorch Transformers introduces significant latency during the sequential processing of audio chunks.
Problem
- Native PyTorch inference on CPU for timestamped video/audio segments creates a massive bottleneck.
- High memory overhead limits the number of concurrent user sessions the API can handle.
Proposed Solution
Instead of rewriting the entire inference engine, we can swap the native Hugging Face pipeline with optimum.onnxruntime.
- Export the BERTweet model to the ONNX graph format dynamically via ORTModelForSequenceClassification.
- Execute the pipeline using the onnxruntime backend.
This maintains 100% backward compatibility with the existing Flask routes but will yield a ~2x-3x speedup on CPU inference and lower the RAM footprint.
I am working on this implementation locally and will open a PR shortly to demonstrate the latency reduction!