Performance: Accelerate BERTweet CPU Inference via ONNX Runtime & Hugging Face Optimum #18
Description 📝
While analyzing the sentiment pipeline for the upcoming GSoC cycle, I profiled the current execution of finiteautomata/bertweet-base-sentiment-analysis. Since config.yaml defaults to CPU execution, running the model through native PyTorch Transformers introduces significant latency during the sequential processing of audio chunks.
Problem
- Native PyTorch inference on CPU for timestamped video/audio segments creates a massive bottleneck.
- High memory overhead limits the number of concurrent user sessions the API can handle.
Proposed Solution
Instead of rewriting the entire inference engine, we can swap the native Hugging Face pipeline with optimum.onnxruntime.
- Export the BERTweet model to the ONNX graph format dynamically via ORTModelForSequenceClassification.
- Execute the pipeline using the onnxruntime backend.
This maintains 100% backward compatibility with the existing Flask routes but will yield a ~2x-3x speedup on CPU inference and lower the RAM footprint.
I am working on this implementation locally and will open a PR shortly to demonstrate the latency reduction!