Performance: Accelerate BERTweet CPU Inference via ONNX Runtime & Hugging Face Optimum #18

@kushagarwal2910-lang

Description

While analyzing the sentiment pipeline for the upcoming GSoC cycle, I profiled the current execution of finiteautomata/bertweet-base-sentiment-analysis. Since config.yaml defaults to CPU execution, the native PyTorch Transformers pipeline introduces significant latency when processing audio chunks sequentially.

Problem

  • Native PyTorch inference on CPU is a major bottleneck when classifying timestamped video/audio segments.
  • High memory overhead limits the number of concurrent user sessions the API can handle.

Proposed Solution

Instead of rewriting the entire inference engine, we can swap the native Hugging Face pipeline backend for optimum.onnxruntime:

  1. Export the BERTweet model to the ONNX graph format on the fly via ORTModelForSequenceClassification.
  2. Execute the pipeline using the onnxruntime backend.

This maintains full backward compatibility with the existing Flask routes, and should yield roughly a 2x-3x speedup on CPU inference with a lower RAM footprint.

I am working on this implementation locally and will open a PR shortly to demonstrate the latency reduction!
