Auto Batching Proxy

Auto Batching Proxy automatically batches inference requests from multiple independent users into a single batch request for efficiency. To each user the interface looks like an individual request, but internally the requests are handled as one batch. Essentially, it provides a REST API wrapper around an inference service such as https://github.com/huggingface/text-embeddings-inference.

The proxy server is configured with the following parameters:

Max Wait Time - maximum time a user request can wait for other requests to accumulate in a batch
Max Batch Size - maximum number of requests that can be accumulated in a batch.
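To illustrate how these two parameters interact, here is a minimal sketch of a batching loop: it flushes when either the batch is full or the oldest waiting request hits its deadline. It assumes a tokio mpsc channel feeding the loop; the names BatchItem and flush_batch are illustrative and not taken from this repository.

use std::time::Duration;
use tokio::{sync::{mpsc, oneshot}, time};

// Hypothetical pending request: the user's inputs plus a channel used to
// send that user's embeddings back once the shared batch response arrives.
struct BatchItem {
    inputs: Vec<String>,
    respond_to: oneshot::Sender<Vec<Vec<f32>>>,
}

// Illustrative batching loop: flush either when max_batch_size items have
// accumulated or when max_wait_time has elapsed since the first item arrived.
async fn batching_loop(
    mut rx: mpsc::Receiver<BatchItem>,
    max_batch_size: usize,
    max_wait_time: Duration,
) {
    loop {
        // Block until the first request of the next batch arrives.
        let Some(first) = rx.recv().await else { return };
        let deadline = time::Instant::now() + max_wait_time;
        let mut batch = vec![first];

        // Keep accumulating until the batch is full or the deadline passes.
        while batch.len() < max_batch_size {
            match time::timeout_at(deadline, rx.recv()).await {
                Ok(Some(item)) => batch.push(item),
                _ => break, // deadline hit, or all senders dropped
            }
        }

        // In the real proxy this is where one batched request would be sent
        // to the inference backend and the results fanned back out per item.
        flush_batch(batch).await;
    }
}

async fn flush_batch(_batch: Vec<BatchItem>) { /* illustrative stub */ }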

Setup Inference Service

First, try running the inference service in a container with --model-id nomic-ai/nomic-embed-text-v1.5:

docker run --rm -it -p 8080:80 --pull always \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id nomic-ai/nomic-embed-text-v1.5

If it fails to start, try one of the alternatives. Currently, the code is verified to work with
--model-id sentence-transformers/all-MiniLM-L6-v2 and --model-id sentence-transformers/all-mpnet-base-v2. Check /screenshots for some of the models that were tried.

docker run --rm -it -p 8080:80 --pull always \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id sentence-transformers/all-MiniLM-L6-v2

Note: the backend does not support a batch size greater than 8. The proxy respects this limit (as well as the backend's maximum number of inputs, which is 32 for all-MiniLM-L6-v2) and will never send more requests in one upstream call than the supported batch size.
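As a rough sketch of that splitting step (the helper name and the test are illustrative, not the repository's actual code), accumulated inputs can simply be chunked so that no upstream request exceeds the backend limit:

/// Illustrative helper: split accumulated inputs into chunks no larger than
/// the backend's supported batch size (8 in the setup above), so each
/// upstream request stays within the limit.
fn split_into_backend_batches(inputs: Vec<String>, backend_max: usize) -> Vec<Vec<String>> {
    inputs
        .chunks(backend_max.max(1)) // guard against a zero limit
        .map(|chunk| chunk.to_vec())
        .collect()
}

#[test]
fn splits_respect_backend_limit() {
    let inputs: Vec<String> = (0..20).map(|i| format!("text {i}")).collect();
    let batches = split_into_backend_batches(inputs, 8);
    assert_eq!(batches.len(), 3); // 8 + 8 + 4
    assert!(batches.iter().all(|b| b.len() <= 8));
}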

Setup Proxy Service

RUST_LOG=INFO cargo run -- --max-batch-size 50 --max-wait-time-ms 3000

Unit tests
Relevant unit tests are provided inside the /src source files.

Integration tests
Check the /tests folder; the code is covered with various scenarios.

Run all tests via cargo test. Currently, the tests are verified to pass against
--model-id sentence-transformers/all-MiniLM-L6-v2 and --model-id sentence-transformers/all-mpnet-base-v2. They also document how and why particular parts of the code were written for particular use cases.

Use the following simple curl commands for quick testing:

  • for the inference service
curl -X POST http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["Hello world"]}'
  • for the proxy
curl -X POST http://localhost:3000/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["Hello", "World"]}'

To verify that the proxy is working for multiple concurrent requests:

cd scripts 
./proxy_concurrent_calls.sh
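The same check can be done without the script. The following is a rough Rust equivalent, not part of the repository, that assumes the proxy listens on localhost:3000 and returns one embedding vector per input; it uses tokio, reqwest (with the "json" feature), and serde_json.

use serde_json::json;

// Fire several independent single-input requests at the proxy concurrently.
// If batching works, the proxy should forward them to the backend as one
// (or a few) batched calls while each client still gets its own response.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();

    let tasks: Vec<_> = (0..10)
        .map(|i| {
            let client = client.clone();
            tokio::spawn(async move {
                client
                    .post("http://localhost:3000/embed")
                    .json(&json!({ "inputs": [format!("request {i}")] }))
                    .send()
                    .await?
                    .json::<Vec<Vec<f32>>>()
                    .await
            })
        })
        .collect();

    for task in tasks {
        let embeddings = task.await.expect("task panicked")?;
        println!("got {} embedding(s)", embeddings.len());
    }
    Ok(())
}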

Benchmark test results:
The following output is taken from:

$ RUSTFLAGS="-A dead_code" cargo test test_compare_single_input_inference_service_vs_auto_batching_proxy_with_x_separate_requests -- --nocapture
  • -A is a shortcut for allow (it suppresses warnings for functions reported as unused, even though they are actually used in tests)
  • --nocapture shows the test's console output instead of capturing it

Full output: timing_summary.png
