[Workers AI] Add embedding model token and batch size limits #18975

Open · wants to merge 1 commit into base: production
54 changes: 34 additions & 20 deletions src/content/docs/workers-ai/platform/limits.mdx
@@ -3,10 +3,9 @@ pcx_content_type: configuration
title: Limits
sidebar:
  order: 2
---

import { Render } from "~/components";

Workers AI is now Generally Available. We've updated our rate limits to reflect this.

@@ -20,48 +19,63 @@ Rate limits are default per task type, with some per-model limits defined as follows:

### [Automatic Speech Recognition](/workers-ai/models/#automatic-speech-recognition)

- 720 requests per minute

### [Image Classification](/workers-ai/models/#image-classification)

- 3000 requests per minute

### [Image-to-Text](/workers-ai/models/#image-to-text)

- 720 requests per minute

### [Object Detection](/workers-ai/models/#object-detection)

- 3000 requests per minute

### [Summarization](/workers-ai/models/#summarization)

- 1500 requests per minute

### [Text Classification](/workers-ai/models/#text-classification)

- 2000 requests per minute

### [Text Embeddings](/workers-ai/models/#text-embeddings)

- 3000 requests per minute
- [@cf/baai/bge-large-en-v1.5](/workers-ai/models/bge-large-en-v1.5/) is 1500 requests per minute

#### Additional limits for Embedding Models

When using `@cf/baai/bge` embedding models, the following limits apply (see the sketch after this list):

- The maximum token limit per input is 512 tokens.
- The maximum batch size is 100 inputs per request.
- The total number of tokens across all inputs in the batch must not exceed internal processing limits.
- Larger inputs (closer to 512 tokens) may reduce the maximum batch size due to these constraints.
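
A rough sketch of a request that stays within these limits, assuming a Workers AI binding named `AI`; the `Env` interface and the sample inputs are illustrative, and `@cf/baai/bge-base-en-v1.5` is used only as an example `@cf/baai/bge` model:

```ts
export interface Env {
	AI: Ai; // Workers AI binding (assumed name), configured in wrangler.toml
}

export default {
	async fetch(request: Request, env: Env): Promise<Response> {
		// At most 100 inputs per request, each input at most 512 tokens.
		const inputs = ["First document to embed", "Second document to embed"];

		const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
			text: inputs,
		});

		// embeddings.data holds one vector per input.
		return Response.json(embeddings);
	},
} satisfies ExportedHandler<Env>;
```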

#### Behavior and constraints

1. Exceeding the batch size limit: If more than 100 inputs are provided, a `400 Bad Request` error is returned.
2. Exceeding the token limit per input: If a single input exceeds 512 tokens, the request fails with a `400 Bad Request` error.
3. Combined constraints: Requests with both a large batch size and long inputs may fail due to exceeding the model's processing limits; the sketch below shows one way to chunk requests to stay within them.
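
A hedged sketch of guarding against these errors client-side, reusing the `Env` binding from the sketch above; the `embedAll` helper and the four-characters-per-token estimate are assumptions, not the model's actual tokenizer:

```ts
const MAX_BATCH_SIZE = 100; // documented batch limit
const MAX_TOKENS_PER_INPUT = 512; // documented per-input limit

// Crude estimate (~4 characters per token for English prose); the real
// tokenizer may count differently, so leave yourself a margin.
function roughTokenCount(text: string): number {
	return Math.ceil(text.length / 4);
}

async function embedAll(env: Env, texts: string[]): Promise<number[][]> {
	// One oversized input fails the whole request with 400 Bad Request,
	// so reject (or truncate/split) before calling the model.
	if (texts.some((t) => roughTokenCount(t) > MAX_TOKENS_PER_INPUT)) {
		throw new Error("An input likely exceeds the 512-token limit; split or truncate it first.");
	}

	// Chunk the inputs so no single request exceeds the batch size limit.
	const vectors: number[][] = [];
	for (let i = 0; i < texts.length; i += MAX_BATCH_SIZE) {
		const batch = texts.slice(i, i + MAX_BATCH_SIZE);
		const result = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: batch });
		vectors.push(...result.data);
	}
	return vectors;
}
```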

### [Text Generation](/workers-ai/models/#text-generation)

- 300 requests per minute
- [@hf/thebloke/mistral-7b-instruct-v0.1-awq](/workers-ai/models/mistral-7b-instruct-v0.1-awq/) is 400 requests per minute
- [@cf/microsoft/phi-2](/workers-ai/models/phi-2/) is 720 requests per minute
- [@cf/qwen/qwen1.5-0.5b-chat](/workers-ai/models/qwen1.5-0.5b-chat/) is 1500 requests per minute
- [@cf/qwen/qwen1.5-1.8b-chat](/workers-ai/models/qwen1.5-1.8b-chat/) is 720 requests per minute
- [@cf/qwen/qwen1.5-14b-chat-awq](/workers-ai/models/qwen1.5-14b-chat-awq/) is 150 requests per minute
- [@cf/tinyllama/tinyllama-1.1b-chat-v1.0](/workers-ai/models/tinyllama-1.1b-chat-v1.0/) is 720 requests per minute

### [Text-to-Image](/workers-ai/models/#text-to-image)

- 720 requests per minute
- [@cf/runwayml/stable-diffusion-v1-5-img2img](/workers-ai/models/stable-diffusion-v1-5-img2img/) is 1500 requests per minute

### [Translation](/workers-ai/models/#translation)

- 720 requests per minute
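
When any of the per-minute limits above is exceeded, further requests are rejected until the window resets. A minimal retry-with-backoff sketch against the REST endpoint, assuming rate-limited responses use HTTP `429`; the account ID, token, and backoff schedule are placeholders:

```ts
const ACCOUNT_ID = "<account_id>"; // placeholder
const API_TOKEN = "<api_token>"; // placeholder

async function runWithBackoff(
	model: string,
	payload: unknown,
	maxAttempts = 3,
): Promise<unknown> {
	const url = `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/${model}`;
	for (let attempt = 1; attempt <= maxAttempts; attempt++) {
		const response = await fetch(url, {
			method: "POST",
			headers: {
				Authorization: `Bearer ${API_TOKEN}`,
				"Content-Type": "application/json",
			},
			body: JSON.stringify(payload),
		});
		// Assumption: rate-limited requests return HTTP 429.
		if (response.status !== 429) {
			return response.json();
		}
		await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000));
	}
	throw new Error(`Still rate limited after ${maxAttempts} attempts`);
}
```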