
feat Adding tenant based rate limiting based on input tokens for HTTP router #1397

Open
shenoyvvarun wants to merge 1 commit into
lightseekorg:main from
shenoyvvarun:vasheno/input_token_limiting

Conversation

@shenoyvvarun shenoyvvarun commented Apr 27, 2026

Description

Problem

  • This is a continuation of PR feat(smg): add tenant resolution request metadata #1312, extending it to handle token-level throttling.
  • The design introduces a new rate_limit subsystem that performs token-level admission and settlement for tenant/model fairness.
  • Middleware resolves tenant metadata; handlers pass it explicitly into routers.
  • Routers reserve once, before any retry, using estimated or exact input tokens, and response processors settle using the normalized final usage.
  • Policies come from startup config in v1, compile into immutable snapshots, and resolve to one tenant-global scope plus at most one model-rule scope. The local backend uses per-scope buckets with reserve-once/settle-later accounting; the external backend mirrors the same reservation contract for future centralized enforcement.
  • Key design consideration: the tokenizer is invoked on the request path to produce the input reservation. This can be an estimate rather than an exact count, with the difference settled later (a sketch of the intended contract follows this list).
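A minimal sketch of that reserve-once / settle-later contract, assuming it is expressed as a trait with an explicit handle. All names here (TokenReservation, ReservationHandle, settle, abandon) are illustrative only and are not code shipped by this PR:

```rust
// Illustrative sketch of the planned reserve-once / settle-later contract.
// The PR currently only implements check_and_consume-style admission on
// estimated input tokens; this is the direction the description sketches.
pub struct Usage {
    pub input_tokens: u64,
    pub output_tokens: u64,
}

pub struct RateLimitError {
    /// None means a terminal rejection that should not be retried.
    pub retry_after_secs: Option<u64>,
}

pub struct ReservationHandle {
    pub charge_id: String,    // charge UUID minted by the metadata middleware
    pub reserved_tokens: u64, // estimated (or exact) input tokens reserved up front
}

pub trait TokenReservation {
    /// Reserve once, before the first attempt; retries reuse the same handle.
    fn reserve(
        &self,
        tenant_key: &str,
        estimated_input_tokens: u64,
    ) -> Result<ReservationHandle, RateLimitError>;

    /// Settle against the normalized final usage from the response or stream.
    fn settle(&self, handle: ReservationHandle, usage: &Usage);

    /// Abandon if the body is dropped while the reservation is still open.
    fn abandon(&self, handle: ReservationHandle);
}
```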

High level Design

Incoming Request
      |
      v
+----------------------+
| RequestIdLayer       |  correlation only
+----------------------+
      |
      v
+----------------------+
| Serving Auth         |  may attach DataPlaneCaller
+----------------------+
      |
      v
+----------------------+
| Concurrency Gate     |  legacy overload / DoS guard
+----------------------+
      |
      v
+----------------------+
| Rate Limit Metadata  |
| - resolve TenantKey  |
| - make charge UUID   |
+----------------------+
      |
      v
+----------------------+
| Handler / Router     |
| gets RouteRequestMeta|
| + resolved model_id  |
+----------------------+
      |
      v
+-------------------------------+
| Estimate / Exact Input Tokens |
| gRPC: preparation once        |
| HTTP: structured estimate     |
+-------------------------------+
      |
      v
+-------------------------------+
| Reserve Once                  |
| tenant/global                 |
| + optional tenant/rule/<id>   |
+-------------------------------+
      |
      v
+-------------------------------+
| RetryExecutor                 |
| retries reuse reservation     |
+-------------------------------+
      |
      v
+-------------------------------+
| Final Response / Stream       |
| settle from Usage, or close   |
+-------------------------------+
      |
      v
+-------------------------------+
| AttachedBody Drop Safety      |
| abandon if still open         |
+-------------------------------+

Next steps

  • Only input tokens are charged at the moment; output tokens are next.

  • No reserve-once / settle-later reservation handle yet

  • Per-model layered scopes

  • Local backend only; no external backend

  • No explicit request metadata / charge-id plumbing yet

  • No exact-token gRPC pre-retry preparation split yet

  • Extend the limiter from input-token-only admission to full input + output token accounting

  • Introduce reserve-once / settle-later semantics so retries do not double-charge

  • Add a shared reservation handle with explicit terminal outcomes (see the sketch below):
    • success with usage
    • success without usage
    • terminal failure after admission
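A hedged sketch of what those terminal outcomes might look like as an enum; this type is hypothetical and does not exist in the PR yet:

```rust
// Hypothetical shape for the planned reservation outcomes; not part of this PR.
pub enum ReservationOutcome {
    /// Upstream reported usage: settle the reservation against exact token counts.
    SuccessWithUsage { input_tokens: u64, output_tokens: u64 },
    /// Upstream succeeded but reported no usage: keep the original estimate as the charge.
    SuccessWithoutUsage,
    /// The request failed after admission (worker error, client disconnect):
    /// the reservation must still reach a terminal state rather than leak.
    TerminalFailureAfterAdmission,
}
```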

Solution

  • The current implementation adds a local tenant limiter that enforces per-tenant budgets using estimated input tokens, plus optional per-tenant requests-per-minute limits.
    It wires the limiter into app startup and HTTP routing, and adds unit tests for limiter behavior and policy lookup (a rough sketch of the admission path follows).
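The sketch below approximates that admission path using symbols this PR introduces (LocalTokenRateLimiter, check_and_consume, per-tenant buckets with elapsed-time refill). The field names, the DashMap/Mutex layout, the default numbers, and the refill math are assumptions for illustration, not the PR's actual code:

```rust
use std::sync::Mutex;
use std::time::Instant;

use dashmap::DashMap;

// Assumed shapes; the real types live under model_gateway/src/rate_limit/.
struct TenantTokenPolicy { tokens_per_minute: u64, requests_per_minute: u64 }
struct BucketState { tokens: f64, requests: f64, last_refill: Instant }

pub struct LocalTokenRateLimiter {
    buckets: DashMap<String, Mutex<(TenantTokenPolicy, BucketState)>>,
}

impl LocalTokenRateLimiter {
    /// Admit the request if the tenant has budget for the estimated input tokens
    /// and one more request this minute; otherwise return a retry-after hint in seconds.
    pub fn check_and_consume(&self, tenant_key: &str, estimated_tokens: u64) -> Result<(), u64> {
        let entry = self.buckets.entry(tenant_key.to_string()).or_insert_with(|| {
            // Default policy used when the tenant has no explicit override (values assumed).
            let policy = TenantTokenPolicy { tokens_per_minute: 60_000, requests_per_minute: 600 };
            let state = BucketState {
                tokens: policy.tokens_per_minute as f64,
                requests: policy.requests_per_minute as f64,
                last_refill: Instant::now(),
            };
            Mutex::new((policy, state))
        });
        let mut guard = entry.lock().unwrap();
        let (policy, state) = &mut *guard;

        // Refill both budgets proportionally to the time elapsed since the last check.
        let elapsed = state.last_refill.elapsed().as_secs_f64();
        state.tokens = (state.tokens + elapsed * policy.tokens_per_minute as f64 / 60.0)
            .min(policy.tokens_per_minute as f64);
        state.requests = (state.requests + elapsed * policy.requests_per_minute as f64 / 60.0)
            .min(policy.requests_per_minute as f64);
        state.last_refill = Instant::now();

        if state.tokens >= estimated_tokens as f64 && state.requests >= 1.0 {
            state.tokens -= estimated_tokens as f64;
            state.requests -= 1.0;
            Ok(())
        } else {
            // Rough hint: seconds until enough tokens would refill for this request.
            let deficit = (estimated_tokens as f64 - state.tokens).max(0.0);
            let secs = (deficit / (policy.tokens_per_minute as f64 / 60.0)).ceil() as u64;
            Err(secs.max(1))
        }
    }
}
```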

Test cases (Passed)

1. Tenant isolation

Goal: prove one tenant does not affect another.

Test:
exhaust team-a
immediately send same request as team-b

Expected:
team-a gets 429
team-b still succeeds

2. Default policy fallback

Goal: verify tenants without explicit override use the default policy.

Test:
send requests with:
x-smg-tenant-id: unknown-team
do not define header:unknown-team in the CLI config

Expected:
the default TPM/RPM limits apply, not unlimited (see the configuration sketch below)
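For reference, a hedged sketch of how this scenario could be expressed with the builder API named in the review discussion below (RouterConfigBuilder, multi_tenant_rate_limit_enabled, default_tokens_per_minute, default_requests_per_minute, tenant_rate_limit); the import path, argument types, and numeric limits are assumptions:

```rust
// Usage sketch only; method names come from this PR's review discussion,
// while the import path and the numbers are assumed for illustration.
use model_gateway::config::RouterConfigBuilder;

fn build_example_config() {
    let config = RouterConfigBuilder::new()
        .multi_tenant_rate_limit_enabled(true)
        .default_tokens_per_minute(50_000)   // unknown-team falls back to these defaults
        .default_requests_per_minute(100)
        .tenant_rate_limit("header:team-a", 10_000, 20) // explicit per-tenant override
        .build()
        .unwrap();
    assert!(config.multi_tenant_rate_limit.enabled);
}
```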

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • Multi-tenant in-memory rate limiting with configurable per-tenant token and request budgets
    • CLI flags to enable rate limiting, set default budgets, and provide repeatable per-tenant overrides
    • Token estimation (model-aware tokenizer with heuristic fallback) to enforce tenant budgets before proxying
    • Standardized rate-limit error responses (including terminal rejections)
  • Behavior Changes

    • Requests exceeding limits return rate-limit responses (Retry-After when applicable)
    • Invalid per-tenant override strings now fail startup validation


coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a multi-tenant rate-limiting subsystem: new rate-limit crate module (types, local in-memory limiter, error helper), config and CLI hooks, AppContext wiring, tokenizer-aware token estimation, and per-tenant enforcement in HTTP/PD routers with early rate-limit responses.

Changes

Cohort / File(s): Summary

  • Rate Limit Core (model_gateway/src/rate_limit/types.rs, model_gateway/src/rate_limit/local.rs, model_gateway/src/rate_limit/error.rs, model_gateway/src/rate_limit/mod.rs):
    New multi-tenant config/models, LocalTokenRateLimiter with per-tenant buckets/refill/eviction, terminal vs retryable rejection semantics, a rate_limit_exceeded_response helper, and public re-exports.
  • Configuration & Builder (model_gateway/src/config/types.rs, model_gateway/src/config/builder.rs, model_gateway/src/main.rs):
    Adds multi_tenant_rate_limit to RouterConfig; builder APIs for enable/defaults/per-tenant overrides; CLI parsing for repeatable tenant_key:tpm:rpm overrides with validation and unit tests for duplicate/round-trip behavior.
  • App Context & Crate Export (model_gateway/src/app_context.rs, model_gateway/src/lib.rs):
    AppContext/AppContextBuilder gain pub token_rate_limiter: Option<Arc<LocalTokenRateLimiter>>; the crate exports the new rate_limit module.
  • HTTP Routers & PD Router (model_gateway/src/routers/http/router.rs, model_gateway/src/routers/http/pd_router.rs):
    Routers now include a TokenizerRegistry and an optional app_token_rate_limiter; add tokenizer-backed token estimation (fallback chars/4) and per-tenant check_and_consume with early rate-limit responses on generate/chat/completion/rerank/etc.; the PD router gets a manual Debug impl and new fields.
  • Router Construction, Tests & Service Discovery (model_gateway/src/routers/http/... tests and builder usages, model_gateway/src/service_discovery.rs):
    Router constructors and test fixtures updated to pass the tokenizer registry and optional limiter; test AppContext initialization explicitly sets token_rate_limiter = None.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Router
    participant TokenizerRegistry
    participant RateLimiter as LocalTokenRateLimiter
    participant Bucket as TenantBucket
    participant Upstream
    participant ClientResponse as Response

    Client->>Router: HTTP request (tenant_key, payload)
    Router->>TokenizerRegistry: estimate_tokens(payload)
    Router->>RateLimiter: check_and_consume(tenant_key, estimated_tokens)
    RateLimiter->>Bucket: lookup/create tenant bucket
    Bucket->>Bucket: refill tokens/requests based on elapsed time
    Bucket->>Bucket: evaluate admission (tokens & requests)
    alt admitted
        Bucket->>RateLimiter: Ok
        Router->>Upstream: proxy request to handler/service
        Upstream->>Router: 200/response
        Router->>ClientResponse: return 200
        ClientResponse->>Client: 200 OK
    else rejected
        Bucket->>RateLimiter: Err(retry_after_secs)
        RateLimiter->>Router: Err(retry_after_secs)
        Router->>ClientResponse: 429 (or terminal 413) with optional Retry-After
        ClientResponse->>Client: 429/413 Too Many Requests / Payload Too Large
    end
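For the rejected branch above, a minimal sketch of how a retry hint could map onto an HTTP response, built around the rate_limit_exceeded_response and TERMINAL_REJECTION_RETRY_AFTER_SECS names that appear in the review threads; the axum-style response types, the sentinel value, and the exact signature are assumptions:

```rust
use axum::http::{header::RETRY_AFTER, StatusCode};
use axum::response::{IntoResponse, Response};

// Assumed sentinel marking a request that can never fit the tenant's budget.
const TERMINAL_REJECTION_RETRY_AFTER_SECS: u64 = u64::MAX;

// Sketch of the helper; the PR's real signature and body may differ.
fn rate_limit_exceeded_response(retry_after_secs: u64) -> Response {
    if retry_after_secs == TERMINAL_REJECTION_RETRY_AFTER_SECS {
        // Terminal rejection: estimated tokens exceed the tenant's full capacity,
        // so retrying will never succeed (mapped to 413 rather than 429).
        return (
            StatusCode::PAYLOAD_TOO_LARGE,
            "request exceeds tenant token capacity",
        )
            .into_response();
    }
    // Retryable rejection: advertise when the bucket should have refilled.
    (
        StatusCode::TOO_MANY_REQUESTS,
        [(RETRY_AFTER, retry_after_secs.to_string())],
        "tenant rate limit exceeded",
    )
        .into_response()
}
```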

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

tests

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237

Poem

🐇 I count the tokens, soft and bright,
Tenants queue beneath the night.
Buckets hum and refill slow —
Hop, wait, retry; onward we go.
Small carrot cheers for every flow.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 41.94%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Title check ✅ Passed: The title accurately describes the main feature being added: tenant-based rate limiting using input tokens for the HTTP router.
  • Description check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.






@github-actions github-actions Bot added the model-gateway (Model gateway crate changes) label on Apr 27, 2026

mergify Bot commented Apr 27, 2026

Hi @shenoyvvarun, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window

@shenoyvvarun shenoyvvarun force-pushed the vasheno/input_token_limiting branch from 2ef701f to 619e73e on April 27, 2026 at 00:25

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a multi-tenant rate limiting system that enforces per-tenant limits on requests and tokens per minute. It includes a new LocalTokenRateLimiter integrated into the HTTP router, which utilizes model-specific tokenizers for usage estimation. The review feedback identifies a bug in the policy fallback logic where default limits were bypassed, suggests optimizing DashMap access to reduce lock contention, and recommends deferring tokenization until it is verified that token-based limits are applicable to the tenant.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ef701f02b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
model_gateway/src/app_context.rs (1)

514-534: 🧹 Nitpick | 🔵 Trivial

Method name no longer reflects what it does.

maybe_rate_limiter now configures two unrelated limiters (TokenBucket for max_concurrent_requests and LocalTokenRateLimiter for multi-tenant token budgets). Since these have independent enable conditions and lifecycles, consider splitting into maybe_rate_limiter and maybe_token_rate_limiter (chained from from_config) so the call site self-documents which feature each branch enables. Not a correctness issue, just keeps the builder steps cohesive as more limiter variants land in follow-up PRs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/app_context.rs` around lines 514 - 534, The method
maybe_rate_limiter currently configures both the concurrent-request TokenBucket
and the multi-tenant LocalTokenRateLimiter, which mixes independent features;
split this into two builder steps by leaving maybe_rate_limiter to only handle
max_concurrent_requests/TokenBucket and add a new maybe_token_rate_limiter (or
maybe_multi_tenant_rate_limiter) that sets token_rate_limiter from
RouterConfig.multi_tenant_rate_limit using LocalTokenRateLimiter; update the
call site (from_config) to call .maybe_rate_limiter(...) then
.maybe_token_rate_limiter(...) so each limiter (TokenBucket,
LocalTokenRateLimiter) has its own enable condition and lifecycle and the code
is self-documenting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/config/builder.rs`:
- Around line 602-616: The tenant_rate_limit method silently overwrites
duplicate keys when calling self.config.multi_tenant_rate_limit.tenants.insert;
change it to detect an existing entry (use HashMap::entry or contains_key/get)
and emit a tracing::warn! that includes the tenant_key and both the previous and
new policy before overwriting, or alternatively return a ConfigError from
build_with_validation for duplicates; ensure you clone/format the tenant_key (or
check using a borrowed key) so the log/error can include the key and policy
details.
- Around line 583-616: Add a unit test that round-trips the new builder methods
through RouterConfig::build and asserts the fields are set correctly: create a
test (e.g., test_builder_multi_tenant_rate_limit_round_trip) that uses
RouterConfigBuilder::new(), calls multi_tenant_rate_limit_enabled(true),
default_tokens_per_minute(...), default_requests_per_minute(...), and
tenant_rate_limit(...) for two tenants, then .build().unwrap() and assert
config.multi_tenant_rate_limit.enabled, default_tokens_per_minute,
default_requests_per_minute, the specific tenant entries (tokens_per_minute and
requests_per_minute) for both tenants, and that tenants.len() == 2 to ensure no
regressions in the builder methods (multi_tenant_rate_limit_enabled,
default_tokens_per_minute, default_requests_per_minute, tenant_rate_limit).

In `@model_gateway/src/rate_limit/local.rs`:
- Around line 81-100: Before computing debt and retry_after, check for the
terminal condition where estimated_tokens (or need_tokens) exceeds the bucket's
full capacity (bucket.policy.tokens_per_minute > 0 and estimated_tokens as f64 >
bucket.policy.tokens_per_minute as f64); if true, return a
non-retryable/terminal rejection instead of computing token_retry so callers
don't repeatedly retry a request that will never fit. Modify the logic around
token_denied/token_retry in local.rs (the branch that computes
debt/(tokens_per_minute as f64 / 60.0)) to perform this capacity check first and
use the existing API for a terminal rejection rather than a finite retry_after.

In `@model_gateway/src/rate_limit/types.rs`:
- Around line 28-38: The fallback currently returns the zeroed singleton
DEFAULT_POLICY from the method that looks up tenant policy (the
tenants.get(...).or_else(...) block), which ignores configured defaults
(default_tokens_per_minute/default_requests_per_minute) and causes unlimited
behavior; change the fallback to construct and return a TenantTokenPolicy
populated from self.default_tokens_per_minute and
self.default_requests_per_minute instead of &DEFAULT_POLICY (i.e., produce an
owned TenantTokenPolicy with the configured values), and then remove the extra
.cloned() at the caller in LocalTokenRateLimiter::check_and_consume since the
method will already return an owned policy.

In `@model_gateway/src/routers/http/router.rs`:
- Around line 96-108: enforce_token_rate_limit currently commits token usage via
limiter.check_and_consume(...) which can overcharge on retries or pre-flight
failures; add metrics to make this visible: after a successful check_and_consume
call inside enforce_token_rate_limit increment an "tokens_admitted" counter (or
a scoped meter on app_token_rate_limiter), and then update the error paths that
can occur after admission (e.g., in route_typed_request_once,
select_worker_for_model failure paths, and
RetryExecutor::execute_response_with_retry when upstream returns 5xx or client
disconnects) to increment an "admitted_but_failed" counter so operators can
monitor overcharged requests; use the existing limiter or global metrics
facility and refer to the functions enforce_token_rate_limit, check_and_consume,
rate_limit_exceeded_response, route_typed_request_once, select_worker_for_model,
and RetryExecutor::execute_response_with_retry to locate where to increment each
metric.
- Around line 79-94: The count_tokens function currently falls back to
(text.chars().count() / 4) when tokenizer lookup or encode fails, which
underestimates tokens for CJK and code; update the fallback in router.rs (inside
count_tokens, the branch handling self.tokenizer_registry.get(model_id) being
None and the encode unwrap_or_else case) to use a conservative estimator such as
text.chars().count().max(1) or bytes()/3 (e.g., text.len() / 3). Keep the final
.max(1) to avoid zero, and apply the same conservative fallback to both the
missing-tokenizer path and the encode-error path; also add a short doc comment
on the public limiter type mentioning the estimator is conservative and that
configured TPM is not a hard ceiling when tokenizer is missing.

---

Outside diff comments:
In `@model_gateway/src/app_context.rs`:
- Around line 514-534: The method maybe_rate_limiter currently configures both
the concurrent-request TokenBucket and the multi-tenant LocalTokenRateLimiter,
which mixes independent features; split this into two builder steps by leaving
maybe_rate_limiter to only handle max_concurrent_requests/TokenBucket and add a
new maybe_token_rate_limiter (or maybe_multi_tenant_rate_limiter) that sets
token_rate_limiter from RouterConfig.multi_tenant_rate_limit using
LocalTokenRateLimiter; update the call site (from_config) to call
.maybe_rate_limiter(...) then .maybe_token_rate_limiter(...) so each limiter
(TokenBucket, LocalTokenRateLimiter) has its own enable condition and lifecycle
and the code is self-documenting.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2f8600a7-8717-4fb3-a379-34b3c64a181e

📥 Commits

Reviewing files that changed from the base of the PR and between e4d4452 and 2ef701f.

📒 Files selected for processing (11)
  • model_gateway/src/app_context.rs
  • model_gateway/src/config/builder.rs
  • model_gateway/src/config/types.rs
  • model_gateway/src/lib.rs
  • model_gateway/src/main.rs
  • model_gateway/src/rate_limit/error.rs
  • model_gateway/src/rate_limit/local.rs
  • model_gateway/src/rate_limit/mod.rs
  • model_gateway/src/rate_limit/types.rs
  • model_gateway/src/routers/http/router.rs
  • model_gateway/src/service_discovery.rs

@slin1237 slin1237 added the enhancement (New feature or request) and priority:high (High priority) labels on Apr 27, 2026
@slin1237 slin1237 self-assigned this Apr 27, 2026
@shenoyvvarun shenoyvvarun force-pushed the vasheno/input_token_limiting branch from 619e73e to bb99ad1 on April 27, 2026 at 04:38

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb99ad1fdb



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
model_gateway/src/routers/http/router.rs (1)

1252-1276: ⚠️ Potential issue | 🟡 Minor

Apply rate limit to route_rerank.

route_rerank ignores _tenant_meta and bypasses tenant rate limiting, unlike route_chat, route_completion, route_responses, route_embeddings, and route_classify. Since RerankRequest implements GenerationRequest, the same enforce_token_rate_limit check can be applied. Rerank payloads contain a query plus candidate documents that consume meaningful input tokens, so this endpoint should respect tenant TPM quotas.

♻️ Apply rate limit to rerank
     async fn route_rerank(
         &self,
         headers: Option<&HeaderMap>,
-        _tenant_meta: &TenantRequestMeta,
+        tenant_meta: &TenantRequestMeta,
         body: &RerankRequest,
         model_id: &str,
     ) -> Response {
+        if let Some(response) = self.enforce_token_rate_limit(tenant_meta, body, model_id) {
+            return response;
+        }
         let response = self
             .route_typed_request(headers, body, "/v1/rerank", model_id)
             .await;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/http/router.rs` around lines 1252 - 1276, Add
tenant token-rate enforcement to route_rerank the same way other generation
endpoints do: before calling route_typed_request inside async fn route_rerank,
call enforce_token_rate_limit(_tenant_meta, body) (treating body as a
GenerationRequest) and if it returns an Err response return that response
immediately; only proceed to route_typed_request and the existing
build_rerank_response on success. Reference route_rerank,
enforce_token_rate_limit, and GenerationRequest to locate and mirror the
rate-limit check pattern used in route_chat/route_completion/route_embeddings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/main.rs`:
- Around line 1292-1305: The parsing of self.tenant_rate_limits using rsplitn(3,
':') currently allows an empty tenant_key (e.g. ":1000:10") which silently
installs a global override; update the validation in the loop that builds
tenant_rate_limit so that after extracting tenant_key, tpm and rpm you also
reject empty or all-whitespace tenant keys (use tenant_key.trim().is_empty() or
tenant_key.is_empty()) and return ConfigError::ValidationFailed with a clear
message when the tenant_key is empty instead of calling
builder.tenant_rate_limit; keep existing checks for tpm/rpm and only call
builder.tenant_rate_limit(tenant_key, tpm, rpm) when tenant_key is non-empty.

In `@model_gateway/src/rate_limit/error.rs`:
- Around line 16-24: The code currently treats
TERMINAL_REJECTION_RETRY_AFTER_SECS the same as transient 429
tenant_rate_limit_exceeded responses; change the branch that checks if
retry_after_secs == TERMINAL_REJECTION_RETRY_AFTER_SECS to build and return a
distinct error response (e.g., set a different HTTP status like 413 or a custom
4xx, and modify the body/message to indicate a terminal oversize rejection)
instead of returning the existing response; update the branch that currently
only omits the RETRY_AFTER header so it constructs the new response (including
an informative message referencing the tenant capacity limit) while leaving
other logic intact and keep symbol names TERMINAL_REJECTION_RETRY_AFTER_SECS,
RETRY_AFTER and tenant_rate_limit_exceeded to locate the change.

In `@model_gateway/src/rate_limit/local.rs`:
- Around line 23-69: LocalTokenRateLimiter currently stores every seen
tenant_key indefinitely in buckets: DashMap<String, Arc<Bucket>>, which causes
unbounded memory growth (see LocalTokenRateLimiter, buckets, Bucket and
BucketState.last_refill) and also freezes a policy snapshot at creation
(policy.clone()). Update the code and docs: add a doc comment on
LocalTokenRateLimiter warning that bucket count grows with distinct tenant_keys
and recommend pairing with RouterConfigBuilder::trust_tenant_header = false or
upstream tenant validation; implement a coarse periodic cleanup task (or a sweep
method invoked from a timer) that iterates buckets and removes entries whose
BucketState.last_refill is older than a configurable TTL (e.g. 10× refill
window) to bound cardinality; ensure cleanup uses DashMap removal APIs safely
and note that existing buckets hold cloned policy so runtime policy updates
won’t affect already-allocated buckets (document that behavior and consider
invalidating/replacing buckets on policy change).

---

Outside diff comments:
In `@model_gateway/src/routers/http/router.rs`:
- Around line 1252-1276: Add tenant token-rate enforcement to route_rerank the
same way other generation endpoints do: before calling route_typed_request
inside async fn route_rerank, call enforce_token_rate_limit(_tenant_meta, body)
(treating body as a GenerationRequest) and if it returns an Err response return
that response immediately; only proceed to route_typed_request and the existing
build_rerank_response on success. Reference route_rerank,
enforce_token_rate_limit, and GenerationRequest to locate and mirror the
rate-limit check pattern used in route_chat/route_completion/route_embeddings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bfd1b7cf-3733-404b-9d44-8889b9881474

📥 Commits

Reviewing files that changed from the base of the PR and between 2ef701f and bb99ad1.

📒 Files selected for processing (11)
  • model_gateway/src/app_context.rs
  • model_gateway/src/config/builder.rs
  • model_gateway/src/config/types.rs
  • model_gateway/src/lib.rs
  • model_gateway/src/main.rs
  • model_gateway/src/rate_limit/error.rs
  • model_gateway/src/rate_limit/local.rs
  • model_gateway/src/rate_limit/mod.rs
  • model_gateway/src/rate_limit/types.rs
  • model_gateway/src/routers/http/router.rs
  • model_gateway/src/service_discovery.rs


mergify Bot commented Apr 28, 2026

Hi @shenoyvvarun, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c504ad3962



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
model_gateway/src/routers/http/router.rs (1)

79-93: ⚠️ Potential issue | 🟡 Minor

Use a conservative fallback when tokenizer lookup fails.

Lines 85-92 still fall back to (chars / 4), which underestimates CJK/code-heavy prompts and lets tenants overshoot TPM whenever a tokenizer is missing or encode() errors. Please switch to a conservative upper bound instead; the same estimator was copied into model_gateway/src/routers/http/pd_router.rs, so both helpers should be fixed together.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/http/router.rs` around lines 79 - 93, The
token-count fallback in count_tokens currently underestimates tokens by using
(chars / 4); update the fallback to a conservative upper bound (e.g., use
text.chars().count() as u32 .max(1)) whenever tokenizer lookup
(tokenizer_registry.get) fails or encode() errors, and apply the same change to
the sibling helper in pd_router.rs so both use the conservative chars-based
estimator instead of dividing by 4.
model_gateway/src/main.rs (1)

1297-1305: ⚠️ Potential issue | 🟡 Minor

Also reject whitespace-only tenant keys.

Line 1298 only checks is_empty(), so a value like " :1000:10" still creates an override instead of failing startup. Use tenant_key.trim().is_empty() before calling builder.tenant_rate_limit(...).

Suggested fix
-                if tenant_key.is_empty() {
+                if tenant_key.trim().is_empty() {
                     return Err(ConfigError::ValidationFailed {
                         reason: format!(
                             "invalid --tenant-rate-limit '{spec}'; expected tenant_key:tpm:rpm"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/main.rs` around lines 1297 - 1305, The validation currently
only checks tenant_key.is_empty(), so whitespace-only keys slip through; update
the check inside the if-let (Some(tenant_key), Some(tpm), Some(rpm)) block to
reject keys where tenant_key.trim().is_empty(), and when calling
builder.tenant_rate_limit(tenant_key, tpm, rpm) use the trimmed key (e.g., let
key = tenant_key.trim();) so you both validate and pass a normalized non-empty
tenant key to tenant_rate_limit; keep the same ConfigError::ValidationFailed
path when the trimmed key is empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/routers/http/pd_router.rs`:
- Around line 104-116: When enforce_token_rate_limit returns an early 429/413,
emit the normal router metrics first: call
self.metrics.record_router_request(...) with the tenant_meta and model_id, then
build the rate-limited Response via
crate::rate_limit::rate_limit_exceeded_response(...) and call
self.metrics.record_router_upstream_response(...) (and the terminal
self.metrics.record_router_error(...) if your error path normally records
terminal errors) using the generated response's status so the router telemetry
sees throttled PD traffic; do this inside enforce_token_rate_limit (use the
existing count_tokens, limiter.check_and_consume, and the
rate_limit_exceeded_response result) before returning the Response.

In `@model_gateway/src/routers/http/router.rs`:
- Around line 96-108: enforce_token_rate_limit currently returns Some(Response)
for throttled requests and bypasses the existing metrics paths
(record_router_request, record_router_upstream_response, record_router_error)
used by route_typed_request and route_multipart_transcription; update
enforce_token_rate_limit so that when it generates a 429/413 it also records the
appropriate router metrics before returning (or alternatively return a typed
error/enum so the caller route_typed_request/route_multipart_transcription can
invoke the usual record_router_* hooks), referencing enforce_token_rate_limit,
route_typed_request, route_multipart_transcription, and the metric functions
record_router_request / record_router_upstream_response / record_router_error to
locate where to add the metric calls or change the return type.
- Around line 1234-1245: route_audio_transcriptions currently enforces token
rate limits using only TranscriptionRequest, letting a large AudioFile bypass
metering; update this by either skipping token-based admission for this endpoint
or by computing a conservative surrogate token count from the AudioFile (e.g.,
derive duration or size-to-token estimate) and feed that into the admission
check. Concretely, in route_audio_transcriptions (and before calling
enforce_token_rate_limit or in enforce_token_rate_limit itself) compute an
estimated token count from the AudioFile metadata (bytes or duration ->
conservative token estimate) and use that estimate together with
TranscriptionRequest when calling enforce_token_rate_limit, or remove token
admission and rely on other limits if you prefer skipping token-based admission
for this endpoint. Ensure you reference the existing symbols
route_audio_transcriptions, AudioFile, TranscriptionRequest, and
enforce_token_rate_limit (or adjust route_multipart_transcription call) so the
change is localized and clear.

---

Duplicate comments:
In `@model_gateway/src/main.rs`:
- Around line 1297-1305: The validation currently only checks
tenant_key.is_empty(), so whitespace-only keys slip through; update the check
inside the if-let (Some(tenant_key), Some(tpm), Some(rpm)) block to reject keys
where tenant_key.trim().is_empty(), and when calling
builder.tenant_rate_limit(tenant_key, tpm, rpm) use the trimmed key (e.g., let
key = tenant_key.trim();) so you both validate and pass a normalized non-empty
tenant key to tenant_rate_limit; keep the same ConfigError::ValidationFailed
path when the trimmed key is empty.

In `@model_gateway/src/routers/http/router.rs`:
- Around line 79-93: The token-count fallback in count_tokens currently
underestimates tokens by using (chars / 4); update the fallback to a
conservative upper bound (e.g., use text.chars().count() as u32 .max(1))
whenever tokenizer lookup (tokenizer_registry.get) fails or encode() errors, and
apply the same change to the sibling helper in pd_router.rs so both use the
conservative chars-based estimator instead of dividing by 4.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: da364dd7-03a6-45b4-ac39-96a7723f1911

📥 Commits

Reviewing files that changed from the base of the PR and between bb99ad1 and c504ad3.

📒 Files selected for processing (5)
  • model_gateway/src/main.rs
  • model_gateway/src/rate_limit/error.rs
  • model_gateway/src/rate_limit/local.rs
  • model_gateway/src/routers/http/pd_router.rs
  • model_gateway/src/routers/http/router.rs

@shenoyvvarun shenoyvvarun force-pushed the vasheno/input_token_limiting branch from c504ad3 to beab2ef on April 28, 2026 at 20:20

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (5)
model_gateway/src/routers/http/pd_router.rs (2)

104-116: ⚠️ Potential issue | 🟠 Major

Record router metrics before returning a throttled PD response.

execute_dual_dispatch() owns the normal router request/upstream/error metrics. Returning from enforce_token_rate_limit() skips that path entirely, so 429/413 tenant throttles disappear from PD dashboards and alerts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/http/pd_router.rs` around lines 104 - 116,
enforce_token_rate_limit currently returns a throttled PD Response early which
bypasses execute_dual_dispatch’s router metrics; modify enforce_token_rate_limit
(and/or its callers) to record the same router request/upstream/error metrics
for tenant throttles before returning a 429/413 response: call the existing
router-metrics recording utility used by execute_dual_dispatch (or invoke the
same metric increment paths) right after limiter.check_and_consume reports an
error and before mapping to crate::rate_limit::rate_limit_exceeded_response so
tenant throttles show up in PD dashboards/alerts.

87-101: ⚠️ Potential issue | 🟡 Minor

Use a conservative tokenizer fallback.

chars()/4 materially undercounts CJK, code, and JSON when the tokenizer is missing or encode() fails, so tenants can exceed their configured TPM by a large margin. Charge from a conservative bound in both fallback branches instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/http/pd_router.rs` around lines 87 - 101, In
count_tokens, the fallback estimate chars()/4 undercounts many languages; change
both fallback branches (the None from self.tokenizer_registry.get(model_id) and
the unwrap_or_else on encode()) to use a more conservative estimate (e.g.,
(text.chars().count() as u32 / 2).max(1)) instead of /4 so tenants are charged
conservatively; update the early return and the unwrap_or_else closure in the
count_tokens function accordingly while keeping the final .max(1).
model_gateway/src/routers/http/router.rs (3)

1234-1245: ⚠️ Potential issue | 🟠 Major

/v1/audio/transcriptions is still effectively unmetered.

This admission check only looks at TranscriptionRequest, but the expensive input lives in AudioFile. A large upload with an empty prompt is still charged as ~1 token, so tenant TPM can be bypassed on this endpoint. Either skip token admission here for now or charge from a conservative surrogate derived from the audio payload.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/http/router.rs` around lines 1234 - 1245, The
admission check in route_audio_transcriptions currently calls
enforce_token_rate_limit using only TranscriptionRequest, ignoring the expensive
AudioFile payload; update the admission to either (A) perform the rate/admission
check after inspecting the AudioFile (e.g., compute a conservative surrogate
token count from audio duration or byte length) and pass that surrogate into the
existing enforcement logic, or (B) skip token admission here and rely on a later
check inside route_multipart_transcription; specifically modify
route_audio_transcriptions to inspect the AudioFile (or defer) before/when
calling enforce_token_rate_limit so large uploads cannot bypass tenant TPM using
the audio payload.

79-93: ⚠️ Potential issue | 🟡 Minor

Use a conservative tokenizer fallback.

chars()/4 materially undercounts CJK, code, and JSON when the tokenizer is missing or encode() fails, so tenants can exceed their configured TPM by a large margin. Charge from a conservative bound in both fallback branches instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/http/router.rs` around lines 79 - 93, The current
fallback in count_tokens underestimates tokens by using text.chars().count()/4;
update both fallback branches in count_tokens (the tokenizer_registry.get(None)
branch and the encode() unwrap_or_else failure branch) to use a conservative
estimate such as text.chars().count().max(1) (i.e., assume ~1 token per
character) instead of dividing by 4; keep references to
extract_text_for_routing(), tokenizer_registry, encode(), and count_tokens() so
you change both places consistently to avoid undercharging tenants when
tokenization is missing or fails.

96-108: ⚠️ Potential issue | 🟠 Major

Emit router metrics for early rate-limit rejections.

This helper returns before route_typed_request() or route_multipart_transcription() records request/upstream/error metrics, so throttled 429/413s are invisible to router telemetry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/http/router.rs` around lines 96 - 108,
enforce_token_rate_limit currently returns early on rate-limit failures so those
429/413 rejections never hit router telemetry; update enforce_token_rate_limit
to record the same router metrics that route_typed_request() /
route_multipart_transcription() would for rejected requests before returning
(e.g., increment request count and error/upstream metrics with the appropriate
status code and reason), using the existing router metrics helper functions or
the same metric names used elsewhere, and then return the Response; reference
enforce_token_rate_limit, route_typed_request, and route_multipart_transcription
to mirror their metric emissions for early rejections.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@model_gateway/src/routers/http/pd_router.rs`:
- Around line 104-116: enforce_token_rate_limit currently returns a throttled PD
Response early which bypasses execute_dual_dispatch’s router metrics; modify
enforce_token_rate_limit (and/or its callers) to record the same router
request/upstream/error metrics for tenant throttles before returning a 429/413
response: call the existing router-metrics recording utility used by
execute_dual_dispatch (or invoke the same metric increment paths) right after
limiter.check_and_consume reports an error and before mapping to
crate::rate_limit::rate_limit_exceeded_response so tenant throttles show up in
PD dashboards/alerts.
- Around line 87-101: In count_tokens, the fallback estimate chars()/4
undercounts many languages; change both fallback branches (the None from
self.tokenizer_registry.get(model_id) and the unwrap_or_else on encode()) to use
a more conservative estimate (e.g., (text.chars().count() as u32 / 2).max(1))
instead of /4 so tenants are charged conservatively; update the early return and
the unwrap_or_else closure in the count_tokens function accordingly while
keeping the final .max(1).

In `@model_gateway/src/routers/http/router.rs`:
- Around line 1234-1245: The admission check in route_audio_transcriptions
currently calls enforce_token_rate_limit using only TranscriptionRequest,
ignoring the expensive AudioFile payload; update the admission to either (A)
perform the rate/admission check after inspecting the AudioFile (e.g., compute a
conservative surrogate token count from audio duration or byte length) and pass
that surrogate into the existing enforcement logic, or (B) skip token admission
here and rely on a later check inside route_multipart_transcription;
specifically modify route_audio_transcriptions to inspect the AudioFile (or
defer) before/when calling enforce_token_rate_limit so large uploads cannot
bypass tenant TPM using the audio payload.
- Around line 79-93: The current fallback in count_tokens underestimates tokens
by using text.chars().count()/4; update both fallback branches in count_tokens
(the tokenizer_registry.get(None) branch and the encode() unwrap_or_else failure
branch) to use a conservative estimate such as text.chars().count().max(1)
(i.e., assume ~1 token per character) instead of dividing by 4; keep references
to extract_text_for_routing(), tokenizer_registry, encode(), and count_tokens()
so you change both places consistently to avoid undercharging tenants when
tokenization is missing or fails.
- Around line 96-108: enforce_token_rate_limit currently returns early on
rate-limit failures so those 429/413 rejections never hit router telemetry;
update enforce_token_rate_limit to record the same router metrics that
route_typed_request() / route_multipart_transcription() would for rejected
requests before returning (e.g., increment request count and error/upstream
metrics with the appropriate status code and reason), using the existing router
metrics helper functions or the same metric names used elsewhere, and then
return the Response; reference enforce_token_rate_limit, route_typed_request,
and route_multipart_transcription to mirror their metric emissions for early
rejections.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 72e81bf0-1a24-4313-9091-b5ffd7544a92

📥 Commits

Reviewing files that changed from the base of the PR and between c504ad3 and beab2ef.

📒 Files selected for processing (5)
  • model_gateway/src/main.rs
  • model_gateway/src/rate_limit/error.rs
  • model_gateway/src/rate_limit/local.rs
  • model_gateway/src/routers/http/pd_router.rs
  • model_gateway/src/routers/http/router.rs


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: beab2effd1


@shenoyvvarun shenoyvvarun changed the title from "feat(rate_limiter) Adding tenant based rate limiting based on input tokens" to "feat(rate_limiter) Adding tenant based rate limiting based on input tokens for HTTP router" on Apr 29, 2026
@shenoyvvarun shenoyvvarun force-pushed the vasheno/input_token_limiting branch from beab2ef to 5c382a2 on May 1, 2026 at 19:50

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
model_gateway/src/routers/http/router.rs (2)

85-92: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a conservative fallback when the tokenizer is unavailable.

chars / 4 is still an under-estimate for CJK, code, JSON, and other dense inputs, so any model missing from tokenizer_registry can exceed its configured TPM by a wide margin while still being admitted. This path should bias high, not low.

Suggested fix
+        let conservative_fallback = || (text.chars().count() as u32).max(1);
         let Some(tokenizer) = self.tokenizer_registry.get(model_id) else {
-            return ((text.chars().count() as u32) / 4).max(1);
+            return conservative_fallback();
         };
 
         tokenizer
             .encode(&text, false)
             .map(|encoding| encoding.token_ids().len() as u32)
-            .unwrap_or_else(|_| ((text.chars().count() as u32) / 4).max(1))
+            .unwrap_or_else(|_| conservative_fallback())
             .max(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/http/router.rs` around lines 85 - 92, The current
fallback in the tokenizer lookup path (when tokenizer_registry.get(model_id)
returns None) under-estimates tokens by using ((text.chars().count() as u32) /
4).max(1); change this to a conservative, upper-biased estimate so models
without a tokenizer are throttled rather than under-counted — for example
replace the divide-by-4 heuristic with a higher bound such as
text.chars().count() as u32 (or text.len() as u32, or chars * 2) and use .max(1)
to avoid zero; update the fallback used in both the early return and the
unwrap_or_else branch that follows the encode call so
encode(...).map(...).unwrap_or_else(...) and the earlier return both use the
same conservative estimator.

96-108: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Record a dedicated metric for admission-control rejections.

This helper short-circuits every throttled 429/413, but it does not emit any replacement metric. That leaves tenant throttling invisible in dashboards and alerts even though skipping the existing worker-dispatch metrics is correct here. Please increment a dedicated throttle/admission counter before returning the rate-limit response.

Based on learnings, throttled HTTP requests should emit a dedicated admission-control metric rather than the existing worker-dispatch metrics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/http/router.rs` around lines 96 - 108, The
enforce_token_rate_limit function currently returns a rate-limit Response when
limiter.check_and_consume(...) errors, but does not record a dedicated
admission-control metric; modify enforce_token_rate_limit so that immediately
after detecting the error from limiter.check_and_consume (the err() branch) you
increment the dedicated throttle/admission counter (e.g., call your metrics API
such as self.metrics.increment_admission_throttle(...) or
crate::metrics::ADMISSION_THROTTLE_COUNTER.inc_with_labels(...) with
tenant_meta.tenant_key() and model_id as labels) and then return
crate::rate_limit::rate_limit_exceeded_response; ensure the metric increment
happens before mapping/returning the Response in enforce_token_rate_limit.
model_gateway/src/routers/http/pd_router.rs (1)

104-116: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Emit the dedicated throttled-request metric on limiter rejects.

This early-exit path now returns the limiter response without incrementing any admission-control counter, so tenant throttling disappears from router observability. Please record the dedicated throttle metric here before returning the 429/413 response.

📈 Suggested change
         limiter
             .check_and_consume(tenant_meta.tenant_key().as_str(), estimated_tokens)
             .err()
-            .map(crate::rate_limit::rate_limit_exceeded_response)
+            .map(|retry_after| {
+                Metrics::record_request_throttled(
+                    metrics_labels::ROUTER_HTTP,
+                    metrics_labels::BACKEND_PD,
+                    model_id,
+                    tenant_meta.tenant_key().as_str(),
+                );
+                crate::rate_limit::rate_limit_exceeded_response(retry_after)
+            })

Based on learnings, throttled HTTP-router early exits in this repo should emit a dedicated admission-control metric instead of worker-dispatch metrics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/http/pd_router.rs` around lines 104 - 116, The
early-exit in enforce_token_rate_limit returns a limiter rejection response
without emitting the admission-control "throttled request" metric; before
mapping the limiter error to crate::rate_limit::rate_limit_exceeded_response,
increment the dedicated router admission-control throttle metric (e.g.
self.metrics.admission_control.throttled_requests or call a helper like
self.emit_throttled_admission_metric) so tenant throttles are recorded when
app_token_rate_limiter (and its check_and_consume path) rejects a request.
Ensure the metric increment occurs only on the .err() branch and preserve the
existing return value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/rate_limit/local.rs`:
- Around line 170-181: The current cleanup collects stale_keys based on a
snapshot but removes them later without re-checking, allowing races where
check_and_consume() refreshes a bucket and the cleaner still deletes it; change
the loop to validate staleness under the same lock/decision point before
removing: iterate over self.buckets (like the existing iterator), acquire the
bucket's state lock, call stale_bucket_ttl(&entry.value().policy) and check
now.duration_since(state.last_refill) > ttl while holding the lock, and only
then call self.buckets.remove(entry.key()) (or remove via a deterministic path)
so removal only happens if still stale; ensure you reference the same symbols
(self.buckets, entry.value().state.lock(), stale_bucket_ttl, last_refill,
check_and_consume) and do not rely on the earlier collected stale_keys snapshot.

In `@model_gateway/src/routers/http/pd_router.rs`:
- Around line 87-102: count_tokens currently uses
body.extract_text_for_routing(), which undercounts requests that provide
pre-tokenized input (e.g., GenerationRequest with input_ids); change it to
prioritize request-native inputs: if the GenerationRequest exposes
input_ids/inputs/numeric token arrays use their length as the token count,
otherwise fall back to using the tokenizer via tokenizer_registry.get(model_id)
on the actual text (or the char-heuristic fallback). Update the logic inside
count_tokens (and any branches that call extract_text_for_routing) so it checks
for and counts body.input_ids (or equivalent accessor on GenerationRequest)
first, then uses tokenizer.encode(&text, ...) only when no native token array is
present, preserving the existing fallback to char-based estimate and min 1.
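
A sketch of that ordering under simplified assumptions: `GenerationRequest`, its `input_ids` field, and the `Tokenizer` trait below are illustrative stand-ins, not the crate's real request or tokenizer-registry types:

// Hypothetical, simplified request and tokenizer shapes for illustration only.
struct GenerationRequest {
    input_ids: Option<Vec<u32>>,
    text: String,
}

trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
}

fn count_tokens(body: &GenerationRequest, tokenizer: Option<&dyn Tokenizer>) -> u32 {
    // 1. Prefer request-native token arrays: they are exact and free to count.
    if let Some(ids) = &body.input_ids {
        return (ids.len() as u32).max(1);
    }
    // 2. Otherwise run the registered tokenizer over the actual text.
    if let Some(tok) = tokenizer {
        return (tok.encode(&body.text).len() as u32).max(1);
    }
    // 3. Last resort: keep the char-based heuristic, never below 1.
    ((body.text.chars().count() as u32) / 4).max(1)
}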

---

Duplicate comments:
In `@model_gateway/src/routers/http/pd_router.rs`:
- Around line 104-116: The early-exit in enforce_token_rate_limit returns a
limiter rejection response without emitting the admission-control "throttled
request" metric; before mapping the limiter error to
crate::rate_limit::rate_limit_exceeded_response, increment the dedicated router
admission-control throttle metric (e.g.
self.metrics.admission_control.throttled_requests or call a helper like
self.emit_throttled_admission_metric) so tenant throttles are recorded when
app_token_rate_limiter (and its check_and_consume path) rejects a request.
Ensure the metric increment occurs only on the .err() branch and preserve the
existing return value.

In `@model_gateway/src/routers/http/router.rs`:
- Around line 85-92: The current fallback in the tokenizer lookup path (when
tokenizer_registry.get(model_id) returns None) under-estimates tokens by using
((text.chars().count() as u32) / 4).max(1); change this to a conservative,
upper-biased estimate so models without a tokenizer are throttled rather than
under-counted — for example replace the divide-by-4 heuristic with a higher
bound such as text.chars().count() as u32 (or text.len() as u32, or chars * 2)
and use .max(1) to avoid zero; update the fallback used in both the early return
and the unwrap_or_else branch that follows the encode call so
encode(...).map(...).unwrap_or_else(...) and the earlier return both use the
same conservative estimator.
- Around line 96-108: The enforce_token_rate_limit function currently returns a
rate-limit Response when limiter.check_and_consume(...) errors, but does not
record a dedicated admission-control metric; modify enforce_token_rate_limit so
that immediately after detecting the error from limiter.check_and_consume (the
err() branch) you increment the dedicated throttle/admission counter (e.g., call
your metrics API such as self.metrics.increment_admission_throttle(...) or
crate::metrics::ADMISSION_THROTTLE_COUNTER.inc_with_labels(...) with
tenant_meta.tenant_key() and model_id as labels) and then return
crate::rate_limit::rate_limit_exceeded_response; ensure the metric increment
happens before mapping/returning the Response in enforce_token_rate_limit.
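
Taken together, the two router.rs comments suggest a shape like the one below. This is a hedged sketch: the limiter trait, metrics helper, and response constructor are invented stand-ins for the crate's real API, shown only to illustrate the upper-biased fallback estimate and the metric-before-return ordering on the .err() branch:

use std::time::Duration;

// Hypothetical stand-ins; the real limiter, metrics, and response types live
// in the model_gateway crate and are not reproduced here.
struct Response;

trait TokenLimiter {
    fn check_and_consume(&self, tenant_key: &str, tokens: u32) -> Result<(), Duration>;
}

fn record_request_throttled(router: &str, model_id: &str, tenant_key: &str) {
    // Placeholder for the dedicated admission-control throttle counter.
    let _ = (router, model_id, tenant_key);
}

fn rate_limit_exceeded_response(_retry_after: Duration) -> Response {
    Response
}

// Upper-biased fallback when no tokenizer is registered for the model:
// one "token" per character deliberately over-counts, so unknown models are
// throttled early instead of admitted on the optimistic chars/4 estimate.
fn conservative_token_estimate(text: &str) -> u32 {
    (text.chars().count() as u32).max(1)
}

fn enforce_token_rate_limit(
    limiter: &impl TokenLimiter,
    tenant_key: &str,
    model_id: &str,
    estimated_tokens: u32,
) -> Option<Response> {
    limiter
        .check_and_consume(tenant_key, estimated_tokens)
        .err()
        .map(|retry_after| {
            // Increment the throttle metric before building the 429/413
            // response so every limiter rejection is counted exactly once.
            record_request_throttled("router_http", model_id, tenant_key);
            rate_limit_exceeded_response(retry_after)
        })
}

In the actual router these would map onto the same router/backend/model/tenant labels used in the pd_router suggestion above.
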
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c6a7f03c-afac-434d-a977-2d6bbbf04e85

📥 Commits

Reviewing files that changed from the base of the PR and between beab2ef and 5c382a2.

📒 Files selected for processing (5)
  • model_gateway/src/main.rs
  • model_gateway/src/rate_limit/error.rs
  • model_gateway/src/rate_limit/local.rs
  • model_gateway/src/routers/http/pd_router.rs
  • model_gateway/src/routers/http/router.rs

Comment thread model_gateway/src/rate_limit/local.rs
Comment thread model_gateway/src/routers/http/pd_router.rs Outdated
Comment thread model_gateway/src/rate_limit/types.rs
Comment thread model_gateway/src/routers/http/router.rs Outdated
Comment thread model_gateway/src/config/types.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6f2cfe1927

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread model_gateway/src/main.rs
@shenoyvvarun shenoyvvarun changed the title from "feat(rate_limiter) Adding tenant based rate limiting based on input tokens for HTTP router" to "feat Adding tenant based rate limiting based on input tokens for HTTP router" on May 6, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c25005e02


Comment thread model_gateway/src/routers/common/token_count.rs
@shenoyvvarun shenoyvvarun force-pushed the vasheno/input_token_limiting branch from de410bc to d640cb6 on May 6, 2026 19:05
@mergify
Contributor

mergify Bot commented May 6, 2026

Hi @shenoyvvarun, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

@mergify mergify Bot added the needs-rebase PR has merge conflicts that need to be resolved label May 6, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d640cb61f6


Comment thread model_gateway/src/routers/http/router.rs
@shenoyvvarun shenoyvvarun force-pushed the vasheno/input_token_limiting branch from d640cb6 to 7db0912 on May 6, 2026 19:32
@mergify mergify Bot removed the needs-rebase PR has merge conflicts that need to be resolved label May 6, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7db09123fd


Comment thread model_gateway/src/rate_limit/local.rs
@shenoyvvarun shenoyvvarun requested a review from slin1237 May 14, 2026 09:52

Labels

enhancement (New feature or request), model-gateway (Model gateway crate changes), priority:high (High priority)


2 participants