feat: Add tenant-based rate limiting based on input tokens for HTTP router #1397
shenoyvvarun wants to merge 1 commit into
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings, or use the review commands to manage reviews.
📝 Walkthrough
Adds a multi-tenant rate-limiting subsystem: a new rate-limit crate module (types, local in-memory limiter, error helper), config and CLI hooks, AppContext wiring, tokenizer-aware token estimation, and per-tenant enforcement in the HTTP/PD routers with early rate-limit responses.
Sequence Diagram
sequenceDiagram
participant Client
participant Router
participant TokenizerRegistry
participant RateLimiter as LocalTokenRateLimiter
participant Bucket as TenantBucket
participant Upstream
participant ClientResponse as Response
Client->>Router: HTTP request (tenant_key, payload)
Router->>TokenizerRegistry: estimate_tokens(payload)
Router->>RateLimiter: check_and_consume(tenant_key, estimated_tokens)
RateLimiter->>Bucket: lookup/create tenant bucket
Bucket->>Bucket: refill tokens/requests based on elapsed time
Bucket->>Bucket: evaluate admission (tokens & requests)
alt admitted
Bucket->>RateLimiter: Ok
Router->>Upstream: proxy request to handler/service
Upstream->>Router: 200/response
Router->>ClientResponse: return 200
ClientResponse->>Client: 200 OK
else rejected
Bucket->>RateLimiter: Err(retry_after_secs)
RateLimiter->>Router: Err(retry_after_secs)
Router->>ClientResponse: 429 (or terminal 413) with optional Retry-After
ClientResponse->>Client: 429/413 Too Many Requests / Payload Too Large
end
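The admission math in the diagram (time-proportional refill, retry-after derived from the token debt) can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: the `TenantBucket` type, its fields, and the debt/refill formula are assumptions inferred from the diagram and the review comments in this thread.

```rust
// Illustrative sketch only: approximates the per-tenant admission flow
// shown in the sequence diagram. Field and method names are assumptions.
pub struct TenantBucket {
    tokens_per_minute: u64,
    available: f64,
}

impl TenantBucket {
    pub fn new(tokens_per_minute: u64) -> Self {
        Self { tokens_per_minute, available: tokens_per_minute as f64 }
    }

    /// Refill proportionally to elapsed time, capped at one minute's budget.
    pub fn refill(&mut self, elapsed_secs: f64) {
        let per_sec = self.tokens_per_minute as f64 / 60.0;
        self.available =
            (self.available + per_sec * elapsed_secs).min(self.tokens_per_minute as f64);
    }

    /// Admit and debit the estimate, or report roughly how many seconds
    /// of refill are needed before the request could fit.
    pub fn check_and_consume(&mut self, estimated_tokens: u64) -> Result<(), f64> {
        let need = estimated_tokens as f64;
        if self.available >= need {
            self.available -= need;
            Ok(())
        } else {
            let debt = need - self.available;
            Err(debt / (self.tokens_per_minute as f64 / 60.0))
        }
    }
}
```

At 600 TPM the refill rate is 10 tokens/s, so a 200-token request against a bucket holding 100 tokens yields a retry-after of about 10 seconds.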
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Hi @shenoyvvarun, the DCO sign-off check has failed. All commits must include a Signed-off-by line. To fix existing commits:
# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease
To sign off future commits automatically, use `git commit -s`.
Force-pushed from 2ef701f to 619e73e
Code Review
This pull request introduces a multi-tenant rate limiting system that enforces per-tenant limits on requests and tokens per minute. It includes a new LocalTokenRateLimiter integrated into the HTTP router, which utilizes model-specific tokenizers for usage estimation. The review feedback identifies a bug in the policy fallback logic where default limits were bypassed, suggests optimizing DashMap access to reduce lock contention, and recommends deferring tokenization until it is verified that token-based limits are applicable to the tenant.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ef701f02b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
model_gateway/src/app_context.rs (1)
514-534: 🧹 Nitpick | 🔵 Trivial. Method name no longer reflects what it does.
`maybe_rate_limiter` now configures two unrelated limiters (`TokenBucket` for `max_concurrent_requests` and `LocalTokenRateLimiter` for multi-tenant token budgets). Since these have independent enable conditions and lifecycles, consider splitting into `maybe_rate_limiter` and `maybe_token_rate_limiter` (chained from `from_config`) so the call site self-documents which feature each branch enables. Not a correctness issue, just keeps the builder steps cohesive as more limiter variants land in follow-up PRs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/app_context.rs` around lines 514 - 534, The method maybe_rate_limiter currently configures both the concurrent-request TokenBucket and the multi-tenant LocalTokenRateLimiter, which mixes independent features; split this into two builder steps by leaving maybe_rate_limiter to only handle max_concurrent_requests/TokenBucket and add a new maybe_token_rate_limiter (or maybe_multi_tenant_rate_limiter) that sets token_rate_limiter from RouterConfig.multi_tenant_rate_limit using LocalTokenRateLimiter; update the call site (from_config) to call .maybe_rate_limiter(...) then .maybe_token_rate_limiter(...) so each limiter (TokenBucket, LocalTokenRateLimiter) has its own enable condition and lifecycle and the code is self-documenting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/config/builder.rs`:
- Around line 602-616: The tenant_rate_limit method silently overwrites
duplicate keys when calling self.config.multi_tenant_rate_limit.tenants.insert;
change it to detect an existing entry (use HashMap::entry or contains_key/get)
and emit a tracing::warn! that includes the tenant_key and both the previous and
new policy before overwriting, or alternatively return a ConfigError from
build_with_validation for duplicates; ensure you clone/format the tenant_key (or
check using a borrowed key) so the log/error can include the key and policy
details.
- Around line 583-616: Add a unit test that round-trips the new builder methods
through RouterConfig::build and asserts the fields are set correctly: create a
test (e.g., test_builder_multi_tenant_rate_limit_round_trip) that uses
RouterConfigBuilder::new(), calls multi_tenant_rate_limit_enabled(true),
default_tokens_per_minute(...), default_requests_per_minute(...), and
tenant_rate_limit(...) for two tenants, then .build().unwrap() and assert
config.multi_tenant_rate_limit.enabled, default_tokens_per_minute,
default_requests_per_minute, the specific tenant entries (tokens_per_minute and
requests_per_minute) for both tenants, and that tenants.len() == 2 to ensure no
regressions in the builder methods (multi_tenant_rate_limit_enabled,
default_tokens_per_minute, default_requests_per_minute, tenant_rate_limit).
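The duplicate-key guard suggested for `tenant_rate_limit` above can be sketched roughly as below. A plain `HashMap` and `eprintln!` stand in for the builder's map and `tracing::warn!`; the `TenantTokenPolicy` field names mirror the review text but are assumptions about the real types.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub struct TenantTokenPolicy {
    pub tokens_per_minute: u64,
    pub requests_per_minute: u64,
}

/// Insert a tenant policy, warning (rather than silently overwriting)
/// when the key was already configured.
pub fn insert_tenant_policy(
    tenants: &mut HashMap<String, TenantTokenPolicy>,
    tenant_key: &str,
    policy: TenantTokenPolicy,
) {
    if let Some(previous) = tenants.insert(tenant_key.to_string(), policy.clone()) {
        // In the real builder this would be tracing::warn!(...).
        eprintln!(
            "duplicate tenant_rate_limit for '{tenant_key}': {previous:?} replaced by {policy:?}"
        );
    }
}
```

The alternative the review mentions, failing `build_with_validation` on duplicates, would replace the log with a returned `ConfigError`.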
In `@model_gateway/src/rate_limit/local.rs`:
- Around line 81-100: Before computing debt and retry_after, check for the
terminal condition where estimated_tokens (or need_tokens) exceeds the bucket's
full capacity (bucket.policy.tokens_per_minute > 0 and estimated_tokens as f64 >
bucket.policy.tokens_per_minute as f64); if true, return a
non-retryable/terminal rejection instead of computing token_retry so callers
don't repeatedly retry a request that will never fit. Modify the logic around
token_denied/token_retry in local.rs (the branch that computes
debt/(tokens_per_minute as f64 / 60.0)) to perform this capacity check first and
use the existing API for a terminal rejection rather than a finite retry_after.
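The capacity-check-first logic described above reduces to a small pure function. The `RateLimitDecision` enum below is hypothetical, for illustration only; the comment says the real code should reuse an existing terminal-rejection API rather than invent a new type.

```rust
/// Hypothetical decision type, for illustration only.
#[derive(Debug, PartialEq)]
pub enum RateLimitDecision {
    Admitted,
    RetryAfterSecs(f64),
    /// The request exceeds the bucket's full capacity and can never fit.
    Terminal,
}

pub fn evaluate(tokens_per_minute: u64, available: f64, estimated_tokens: u64) -> RateLimitDecision {
    let need = estimated_tokens as f64;
    // Terminal condition first: no amount of refilling admits this request,
    // so callers should not retry it.
    if tokens_per_minute > 0 && need > tokens_per_minute as f64 {
        return RateLimitDecision::Terminal;
    }
    if available >= need {
        RateLimitDecision::Admitted
    } else {
        let debt = need - available;
        RateLimitDecision::RetryAfterSecs(debt / (tokens_per_minute as f64 / 60.0))
    }
}
```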
In `@model_gateway/src/rate_limit/types.rs`:
- Around line 28-38: The fallback currently returns the zeroed singleton
DEFAULT_POLICY from the method that looks up tenant policy (the
tenants.get(...).or_else(...) block), which ignores configured defaults
(default_tokens_per_minute/default_requests_per_minute) and causes unlimited
behavior; change the fallback to construct and return a TenantTokenPolicy
populated from self.default_tokens_per_minute and
self.default_requests_per_minute instead of &DEFAULT_POLICY (i.e., produce an
owned TenantTokenPolicy with the configured values), and then remove the extra
.cloned() at the caller in LocalTokenRateLimiter::check_and_consume since the
method will already return an owned policy.
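The suggested fallback amounts to returning an owned policy built from the configured defaults instead of a zeroed singleton. A minimal sketch, with field names assumed from the review text:

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub struct TenantTokenPolicy {
    pub tokens_per_minute: u64,
    pub requests_per_minute: u64,
}

pub struct MultiTenantRateLimitConfig {
    pub tenants: HashMap<String, TenantTokenPolicy>,
    pub default_tokens_per_minute: u64,
    pub default_requests_per_minute: u64,
}

impl MultiTenantRateLimitConfig {
    /// Per-tenant policy, falling back to the configured defaults rather
    /// than a zeroed (effectively unlimited) DEFAULT_POLICY.
    pub fn policy_for(&self, tenant_key: &str) -> TenantTokenPolicy {
        self.tenants.get(tenant_key).cloned().unwrap_or(TenantTokenPolicy {
            tokens_per_minute: self.default_tokens_per_minute,
            requests_per_minute: self.default_requests_per_minute,
        })
    }
}
```

Returning an owned value also removes the extra `.cloned()` at the `check_and_consume` call site the review mentions.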
In `@model_gateway/src/routers/http/router.rs`:
- Around line 96-108: enforce_token_rate_limit currently commits token usage via
limiter.check_and_consume(...) which can overcharge on retries or pre-flight
failures; add metrics to make this visible: after a successful check_and_consume
call inside enforce_token_rate_limit increment an "tokens_admitted" counter (or
a scoped meter on app_token_rate_limiter), and then update the error paths that
can occur after admission (e.g., in route_typed_request_once,
select_worker_for_model failure paths, and
RetryExecutor::execute_response_with_retry when upstream returns 5xx or client
disconnects) to increment an "admitted_but_failed" counter so operators can
monitor overcharged requests; use the existing limiter or global metrics
facility and refer to the functions enforce_token_rate_limit, check_and_consume,
rate_limit_exceeded_response, route_typed_request_once, select_worker_for_model,
and RetryExecutor::execute_response_with_retry to locate where to increment each
metric.
- Around line 79-94: The count_tokens function currently falls back to
(text.chars().count() / 4) when tokenizer lookup or encode fails, which
underestimates tokens for CJK and code; update the fallback in router.rs (inside
count_tokens, the branch handling self.tokenizer_registry.get(model_id) being
None and the encode unwrap_or_else case) to use a conservative estimator such as
text.chars().count().max(1) or bytes()/3 (e.g., text.len() / 3). Keep the final
.max(1) to avoid zero, and apply the same conservative fallback to both the
missing-tokenizer path and the encode-error path; also add a short doc comment
on the public limiter type mentioning the estimator is conservative and that
configured TPM is not a hard ceiling when tokenizer is missing.
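The conservative-estimator suggestion boils down to charging roughly one token per character whenever real tokenization is unavailable, so fallbacks overcharge rather than undercharge. A sketch under that assumption (`encoded_len` stands in for a successful `encode()` result; the 1 token/char bound is the review's recommendation, not the PR's current behavior):

```rust
/// Conservative fallback: assume ~1 token per character so a missing
/// tokenizer or a failed encode() overcharges rather than undercharges
/// (chars/4 badly underestimates CJK text and code).
pub fn fallback_token_estimate(text: &str) -> usize {
    text.chars().count().max(1)
}

/// `encoded_len` stands in for the tokenizer's successful encode() length.
pub fn count_tokens(encoded_len: Option<usize>, text: &str) -> usize {
    match encoded_len {
        Some(n) => n.max(1),
        None => fallback_token_estimate(text),
    }
}
```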
---
Outside diff comments:
In `@model_gateway/src/app_context.rs`:
- Around line 514-534: The method maybe_rate_limiter currently configures both
the concurrent-request TokenBucket and the multi-tenant LocalTokenRateLimiter,
which mixes independent features; split this into two builder steps by leaving
maybe_rate_limiter to only handle max_concurrent_requests/TokenBucket and add a
new maybe_token_rate_limiter (or maybe_multi_tenant_rate_limiter) that sets
token_rate_limiter from RouterConfig.multi_tenant_rate_limit using
LocalTokenRateLimiter; update the call site (from_config) to call
.maybe_rate_limiter(...) then .maybe_token_rate_limiter(...) so each limiter
(TokenBucket, LocalTokenRateLimiter) has its own enable condition and lifecycle
and the code is self-documenting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2f8600a7-8717-4fb3-a379-34b3c64a181e
📒 Files selected for processing (11)
model_gateway/src/app_context.rs, model_gateway/src/config/builder.rs, model_gateway/src/config/types.rs, model_gateway/src/lib.rs, model_gateway/src/main.rs, model_gateway/src/rate_limit/error.rs, model_gateway/src/rate_limit/local.rs, model_gateway/src/rate_limit/mod.rs, model_gateway/src/rate_limit/types.rs, model_gateway/src/routers/http/router.rs, model_gateway/src/service_discovery.rs
Force-pushed from 619e73e to bb99ad1
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb99ad1fdb
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
model_gateway/src/routers/http/router.rs (1)
1252-1276: ⚠️ Potential issue | 🟡 Minor. Apply rate limit to `route_rerank`.
`route_rerank` ignores `_tenant_meta` and bypasses tenant rate limiting, unlike `route_chat`, `route_completion`, `route_responses`, `route_embeddings`, and `route_classify`. Since `RerankRequest` implements `GenerationRequest`, the same `enforce_token_rate_limit` check can be applied. Rerank payloads contain a query plus candidate documents that consume meaningful input tokens, so this endpoint should respect tenant TPM quotas.

♻️ Apply rate limit to rerank

```diff
 async fn route_rerank(
     &self,
     headers: Option<&HeaderMap>,
-    _tenant_meta: &TenantRequestMeta,
+    tenant_meta: &TenantRequestMeta,
     body: &RerankRequest,
     model_id: &str,
 ) -> Response {
+    if let Some(response) = self.enforce_token_rate_limit(tenant_meta, body, model_id) {
+        return response;
+    }
     let response = self
         .route_typed_request(headers, body, "/v1/rerank", model_id)
         .await;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/http/router.rs` around lines 1252 - 1276, Add tenant token-rate enforcement to route_rerank the same way other generation endpoints do: before calling route_typed_request inside async fn route_rerank, call enforce_token_rate_limit(_tenant_meta, body) (treating body as a GenerationRequest) and if it returns an Err response return that response immediately; only proceed to route_typed_request and the existing build_rerank_response on success. Reference route_rerank, enforce_token_rate_limit, and GenerationRequest to locate and mirror the rate-limit check pattern used in route_chat/route_completion/route_embeddings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/main.rs`:
- Around line 1292-1305: The parsing of self.tenant_rate_limits using rsplitn(3,
':') currently allows an empty tenant_key (e.g. ":1000:10") which silently
installs a global override; update the validation in the loop that builds
tenant_rate_limit so that after extracting tenant_key, tpm and rpm you also
reject empty or all-whitespace tenant keys (use tenant_key.trim().is_empty() or
tenant_key.is_empty()) and return ConfigError::ValidationFailed with a clear
message when the tenant_key is empty instead of calling
builder.tenant_rate_limit; keep existing checks for tpm/rpm and only call
builder.tenant_rate_limit(tenant_key, tpm, rpm) when tenant_key is non-empty.
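The validation described above might look like the following, assuming the `tenant_key:tpm:rpm` shape and the `rsplitn(3, ':')` parsing the comment mentions (so colons may appear inside the key). The function name and error shape are illustrative, not the PR's actual API.

```rust
/// Parse a "tenant_key:tpm:rpm" spec, rejecting empty or whitespace-only
/// tenant keys so they cannot silently install a global override.
pub fn parse_tenant_rate_limit(spec: &str) -> Result<(String, u64, u64), String> {
    // rsplitn yields fields from the right: rpm, tpm, then the key
    // (which may itself contain ':').
    let mut parts = spec.rsplitn(3, ':');
    let rpm = parts.next().and_then(|s| s.trim().parse::<u64>().ok());
    let tpm = parts.next().and_then(|s| s.trim().parse::<u64>().ok());
    let key = parts.next().map(str::trim);
    match (key, tpm, rpm) {
        (Some(key), Some(tpm), Some(rpm)) if !key.is_empty() => {
            Ok((key.to_string(), tpm, rpm))
        }
        _ => Err(format!(
            "invalid --tenant-rate-limit '{spec}'; expected tenant_key:tpm:rpm"
        )),
    }
}
```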
In `@model_gateway/src/rate_limit/error.rs`:
- Around line 16-24: The code currently treats
TERMINAL_REJECTION_RETRY_AFTER_SECS the same as transient 429
tenant_rate_limit_exceeded responses; change the branch that checks if
retry_after_secs == TERMINAL_REJECTION_RETRY_AFTER_SECS to build and return a
distinct error response (e.g., set a different HTTP status like 413 or a custom
4xx, and modify the body/message to indicate a terminal oversize rejection)
instead of returning the existing response; update the branch that currently
only omits the RETRY_AFTER header so it constructs the new response (including
an informative message referencing the tenant capacity limit) while leaving
other logic intact and keep symbol names TERMINAL_REJECTION_RETRY_AFTER_SECS,
RETRY_AFTER and tenant_rate_limit_exceeded to locate the change.
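The status split being asked for could be sketched as below. The sentinel value for `TERMINAL_REJECTION_RETRY_AFTER_SECS` is an assumption (the real constant lives in `rate_limit/error.rs`), and 413 is the status the review suggests for terminal oversize rejections.

```rust
/// Assumed sentinel value; the real constant is defined in the PR.
pub const TERMINAL_REJECTION_RETRY_AFTER_SECS: u64 = u64::MAX;

/// Map a rejection to (HTTP status, optional Retry-After seconds).
pub fn rejection_status(retry_after_secs: u64) -> (u16, Option<u64>) {
    if retry_after_secs == TERMINAL_REJECTION_RETRY_AFTER_SECS {
        // Terminal: the request exceeds tenant capacity and will never fit,
        // so return 413 with no Retry-After instead of an endless 429 loop.
        (413, None)
    } else {
        (429, Some(retry_after_secs))
    }
}
```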
In `@model_gateway/src/rate_limit/local.rs`:
- Around line 23-69: LocalTokenRateLimiter currently stores every seen
tenant_key indefinitely in buckets: DashMap<String, Arc<Bucket>>, which causes
unbounded memory growth (see LocalTokenRateLimiter, buckets, Bucket and
BucketState.last_refill) and also freezes a policy snapshot at creation
(policy.clone()). Update the code and docs: add a doc comment on
LocalTokenRateLimiter warning that bucket count grows with distinct tenant_keys
and recommend pairing with RouterConfigBuilder::trust_tenant_header = false or
upstream tenant validation; implement a coarse periodic cleanup task (or a sweep
method invoked from a timer) that iterates buckets and removes entries whose
BucketState.last_refill is older than a configurable TTL (e.g. 10× refill
window) to bound cardinality; ensure cleanup uses DashMap removal APIs safely
and note that existing buckets hold cloned policy so runtime policy updates
won’t affect already-allocated buckets (document that behavior and consider
invalidating/replacing buckets on policy change).
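The bounded-cardinality sweep suggested above could look like this, with a plain `HashMap` and integer timestamps standing in for `DashMap` and `Instant` to keep the sketch dependency-free; names mirror the review text but are assumptions.

```rust
use std::collections::HashMap;

pub struct BucketState {
    /// Seconds since some epoch of the last refill; Instant in the real code.
    pub last_refill_secs: u64,
}

/// Remove buckets idle longer than `ttl_secs` (e.g. 10x the refill window)
/// to bound cardinality; returns how many entries were evicted.
pub fn sweep_stale_buckets(
    buckets: &mut HashMap<String, BucketState>,
    now_secs: u64,
    ttl_secs: u64,
) -> usize {
    let before = buckets.len();
    buckets.retain(|_, b| now_secs.saturating_sub(b.last_refill_secs) <= ttl_secs);
    before - buckets.len()
}
```

With `DashMap` the same shape works via its `retain` method, invoked from a periodic tokio task.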
---
Outside diff comments:
In `@model_gateway/src/routers/http/router.rs`:
- Around line 1252-1276: Add tenant token-rate enforcement to route_rerank the
same way other generation endpoints do: before calling route_typed_request
inside async fn route_rerank, call enforce_token_rate_limit(_tenant_meta, body)
(treating body as a GenerationRequest) and if it returns an Err response return
that response immediately; only proceed to route_typed_request and the existing
build_rerank_response on success. Reference route_rerank,
enforce_token_rate_limit, and GenerationRequest to locate and mirror the
rate-limit check pattern used in route_chat/route_completion/route_embeddings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: bfd1b7cf-3733-404b-9d44-8889b9881474
📒 Files selected for processing (11)
model_gateway/src/app_context.rs, model_gateway/src/config/builder.rs, model_gateway/src/config/types.rs, model_gateway/src/lib.rs, model_gateway/src/main.rs, model_gateway/src/rate_limit/error.rs, model_gateway/src/rate_limit/local.rs, model_gateway/src/rate_limit/mod.rs, model_gateway/src/rate_limit/types.rs, model_gateway/src/routers/http/router.rs, model_gateway/src/service_discovery.rs
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c504ad3962
Actionable comments posted: 3
♻️ Duplicate comments (2)
model_gateway/src/routers/http/router.rs (1)
79-93: ⚠️ Potential issue | 🟡 Minor. Use a conservative fallback when tokenizer lookup fails.
Lines 85-92 still fall back to `(chars / 4)`, which underestimates CJK/code-heavy prompts and lets tenants overshoot TPM whenever a tokenizer is missing or `encode()` errors. Please switch to a conservative upper bound instead; the same estimator was copied into `model_gateway/src/routers/http/pd_router.rs`, so both helpers should be fixed together.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/http/router.rs` around lines 79 - 93, The token-count fallback in count_tokens currently underestimates tokens by using (chars / 4); update the fallback to a conservative upper bound (e.g., use text.chars().count() as u32 .max(1)) whenever tokenizer lookup (tokenizer_registry.get) fails or encode() errors, and apply the same change to the sibling helper in pd_router.rs so both use the conservative chars-based estimator instead of dividing by 4.

model_gateway/src/main.rs (1)
1297-1305: ⚠️ Potential issue | 🟡 Minor. Also reject whitespace-only tenant keys.
Line 1298 only checks `is_empty()`, so a value like `" :1000:10"` still creates an override instead of failing startup. Use `tenant_key.trim().is_empty()` before calling `builder.tenant_rate_limit(...)`.

Suggested fix

```diff
-    if tenant_key.is_empty() {
+    if tenant_key.trim().is_empty() {
         return Err(ConfigError::ValidationFailed {
             reason: format!(
                 "invalid --tenant-rate-limit '{spec}'; expected tenant_key:tpm:rpm"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/main.rs` around lines 1297 - 1305, The validation currently only checks tenant_key.is_empty(), so whitespace-only keys slip through; update the check inside the if-let (Some(tenant_key), Some(tpm), Some(rpm)) block to reject keys where tenant_key.trim().is_empty(), and when calling builder.tenant_rate_limit(tenant_key, tpm, rpm) use the trimmed key (e.g., let key = tenant_key.trim();) so you both validate and pass a normalized non-empty tenant key to tenant_rate_limit; keep the same ConfigError::ValidationFailed path when the trimmed key is empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/routers/http/pd_router.rs`:
- Around line 104-116: When enforce_token_rate_limit returns an early 429/413,
emit the normal router metrics first: call
self.metrics.record_router_request(...) with the tenant_meta and model_id, then
build the rate-limited Response via
crate::rate_limit::rate_limit_exceeded_response(...) and call
self.metrics.record_router_upstream_response(...) (and the terminal
self.metrics.record_router_error(...) if your error path normally records
terminal errors) using the generated response's status so the router telemetry
sees throttled PD traffic; do this inside enforce_token_rate_limit (use the
existing count_tokens, limiter.check_and_consume, and the
rate_limit_exceeded_response result) before returning the Response.
In `@model_gateway/src/routers/http/router.rs`:
- Around line 96-108: enforce_token_rate_limit currently returns Some(Response)
for throttled requests and bypasses the existing metrics paths
(record_router_request, record_router_upstream_response, record_router_error)
used by route_typed_request and route_multipart_transcription; update
enforce_token_rate_limit so that when it generates a 429/413 it also records the
appropriate router metrics before returning (or alternatively return a typed
error/enum so the caller route_typed_request/route_multipart_transcription can
invoke the usual record_router_* hooks), referencing enforce_token_rate_limit,
route_typed_request, route_multipart_transcription, and the metric functions
record_router_request / record_router_upstream_response / record_router_error to
locate where to add the metric calls or change the return type.
- Around line 1234-1245: route_audio_transcriptions currently enforces token
rate limits using only TranscriptionRequest, letting a large AudioFile bypass
metering; update this by either skipping token-based admission for this endpoint
or by computing a conservative surrogate token count from the AudioFile (e.g.,
derive duration or size-to-token estimate) and feed that into the admission
check. Concretely, in route_audio_transcriptions (and before calling
enforce_token_rate_limit or in enforce_token_rate_limit itself) compute an
estimated token count from the AudioFile metadata (bytes or duration ->
conservative token estimate) and use that estimate together with
TranscriptionRequest when calling enforce_token_rate_limit, or remove token
admission and rely on other limits if you prefer skipping token-based admission
for this endpoint. Ensure you reference the existing symbols
route_audio_transcriptions, AudioFile, TranscriptionRequest, and
enforce_token_rate_limit (or adjust route_multipart_transcription call) so the
change is localized and clear.
---
Duplicate comments:
In `@model_gateway/src/main.rs`:
- Around line 1297-1305: The validation currently only checks
tenant_key.is_empty(), so whitespace-only keys slip through; update the check
inside the if-let (Some(tenant_key), Some(tpm), Some(rpm)) block to reject keys
where tenant_key.trim().is_empty(), and when calling
builder.tenant_rate_limit(tenant_key, tpm, rpm) use the trimmed key (e.g., let
key = tenant_key.trim();) so you both validate and pass a normalized non-empty
tenant key to tenant_rate_limit; keep the same ConfigError::ValidationFailed
path when the trimmed key is empty.
In `@model_gateway/src/routers/http/router.rs`:
- Around line 79-93: The token-count fallback in count_tokens currently
underestimates tokens by using (chars / 4); update the fallback to a
conservative upper bound (e.g., use text.chars().count() as u32 .max(1))
whenever tokenizer lookup (tokenizer_registry.get) fails or encode() errors, and
apply the same change to the sibling helper in pd_router.rs so both use the
conservative chars-based estimator instead of dividing by 4.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: da364dd7-03a6-45b4-ac39-96a7723f1911
📒 Files selected for processing (5)
model_gateway/src/main.rs, model_gateway/src/rate_limit/error.rs, model_gateway/src/rate_limit/local.rs, model_gateway/src/routers/http/pd_router.rs, model_gateway/src/routers/http/router.rs
Force-pushed from c504ad3 to beab2ef
♻️ Duplicate comments (5)
model_gateway/src/routers/http/pd_router.rs (2)
104-116: ⚠️ Potential issue | 🟠 Major. Record router metrics before returning a throttled PD response.
`execute_dual_dispatch()` owns the normal router request/upstream/error metrics. Returning from `enforce_token_rate_limit()` skips that path entirely, so 429/413 tenant throttles disappear from PD dashboards and alerts.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/http/pd_router.rs` around lines 104 - 116, enforce_token_rate_limit currently returns a throttled PD Response early which bypasses execute_dual_dispatch’s router metrics; modify enforce_token_rate_limit (and/or its callers) to record the same router request/upstream/error metrics for tenant throttles before returning a 429/413 response: call the existing router-metrics recording utility used by execute_dual_dispatch (or invoke the same metric increment paths) right after limiter.check_and_consume reports an error and before mapping to crate::rate_limit::rate_limit_exceeded_response so tenant throttles show up in PD dashboards/alerts.
87-101: ⚠️ Potential issue | 🟡 Minor. Use a conservative tokenizer fallback.
`chars()/4` materially undercounts CJK, code, and JSON when the tokenizer is missing or `encode()` fails, so tenants can exceed their configured TPM by a large margin. Charge from a conservative bound in both fallback branches instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/http/pd_router.rs` around lines 87 - 101, In count_tokens, the fallback estimate chars()/4 undercounts many languages; change both fallback branches (the None from self.tokenizer_registry.get(model_id) and the unwrap_or_else on encode()) to use a more conservative estimate (e.g., (text.chars().count() as u32 / 2).max(1)) instead of /4 so tenants are charged conservatively; update the early return and the unwrap_or_else closure in the count_tokens function accordingly while keeping the final .max(1).

model_gateway/src/routers/http/router.rs (3)
1234-1245: ⚠️ Potential issue | 🟠 Major. `/v1/audio/transcriptions` is still effectively unmetered.
This admission check only looks at `TranscriptionRequest`, but the expensive input lives in `AudioFile`. A large upload with an empty prompt is still charged as ~1 token, so tenant TPM can be bypassed on this endpoint. Either skip token admission here for now or charge from a conservative surrogate derived from the audio payload.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/http/router.rs` around lines 1234 - 1245, The admission check in route_audio_transcriptions currently calls enforce_token_rate_limit using only TranscriptionRequest, ignoring the expensive AudioFile payload; update the admission to either (A) perform the rate/admission check after inspecting the AudioFile (e.g., compute a conservative surrogate token count from audio duration or byte length) and pass that surrogate into the existing enforcement logic, or (B) skip token admission here and rely on a later check inside route_multipart_transcription; specifically modify route_audio_transcriptions to inspect the AudioFile (or defer) before/when calling enforce_token_rate_limit so large uploads cannot bypass tenant TPM using the audio payload.
79-93: ⚠️ Potential issue | 🟡 Minor. Use a conservative tokenizer fallback.
`chars()/4` materially undercounts CJK, code, and JSON when the tokenizer is missing or `encode()` fails, so tenants can exceed their configured TPM by a large margin. Charge from a conservative bound in both fallback branches instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/http/router.rs` around lines 79 - 93, The current fallback in count_tokens underestimates tokens by using text.chars().count()/4; update both fallback branches in count_tokens (the tokenizer_registry.get(None) branch and the encode() unwrap_or_else failure branch) to use a conservative estimate such as text.chars().count().max(1) (i.e., assume ~1 token per character) instead of dividing by 4; keep references to extract_text_for_routing(), tokenizer_registry, encode(), and count_tokens() so you change both places consistently to avoid undercharging tenants when tokenization is missing or fails.
96-108: ⚠️ Potential issue | 🟠 Major. Emit router metrics for early rate-limit rejections.
This helper returns before `route_typed_request()` or `route_multipart_transcription()` records request/upstream/error metrics, so throttled 429/413s are invisible to router telemetry.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/http/router.rs` around lines 96 - 108, enforce_token_rate_limit currently returns early on rate-limit failures so those 429/413 rejections never hit router telemetry; update enforce_token_rate_limit to record the same router metrics that route_typed_request() / route_multipart_transcription() would for rejected requests before returning (e.g., increment request count and error/upstream metrics with the appropriate status code and reason), using the existing router metrics helper functions or the same metric names used elsewhere, and then return the Response; reference enforce_token_rate_limit, route_typed_request, and route_multipart_transcription to mirror their metric emissions for early rejections.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@model_gateway/src/routers/http/pd_router.rs`:
- Around line 104-116: enforce_token_rate_limit currently returns a throttled PD
Response early which bypasses execute_dual_dispatch’s router metrics; modify
enforce_token_rate_limit (and/or its callers) to record the same router
request/upstream/error metrics for tenant throttles before returning a 429/413
response: call the existing router-metrics recording utility used by
execute_dual_dispatch (or invoke the same metric increment paths) right after
limiter.check_and_consume reports an error and before mapping to
crate::rate_limit::rate_limit_exceeded_response so tenant throttles show up in
PD dashboards/alerts.
- Around line 87-101: In count_tokens, the fallback estimate chars()/4
undercounts many languages; change both fallback branches (the None from
self.tokenizer_registry.get(model_id) and the unwrap_or_else on encode()) to use
a more conservative estimate (e.g., (text.chars().count() as u32 / 2).max(1))
instead of /4 so tenants are charged conservatively; update the early return and
the unwrap_or_else closure in the count_tokens function accordingly while
keeping the final .max(1).
In `@model_gateway/src/routers/http/router.rs`:
- Around line 1234-1245: The admission check in route_audio_transcriptions
currently calls enforce_token_rate_limit using only TranscriptionRequest,
ignoring the expensive AudioFile payload; update the admission to either (A)
perform the rate/admission check after inspecting the AudioFile (e.g., compute a
conservative surrogate token count from audio duration or byte length) and pass
that surrogate into the existing enforcement logic, or (B) skip token admission
here and rely on a later check inside route_multipart_transcription;
specifically modify route_audio_transcriptions to inspect the AudioFile (or
defer) before/when calling enforce_token_rate_limit so large uploads cannot
bypass tenant TPM using the audio payload.
- Around line 79-93: The current fallback in count_tokens underestimates tokens
by using text.chars().count()/4; update both fallback branches in count_tokens
(the tokenizer_registry.get(None) branch and the encode() unwrap_or_else failure
branch) to use a conservative estimate such as text.chars().count().max(1)
(i.e., assume ~1 token per character) instead of dividing by 4; keep references
to extract_text_for_routing(), tokenizer_registry, encode(), and count_tokens()
so you change both places consistently to avoid undercharging tenants when
tokenization is missing or fails.
- Around line 96-108: enforce_token_rate_limit currently returns early on
rate-limit failures so those 429/413 rejections never hit router telemetry;
update enforce_token_rate_limit to record the same router metrics that
route_typed_request() / route_multipart_transcription() would for rejected
requests before returning (e.g., increment request count and error/upstream
metrics with the appropriate status code and reason), using the existing router
metrics helper functions or the same metric names used elsewhere, and then
return the Response; reference enforce_token_rate_limit, route_typed_request,
and route_multipart_transcription to mirror their metric emissions for early
rejections.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 72e81bf0-1a24-4313-9091-b5ffd7544a92
📒 Files selected for processing (5)
model_gateway/src/main.rs
model_gateway/src/rate_limit/error.rs
model_gateway/src/rate_limit/local.rs
model_gateway/src/routers/http/pd_router.rs
model_gateway/src/routers/http/router.rs
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: beab2effd1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Force-pushed: beab2ef to 5c382a2
Actionable comments posted: 2
♻️ Duplicate comments (3)
model_gateway/src/routers/http/router.rs (2)
85-92: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Use a conservative fallback when the tokenizer is unavailable.
`chars / 4` is still an under-estimate for CJK, code, JSON, and other dense inputs, so any model missing from `tokenizer_registry` can exceed its configured TPM by a wide margin while still being admitted. This path should bias high, not low.
Suggested fix

+ let conservative_fallback = || (text.chars().count() as u32).max(1);
  let Some(tokenizer) = self.tokenizer_registry.get(model_id) else {
-     return ((text.chars().count() as u32) / 4).max(1);
+     return conservative_fallback();
  };
  tokenizer
      .encode(&text, false)
      .map(|encoding| encoding.token_ids().len() as u32)
-     .unwrap_or_else(|_| ((text.chars().count() as u32) / 4).max(1))
+     .unwrap_or_else(|_| conservative_fallback())
      .max(1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/http/router.rs` around lines 85 - 92, The current fallback in the tokenizer lookup path (when tokenizer_registry.get(model_id) returns None) under-estimates tokens by using ((text.chars().count() as u32) / 4).max(1); change this to a conservative, upper-biased estimate so models without a tokenizer are throttled rather than under-counted — for example replace the divide-by-4 heuristic with a higher bound such as text.chars().count() as u32 (or text.len() as u32, or chars * 2) and use .max(1) to avoid zero; update the fallback used in both the early return and the unwrap_or_else branch that follows the encode call so encode(...).map(...).unwrap_or_else(...) and the earlier return both use the same conservative estimator.
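The conservative estimator suggested above can be sketched as a standalone function. This is an illustrative sketch, not the gateway's actual code; `conservative_fallback` is a hypothetical name, and the point is simply that the fallback charges roughly one token per character so dense inputs are over-counted rather than under-counted:

```rust
// Hedged sketch of the review's suggested fallback: when no tokenizer
// is available (or encoding fails), charge ~1 token per character so
// dense inputs (CJK, code, JSON) bias the estimate high, not low.
fn conservative_fallback(text: &str) -> u32 {
    (text.chars().count() as u32).max(1)
}

fn main() {
    // Empty input is still charged at least one token.
    assert_eq!(conservative_fallback(""), 1);
    // The old heuristic would charge 10 / 4 = 2 tokens here; the
    // conservative bound charges one token per character.
    assert_eq!(conservative_fallback("你好世界你好世界你好"), 10);
    println!("ok");
}
```

Over-charging is the safe failure mode for admission control: a tenant may be throttled slightly early, but cannot blow past its configured TPM.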
96-108: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Record a dedicated metric for admission-control rejections.
This helper short-circuits every throttled 429/413, but it does not emit any replacement metric. That leaves tenant throttling invisible in dashboards and alerts even though skipping the existing worker-dispatch metrics is correct here. Please increment a dedicated throttle/admission counter before returning the rate-limit response.
Based on learnings, throttled HTTP requests should emit a dedicated admission-control metric rather than the existing worker-dispatch metrics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/http/router.rs` around lines 96 - 108, The enforce_token_rate_limit function currently returns a rate-limit Response when limiter.check_and_consume(...) errors, but does not record a dedicated admission-control metric; modify enforce_token_rate_limit so that immediately after detecting the error from limiter.check_and_consume (the err() branch) you increment the dedicated throttle/admission counter (e.g., call your metrics API such as self.metrics.increment_admission_throttle(...) or crate::metrics::ADMISSION_THROTTLE_COUNTER.inc_with_labels(...) with tenant_meta.tenant_key() and model_id as labels) and then return crate::rate_limit::rate_limit_exceeded_response; ensure the metric increment happens before mapping/returning the Response in enforce_token_rate_limit.
model_gateway/src/routers/http/pd_router.rs (1)
104-116: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Emit the dedicated throttled-request metric on limiter rejects.
This early-exit path now returns the limiter response without incrementing any admission-control counter, so tenant throttling disappears from router observability. Please record the dedicated throttle metric here before returning the 429/413 response.
📈 Suggested change
  limiter
      .check_and_consume(tenant_meta.tenant_key().as_str(), estimated_tokens)
      .err()
-     .map(crate::rate_limit::rate_limit_exceeded_response)
+     .map(|retry_after| {
+         Metrics::record_request_throttled(
+             metrics_labels::ROUTER_HTTP,
+             metrics_labels::BACKEND_PD,
+             model_id,
+             tenant_meta.tenant_key().as_str(),
+         );
+         crate::rate_limit::rate_limit_exceeded_response(retry_after)
+     })

Based on learnings, throttled HTTP-router early exits in this repo should emit a dedicated admission-control metric instead of worker-dispatch metrics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/http/pd_router.rs` around lines 104 - 116, The early-exit in enforce_token_rate_limit returns a limiter rejection response without emitting the admission-control "throttled request" metric; before mapping the limiter error to crate::rate_limit::rate_limit_exceeded_response, increment the dedicated router admission-control throttle metric (e.g. self.metrics.admission_control.throttled_requests or call a helper like self.emit_throttled_admission_metric) so tenant throttles are recorded when app_token_rate_limiter (and its check_and_consume path) rejects a request. Ensure the metric increment occurs only on the .err() branch and preserve the existing return value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/rate_limit/local.rs`:
- Around line 170-181: The current cleanup collects stale_keys based on a
snapshot but removes them later without re-checking, allowing races where
check_and_consume() refreshes a bucket and the cleaner still deletes it; change
the loop to validate staleness under the same lock/decision point before
removing: iterate over self.buckets (like the existing iterator), acquire the
bucket's state lock, call stale_bucket_ttl(&entry.value().policy) and check
now.duration_since(state.last_refill) > ttl while holding the lock, and only
then call self.buckets.remove(entry.key()) (or remove via a deterministic path)
so removal only happens if still stale; ensure you reference the same symbols
(self.buckets, entry.value().state.lock(), stale_bucket_ttl, last_refill,
check_and_consume) and do not rely on the earlier collected stale_keys snapshot.
In `@model_gateway/src/routers/http/pd_router.rs`:
- Around line 87-102: count_tokens currently uses
body.extract_text_for_routing(), which undercounts requests that provide
pre-tokenized input (e.g., GenerationRequest with input_ids); change it to
prioritize request-native inputs: if the GenerationRequest exposes
input_ids/inputs/numeric token arrays use their length as the token count,
otherwise fall back to using the tokenizer via tokenizer_registry.get(model_id)
on the actual text (or the char-heuristic fallback). Update the logic inside
count_tokens (and any branches that call extract_text_for_routing) so it checks
for and counts body.input_ids (or equivalent accessor on GenerationRequest)
first, then uses tokenizer.encode(&text, ...) only when no native token array is
present, preserving the existing fallback to char-based estimate and min 1.
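The input_ids-first counting suggested above can be sketched as follows. The struct shape and field names are assumptions for illustration, not the actual `GenerationRequest`; the point is only the ordering: exact pre-tokenized lengths first, heuristics last.

```rust
// Illustrative request shape: a request may carry pre-tokenized input
// (exact) or raw text (needs estimation). Not the PR's actual types.
struct GenerationRequest {
    input_ids: Option<Vec<u32>>,
    text: Option<String>,
}

fn count_tokens(req: &GenerationRequest) -> u32 {
    // Prefer caller-supplied token ids: their length is exact, so no
    // tokenizer or heuristic is needed.
    if let Some(ids) = &req.input_ids {
        return (ids.len() as u32).max(1);
    }
    // Fallback: a char-based heuristic stands in here for the real
    // tokenizer path (tokenizer_registry lookup + encode).
    let text = req.text.as_deref().unwrap_or("");
    (text.chars().count() as u32).max(1)
}

fn main() {
    let pre_tokenized = GenerationRequest {
        input_ids: Some(vec![1, 2, 3, 4, 5]),
        text: None,
    };
    assert_eq!(count_tokens(&pre_tokenized), 5);

    let text_only = GenerationRequest {
        input_ids: None,
        text: Some("hello".into()),
    };
    assert_eq!(count_tokens(&text_only), 5);
    println!("ok");
}
```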
---
Duplicate comments:
In `@model_gateway/src/routers/http/pd_router.rs`:
- Around line 104-116: The early-exit in enforce_token_rate_limit returns a
limiter rejection response without emitting the admission-control "throttled
request" metric; before mapping the limiter error to
crate::rate_limit::rate_limit_exceeded_response, increment the dedicated router
admission-control throttle metric (e.g.
self.metrics.admission_control.throttled_requests or call a helper like
self.emit_throttled_admission_metric) so tenant throttles are recorded when
app_token_rate_limiter (and its check_and_consume path) rejects a request.
Ensure the metric increment occurs only on the .err() branch and preserve the
existing return value.
In `@model_gateway/src/routers/http/router.rs`:
- Around line 85-92: The current fallback in the tokenizer lookup path (when
tokenizer_registry.get(model_id) returns None) under-estimates tokens by using
((text.chars().count() as u32) / 4).max(1); change this to a conservative,
upper-biased estimate so models without a tokenizer are throttled rather than
under-counted — for example replace the divide-by-4 heuristic with a higher
bound such as text.chars().count() as u32 (or text.len() as u32, or chars * 2)
and use .max(1) to avoid zero; update the fallback used in both the early return
and the unwrap_or_else branch that follows the encode call so
encode(...).map(...).unwrap_or_else(...) and the earlier return both use the
same conservative estimator.
- Around line 96-108: The enforce_token_rate_limit function currently returns a
rate-limit Response when limiter.check_and_consume(...) errors, but does not
record a dedicated admission-control metric; modify enforce_token_rate_limit so
that immediately after detecting the error from limiter.check_and_consume (the
err() branch) you increment the dedicated throttle/admission counter (e.g., call
your metrics API such as self.metrics.increment_admission_throttle(...) or
crate::metrics::ADMISSION_THROTTLE_COUNTER.inc_with_labels(...) with
tenant_meta.tenant_key() and model_id as labels) and then return
crate::rate_limit::rate_limit_exceeded_response; ensure the metric increment
happens before mapping/returning the Response in enforce_token_rate_limit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c6a7f03c-afac-434d-a977-2d6bbbf04e85
📒 Files selected for processing (5)
model_gateway/src/main.rs
model_gateway/src/rate_limit/error.rs
model_gateway/src/rate_limit/local.rs
model_gateway/src/routers/http/pd_router.rs
model_gateway/src/routers/http/router.rs
💡 Codex Review
Reviewed commit: 6f2cfe1927
💡 Codex Review
Reviewed commit: 8c25005e02
Force-pushed: de410bc to d640cb6
Hi @shenoyvvarun, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

    git fetch origin main
    git rebase origin/main
    # resolve any conflicts, then:
    git push --force-with-lease
…okens. Signed-off-by: Varun Shenoy <[email protected]>
💡 Codex Review
Reviewed commit: d640cb61f6
Force-pushed: d640cb6 to 7db0912
💡 Codex Review
Reviewed commit: 7db09123fd
Description
Problem
High level Design
Next steps
- Only input tokens are charged at the moment; output tokens are next.
- No reserve-once / settle-later reservation handle yet.
- Per-model layered scopes.
- Local backend only; no external backend.
- No explicit request metadata / charge-id plumbing yet.
- No exact-token gRPC pre-retry preparation split yet.
- Extend the limiter from input-token-only admission to full input + output token accounting.
- Introduce reserve-once / settle-later semantics so retries do not double-charge.
Solution
It wires the limiter into app startup and HTTP routing, and adds unit tests for limiter behavior and policy lookup.
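The admission check at the heart of this wiring is a per-tenant token bucket: tokens refill continuously at TPM / 60 per second, and a request is admitted only if the bucket holds enough tokens to cover its estimated cost. The sketch below illustrates that mechanic under assumed names (`TenantBucket`, `check_and_consume`); it is not the PR's exact API:

```rust
use std::time::Instant;

// Hedged sketch of a per-tenant token bucket. Capacity is one minute
// of TPM (the burst size); refill is continuous at tpm / 60 per sec.
struct TenantBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TenantBucket {
    fn new(tpm: f64) -> Self {
        Self {
            capacity: tpm,
            refill_per_sec: tpm / 60.0,
            tokens: tpm,
            last_refill: Instant::now(),
        }
    }

    /// Admit and consume `cost` tokens, or return the number of
    /// seconds to wait (a Retry-After hint) before enough refill.
    fn check_and_consume(&mut self, cost: f64) -> Result<(), u64> {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= cost {
            self.tokens -= cost;
            Ok(())
        } else {
            let deficit = cost - self.tokens;
            Err((deficit / self.refill_per_sec).ceil() as u64)
        }
    }
}

fn main() {
    let mut bucket = TenantBucket::new(600.0); // 600 TPM => 10 tokens/sec
    assert!(bucket.check_and_consume(500.0).is_ok()); // within burst
    // The ~100 remaining tokens cannot cover a 400-token request; the
    // caller gets a wait hint instead of admission.
    let retry_after = bucket.check_and_consume(400.0).unwrap_err();
    assert!(retry_after >= 1);
    println!("ok");
}
```

A rejection maps naturally to a 429 with a Retry-After header; a request whose estimated cost exceeds the bucket's capacity can never be admitted and is the terminal 413 case mentioned above.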
Test cases (passed)
1. Tenant isolation
Goal: prove one tenant does not affect another.
Test:
- exhaust team-a's limit
- immediately send the same request as team-b
Expected:
- team-a gets 429
- team-b still succeeds
2. Default policy fallback
Goal: verify tenants without an explicit override use the default policy.
Test:
- send requests with `x-smg-tenant-id: unknown-team`
- do not define `header:unknown-team` in the CLI config
Expected:
- default TPM/RPM limits apply
- not unlimited
Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes
Summary by CodeRabbit
New Features
Behavior Changes