fix(local): cap keycloak jvm heap and raise nats helm wait timeout by drobertson123 · Pull Request #79 · NVIDIA/dsx-exchange

drobertson123 · 2026-06-13T19:00:07Z

Summary

Two fixes to the local Kind e2e stack (make test) that were preventing it from completing:

- Keycloak OOMKill crashloop. With no JVM heap settings, Quarkus defaults to MaxRAMPercentage=70%, so heap + non-heap RSS (metaspace, GC, buffers) exceeded the container memory limit and the pod was OOMKilled during its build phase, regardless of machine speed. Constrain the heap via `JAVA_OPTS_KC_HEAP=-Xms256m -Xmx768m` and bump the limit to 1.5Gi for headroom. Keycloak now reaches Ready in ~30s.
- NATS deploy timeout. nats-event-bus is deployed with `wait: true`, but skaffold relied on Helm's default 5-minute `--wait` timeout. The mTLS NATS cluster can take longer to become ready on slower/loaded machines, producing `context deadline exceeded` even though the release ultimately deploys. Raise the upgrade `--timeout` to 15m on all three cluster configs (csc, cpc-1, cpc-2).
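
For readers unfamiliar with how Skaffold hands flags to Helm, the NATS change could look roughly like the fragment below. This is a minimal sketch and not the contents of `local/nats/skaffold.releases.yaml`: the field layout and the use of `csc` as the config name are assumptions, and `apiVersion` and the chart location are omitted because they are not shown in this PR. Only the release name, `wait: true`, and `--timeout=15m` come from the description above.

```yaml
# Hypothetical Skaffold config fragment; layout is an assumption,
# not copied from local/nats/skaffold.releases.yaml.
kind: Config
metadata:
  name: csc                  # assumed name; the PR repeats the flag for cpc-1 and cpc-2
deploy:
  helm:
    releases:
      - name: nats-event-bus
        # chart location and values omitted; they are not shown in this PR
        wait: true           # Skaffold adds --wait, so Helm blocks until resources are ready
    flags:
      upgrade:
        - --timeout=15m      # extra arguments for `helm upgrade`; Helm's default is 5m
```

Under these assumptions, the same `flags.upgrade` entry would appear once per cluster config, since Skaffold scopes Helm flags to the deployer rather than to an individual release.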

Testing

Full make test (3-cluster Kind e2e incl. federation perf tests) passes end-to-end with these changes:

Test Summary: 12 passed, 0 failed out of 12 total

🤖 Generated with Claude Code

Summary by CodeRabbit

Chores
- Enhanced Keycloak memory configuration to improve stability and prevent resource constraints during operation.
- Extended deployment timeout for NATS services to provide additional headroom during cluster initialization and upgrades.

Keycloak was OOMKilled in a crashloop during its build phase: with no heap settings the Quarkus default MaxRAMPercentage=70% plus non-heap RSS (metaspace, GC, buffers) exceeded the container limit. Constrain the heap via JAVA_OPTS_KC_HEAP so it fits, and bump the limit to 1.5Gi for headroom. Keycloak now reaches Ready in ~30s.

The nats-event-bus Helm release uses wait: true, but skaffold relied on Helm's default 5m --wait timeout. The mTLS NATS cluster can take longer to become ready on slower/loaded machines, producing "context deadline exceeded" even though the release ultimately deploys. Raise the upgrade --timeout to 15m on all three cluster configs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

copy-pr-bot · 2026-06-13T19:00:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-13T19:00:20Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d7de073d-cac4-4812-bd43-5b6f513a16a2

📥 Commits

Reviewing files that changed from the base of the PR and between a08ed99 and 6763346.

📒 Files selected for processing (2)

local/infra/keycloak/keycloak.yaml
local/nats/skaffold.releases.yaml

📝 Walkthrough

Updates Keycloak container memory resources and adds JVM heap configuration. Extends Helm deployment upgrade timeouts from 5 to 15 minutes for three NATS release environments to accommodate mTLS cluster readiness checks.

Changes

Infrastructure deployment tuning

Layer / File(s)	Summary
Keycloak memory and JVM heap tuning `local/infra/keycloak/keycloak.yaml`	Memory request and limit increased for the Keycloak pod, and `JAVA_OPTS_KC_HEAP` environment variable added to constrain JVM heap size within the container memory limits.
NATS Helm upgrade timeout configuration `local/nats/skaffold.releases.yaml`	Helm upgrade `--timeout=15m` flag added to three NATS release configurations (base `nats-event-bus`, `nats-cpc-1`, and `nats-cpc-2`) to allow extra deployment headroom for mTLS cluster readiness.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA/dsx-exchange#73: Both PRs modify local/nats/skaffold.releases.yaml to adjust NATS cluster readiness handling during local deployment.

Suggested reviewers

bryan-aguilar

Poem

🐰 Keycloak learns to fit just right,
With memory tuned and heaps in flight.
NATS takes time to cluster well,
Fifteen minutes—a patient bell! ⏰

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes both main changes: capping Keycloak JVM heap and raising NATS Helm wait timeout, matching the PR objectives.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adjusts local deployment settings to reduce flaky/failed startups on slower machines by increasing Helm timeouts for NATS and tuning Keycloak memory/JVM heap to avoid OOM kills.

Changes:

- Increase Helm upgrade timeout to 15 minutes for local NATS releases in Skaffold.
- Raise Keycloak container memory requests/limits.
- Constrain Keycloak JVM heap via environment variable to keep RSS within container limits.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
local/nats/skaffold.releases.yaml	Adds Helm upgrade `--timeout=15m` to give local NATS clusters more time to become ready.
local/infra/keycloak/keycloak.yaml	Increases Keycloak memory resources and sets JVM heap bounds to reduce OOMKilled restarts.

FrankSpitulski · 2026-06-15T20:35:26Z

  resources:
    requests:
      cpu: "100m"
-      memory: "550Mi"
+      memory: "1Gi"
    limits:
      cpu: "1000m"
-      memory: "550Mi"
+      memory: "1536Mi"


@drobertson123 I tuned the usage pretty low on purpose so it can fit on smaller machines. It should be using ~80-90% of the requested memory; I measured a run just now at 458MiB used. The e2e tests in CI show the deployment succeeding. Are you running additional activity on Keycloak outside of the normal test flow?

+              # Constrain the JVM heap so heap + non-heap RSS fits inside the
+              # container memory limit. Without this, Quarkus defaults to
+              # MaxRAMPercentage=70%, and heap + metaspace/GC/buffers exceed the
+              # limit and the pod is OOMKilled during the build phase.
+              - name: JAVA_OPTS_KC_HEAP
+                value: "-Xms256m -Xmx768m"
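
Taken together, the two hunks above would leave the Keycloak container configured roughly as follows. This is a sketch assembled from the values in the diff; the container name and the nesting under `containers:` are assumptions, and the budget arithmetic in the comments is illustrative rather than measured.

```yaml
# Sketch of the resulting container spec. `name: keycloak` and the nesting
# are assumptions; the values themselves come from the diff.
containers:
  - name: keycloak
    env:
      - name: JAVA_OPTS_KC_HEAP
        value: "-Xms256m -Xmx768m"   # heap: 256 MiB initial, 768 MiB maximum
    resources:
      requests:
        cpu: "100m"
        memory: "1Gi"
      limits:
        cpu: "1000m"
        memory: "1536Mi"             # 1536 MiB - 768 MiB max heap = 768 MiB for non-heap
```

For comparison, if the 70% default is applied to the previous 550Mi limit, the heap could grow to 385Mi, leaving 165Mi for metaspace, GC structures, and buffers.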



drobertson123 requested review from a team and Copilot June 13, 2026 19:00

Copilot AI reviewed Jun 13, 2026

FrankSpitulski requested changes Jun 15, 2026

drobertson123 wants to merge 1 commit into NVIDIA:main from drobertson123:fix/local-e2e-keycloak-heap-nats-timeout
