fix(local): cap keycloak jvm heap and raise nats helm wait timeout#79

Open
drobertson123 wants to merge 1 commit into NVIDIA:main from drobertson123:fix/local-e2e-keycloak-heap-nats-timeout

@drobertson123 drobertson123 commented Jun 13, 2026

Summary

Two fixes to the local Kind e2e stack (make test) that were preventing it from completing:

  • Keycloak OOMKill crashloop. With no JVM heap settings, Quarkus defaults to MaxRAMPercentage=70%, so heap + non-heap RSS (metaspace, GC, buffers) exceeded the container memory limit and the pod was OOMKilled during its build phase — regardless of machine speed. Constrain the heap via JAVA_OPTS_KC_HEAP=-Xms256m -Xmx768m and bump the limit to 1.5Gi for headroom. Keycloak now reaches Ready in ~30s.
  • NATS deploy timeout. nats-event-bus is deployed with wait: true, but skaffold relied on Helm's default 5-minute --wait timeout. The mTLS NATS cluster can take longer to become ready on slower/loaded machines, producing context deadline exceeded even though the release ultimately deploys. Raise the upgrade --timeout to 15m on all three cluster configs (csc, cpc-1, cpc-2).
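
The memory arithmetic behind the Keycloak bullet can be sketched roughly as follows. This is only an illustration, assuming the JVM sizes its default maximum heap as 70% of the container memory limit (550Mi before this change, per the diff) and that `-Xmx768m` means 768 MiB. The function name `non_heap_headroom` is hypothetical, and the sketch does not measure how much non-heap memory Keycloak actually needs.

```python
# Rough memory budget for the Keycloak container, in MiB.
# Figures are taken from the PR text and diff; the helper name is hypothetical.


def non_heap_headroom(limit_mib: int, max_heap_mib: int) -> int:
    """Return the MiB left inside the container limit once the heap is full."""
    return limit_mib - max_heap_mib


# Before: 550Mi limit and no heap flags, so the heap ceiling is assumed
# to be 70% of the limit (MaxRAMPercentage=70).
old_limit = 550
old_max_heap = old_limit * 70 // 100
old_headroom = non_heap_headroom(old_limit, old_max_heap)

# After: 1536Mi (1.5Gi) limit with the heap capped by -Xmx768m.
new_limit = 1536
new_max_heap = 768
new_headroom = non_heap_headroom(new_limit, new_max_heap)

print(f"before: heap <= {old_max_heap} MiB, non-heap headroom {old_headroom} MiB")
print(f"after:  heap <= {new_max_heap} MiB, non-heap headroom {new_headroom} MiB")
```

Under these assumptions the old settings leave 165 MiB for metaspace, GC structures and buffers when the heap is full, while the new settings leave 768 MiB.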

Testing

Full make test (3-cluster Kind e2e incl. federation perf tests) passes end-to-end with these changes:

Test Summary: 12 passed, 0 failed out of 12 total

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Enhanced Keycloak memory configuration to improve stability and prevent resource constraints during operation.
    • Extended deployment timeout for NATS services to provide additional headroom during cluster initialization and upgrades.

Keycloak was OOMKilled in a crashloop during its build phase: with no
heap settings the Quarkus default MaxRAMPercentage=70% plus non-heap RSS
(metaspace, GC, buffers) exceeded the container limit. Constrain the heap
via JAVA_OPTS_KC_HEAP so it fits, and bump the limit to 1.5Gi for
headroom. Keycloak now reaches Ready in ~30s.

The nats-event-bus Helm release uses wait: true, but skaffold relied on
Helm's default 5m --wait timeout. The mTLS NATS cluster can take longer
to become ready on slower/loaded machines, producing
"context deadline exceeded" even though the release ultimately deploys.
Raise the upgrade --timeout to 15m on all three cluster configs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@drobertson123 drobertson123 requested review from a team and Copilot June 13, 2026 19:00
copy-pr-bot Bot commented Jun 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d7de073d-cac4-4812-bd43-5b6f513a16a2

📥 Commits

Reviewing files that changed from the base of the PR and between a08ed99 and 6763346.

📒 Files selected for processing (2)
  • local/infra/keycloak/keycloak.yaml
  • local/nats/skaffold.releases.yaml

📝 Walkthrough

Updates Keycloak container memory resources and adds JVM heap configuration. Extends Helm deployment upgrade timeouts from 5 to 15 minutes for three NATS release environments to accommodate mTLS cluster readiness checks.

Changes

Infrastructure deployment tuning

| Layer / File(s) | Summary |
| --- | --- |
| Keycloak memory and JVM heap tuning (local/infra/keycloak/keycloak.yaml) | Memory request and limit increased for the Keycloak pod, and JAVA_OPTS_KC_HEAP environment variable added to constrain JVM heap size within the container memory limits. |
| NATS Helm upgrade timeout configuration (local/nats/skaffold.releases.yaml) | Helm upgrade --timeout=15m flag added to three NATS release configurations (base nats-event-bus, nats-cpc-1, and nats-cpc-2) to allow extra deployment headroom for mTLS cluster readiness. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/dsx-exchange#73: Both PRs modify local/nats/skaffold.releases.yaml to adjust NATS cluster readiness handling during local deployment.

Suggested reviewers

  • bryan-aguilar

Poem

🐰 Keycloak learns to fit just right,
With memory tuned and heaps in flight.
NATS takes time to cluster well,
Fifteen minutes—a patient bell! ⏰

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes both main changes: capping Keycloak JVM heap and raising NATS Helm wait timeout, matching the PR objectives. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adjusts local deployment settings to reduce flaky/failed startups on slower machines by increasing Helm timeouts for NATS and tuning Keycloak memory/JVM heap to avoid OOM kills.

Changes:

  • Increase Helm upgrade timeout to 15 minutes for local NATS releases in Skaffold.
  • Raise Keycloak container memory requests/limits.
  • Constrain Keycloak JVM heap via environment variable to keep RSS within container limits.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| local/nats/skaffold.releases.yaml | Adds Helm upgrade --timeout=15m to give local NATS clusters more time to become ready. |
| local/infra/keycloak/keycloak.yaml | Increases Keycloak memory resources and sets JVM heap bounds to reduce OOMKilled restarts. |

Comment thread local/nats/skaffold.releases.yaml
Comment on lines 18 to +24
  resources:
    requests:
      cpu: "100m"
-     memory: "550Mi"
+     memory: "1Gi"
    limits:
      cpu: "1000m"
-     memory: "550Mi"
+     memory: "1536Mi"

Collaborator

@drobertson123 I tuned the memory usage pretty low on purpose so it can fit on smaller machines. It should be using ~80-90% of the requested memory; I measured a run just now at 458MiB used. The e2e tests in CI show the deployment succeeding. Are you running additional activity on Keycloak outside of the normal test flow?

Comment on lines +39 to +44
+ # Constrain the JVM heap so heap + non-heap RSS fits inside the
+ # container memory limit. Without this, Quarkus defaults to
+ # MaxRAMPercentage=70%, and heap + metaspace/GC/buffers exceed the
+ # limit and the pod is OOMKilled during the build phase.
+ - name: JAVA_OPTS_KC_HEAP
+   value: "-Xms256m -Xmx768m"