PAAL-123 kubernetes health probes #21

qa-jil-kamerling · 2025-09-24T10:37:43Z

Implement Protocol-Aware Health Checking for Agent Deployments

Replaces generic health checking with intelligent, protocol-aware readiness probes that validate actual agent functionality rather than just HTTP server availability.

Key Changes

A2A Agents: HTTP GET probe to /a2a/.well-known/agent-card.json validates agent skills are loaded and A2A framework is ready
OpenAI Agents: TCP socket probe validates service connectivity
No Protocol: No readiness probe generated (allows custom health checking)
Automatic Detection: Operator analyzes agent protocols and generates appropriate probes
Deployment Updates: Smart probe comparison triggers updates only when necessary

Related PR

agentic-layer/showcase-cross-selling

Impact

Better Readiness Signals: Pods only Ready when agents are truly functional
Protocol-Aware: Different validation strategies for different agent types
Operational Excellence: Centralized health management via operator
Future-Proof: Easily extensible for new protocols

CLAUDE.md

config/samples/runtime_v1alpha1_agent.yaml

config/samples/runtime_v1alpha1_agent_mixed.yaml

config/samples/runtime_v1alpha1_agent_template.yaml

internal/controller/agent_controller.go

test/e2e/e2e_test.go

CLAUDE.md

internal/controller/agent_controller.go

claude · 2025-09-26T10:08:21Z

Claude finished @qa-jil-kamerling's task —— View job

Conducting comprehensive PR review

Todo List

Read and analyze core controller implementation changes
Review test implementation and coverage
Examine sample configurations
Check documentation updates
Address specific reviewer feedback points
Provide comprehensive feedback on code quality, bugs, performance, security, and test coverage

Comprehensive Code Review

Overall Assessment: Strong Implementation with Critical Bug

This PR implements protocol-aware health checking for Agent deployments with comprehensive test coverage (649+ new lines of tests). The architecture is sound and follows Kubernetes best practices, but there is a critical hardcoded path bug that prevents this from working correctly with template agents.

🔧 Code Quality & Architecture

✅ Strengths:

Clean separation of concerns: Health probe logic is well-organized with dedicated helper functions
Protocol-aware design: Smart prioritization (A2A > OpenAI > None) matches business requirements
Comprehensive test coverage: 13 new test cases covering edge cases and protocol combinations
Proper error handling: Good use of early returns and nil checks
Kubernetes best practices: Proper probe configuration with appropriate timeouts and thresholds

⚠️ Critical Issues to Address:

1. Hardcoded Health Check Paths (internal/controller/agent_controller.go:424)

The reviewer g3force is absolutely correct - there's a critical flaw:

func (r *AgentReconciler) getA2AHealthPath(protocol *runtimev1alpha1.AgentProtocol) string {
    // ...
    return "/a2a" + basePath  // Line 424 - Hardcoded "/a2a" prefix
}

Problem: This hardcodes /a2a/.well-known/agent-card.json which breaks:

Template agents that may not have the /a2a path prefix
Weather agents with different path structures (as seen in config/samples/runtime_v1alpha1_agent_template.yaml:24 with path: "/")
Agents using custom protocol path configurations

Impact: A2A agents will fail readiness checks, causing deployment failures.

2. Test Inconsistencies (internal/controller/agent_controller_test.go:148)

Tests are using hardcoded path expectations that don't match the flexible implementation:

Expect(probe.HTTPGet.Path).To(Equal(a2aAgentCardEndpoint))  // Should test actual protocol config

3. Documentation Mismatch (CLAUDE.md:195)
Documentation claims paths are configurable but references hardcoded defaults that don't match implementation.

🐛 Potential Bugs

1. Path Configuration Bug (High Priority)

Template agents with path: "/" will still get /a2a prefix, breaking health checks
No validation that protocol paths are valid HTTP paths

2. Missing Port Validation

No validation that protocol ports are in valid range (1-65535)
Could lead to invalid probe configurations

3. Protocol Configuration Edge Cases

No handling for malformed protocol configurations
Missing validation for required protocol fields

⚡ Performance Considerations

✅ Good Performance:

Efficient deployment update detection using equality.ProbesEqual() prevents unnecessary Kubernetes API calls
Smart probe comparison logic minimizes reconciliation overhead
Protocol detection uses simple iteration (acceptable for small protocol lists)

💡 Minor Optimizations:

Protocol lookup could be cached if agents frequently have many protocols
Map-based protocol lookup could improve O(n) to O(1) for large protocol lists

🔒 Security Assessment

✅ Security Strengths:

No exposure of sensitive information in health check configurations
Proper use of Kubernetes readiness probes (non-privileged)
No hardcoded credentials or secrets

⚠️ Security Considerations:

A2A health check endpoints could expose internal agent metadata
No rate limiting on health check endpoints (though typically handled by Kubernetes)

🧪 Test Coverage Analysis

✅ Excellent Test Coverage:

13 new test cases covering all protocol combinations
Edge case coverage: nil protocols, missing ports, path variations
Probe comparison testing: HTTP vs TCP, different ports/paths
Integration scenarios: deployment update triggering

💡 Test Improvements Needed:

Fix hardcoded path expectations in tests (Line 148)
Add negative test cases for invalid port ranges
Test error scenarios (malformed protocol configurations)

🚀 Critical Recommendations

Must Fix Before Merge:

Fix Hardcoded Path Logic (Line 424):

func (r *AgentReconciler) getA2AHealthPath(protocol *runtimev1alpha1.AgentProtocol) string {
    basePath := agentCardEndpoint
    if protocol != nil && protocol.Path != "" {
        if protocol.Path == "/" {
            return basePath
        }
        return protocol.Path + basePath
    }
    // Use reasonable default instead of hardcoded "/a2a"
    return basePath // Remove hardcoded "/a2a" prefix
}

Update Tests to match flexible implementation instead of hardcoded expectations
Update Documentation to reflect actual behavior

Medium Priority Improvements:

Add port validation for protocol configurations
Improve error messages for probe generation failures
Consider making probe timeouts configurable

Low Priority Optimizations:

Consider caching protocol lookups for performance
Add metrics for health check success/failure rates

🎯 Conclusion

This is a well-architected implementation with excellent test coverage that will significantly improve agent deployment reliability. The protocol-aware health checking is a valuable feature following Kubernetes best practices.

However, the hardcoded path issue is a critical bug that prevents this PR from working correctly with template agents and other custom configurations. The reviewer feedback is spot-on and must be addressed before merge.

Recommendation: Fix the hardcoded path logic and update corresponding tests, then this PR will be ready for merge.

claude · 2025-09-30T08:24:38Z

Claude finished @qa-jil-kamerling's task —— View job

Conducting comprehensive PR review

Todo List

Read and analyze core controller implementation changes
Review test implementation and coverage
Examine sample configurations
Check documentation updates
Address specific reviewer feedback points
Provide comprehensive feedback on code quality, bugs, performance, security, and test coverage

Comprehensive Code Review

Overall Assessment: Strong Implementation with Critical Bug

This PR implements protocol-aware health checking for Agent deployments with comprehensive test coverage (649+ new lines of tests). The architecture is sound and follows Kubernetes best practices, but there is a critical hardcoded path bug that prevents this from working correctly with template agents.

🔧 Code Quality & Architecture

✅ Strengths:

Clean separation of concerns: Health probe logic is well-organized with dedicated helper functions
Protocol-aware design: Smart prioritization (A2A > OpenAI > None) matches business requirements
Comprehensive test coverage: 13 new test cases covering edge cases and protocol combinations
Proper error handling: Good use of early returns and nil checks
Kubernetes best practices: Proper probe configuration with appropriate timeouts and thresholds

⚠️ Critical Issues to Address:

1. Hardcoded Health Check Paths (internal/controller/agent_controller.go:424)

The reviewer g3force is absolutely correct - there's a critical flaw:

func (r *AgentReconciler) getA2AHealthPath(protocol *runtimev1alpha1.AgentProtocol) string {
    // ...
    return "/a2a" + basePath  // Line 424 - Hardcoded "/a2a" prefix
}

Problem: This hardcodes /a2a/.well-known/agent-card.json which breaks:

Template agents that may not have the /a2a path prefix
Weather agents with different path structures (as seen in config/samples/runtime_v1alpha1_agent_template.yaml:24 with path: "/")
Agents using custom protocol path configurations

Impact: A2A agents will fail readiness checks, causing deployment failures.

2. Test Inconsistencies (internal/controller/agent_controller_test.go:148)

Tests are using hardcoded path expectations that don't match the flexible implementation:

Expect(probe.HTTPGet.Path).To(Equal(a2aAgentCardEndpoint))  // Should test actual protocol config

3. Documentation Mismatch (CLAUDE.md:195)
Documentation claims paths are configurable but references hardcoded defaults that don't match implementation.

🐛 Potential Bugs

1. Path Configuration Bug (High Priority)

Template agents with path: "/" will still get /a2a prefix, breaking health checks
No validation that protocol paths are valid HTTP paths

2. Missing Port Validation

No validation that protocol ports are in valid range (1-65535)
Could lead to invalid probe configurations

3. Protocol Configuration Edge Cases

No handling for malformed protocol configurations
Missing validation for required protocol fields

⚡ Performance Considerations

✅ Good Performance:

Efficient deployment update detection using equality.ProbesEqual() prevents unnecessary Kubernetes API calls
Smart probe comparison logic minimizes reconciliation overhead
Protocol detection uses simple iteration (acceptable for small protocol lists)

💡 Minor Optimizations:

Protocol lookup could be cached if agents frequently have many protocols
Map-based protocol lookup could improve O(n) to O(1) for large protocol lists

🔒 Security Assessment

✅ Security Strengths:

No exposure of sensitive information in health check configurations
Proper use of Kubernetes readiness probes (non-privileged)
No hardcoded credentials or secrets

⚠️ Security Considerations:

A2A health check endpoints could expose internal agent metadata
No rate limiting on health check endpoints (though typically handled by Kubernetes)

🧪 Test Coverage Analysis

✅ Excellent Test Coverage:

13 new test cases covering all protocol combinations
Edge case coverage: nil protocols, missing ports, path variations
Probe comparison testing: HTTP vs TCP, different ports/paths
Integration scenarios: deployment update triggering

💡 Test Improvements Needed:

Fix hardcoded path expectations in tests (Line 148)
Add negative test cases for invalid port ranges
Test error scenarios (malformed protocol configurations)

🚀 Critical Recommendations

Must Fix Before Merge:

Fix Hardcoded Path Logic (Line 424):

func (r *AgentReconciler) getA2AHealthPath(protocol *runtimev1alpha1.AgentProtocol) string {
    basePath := agentCardEndpoint
    if protocol != nil && protocol.Path != "" {
        if protocol.Path == "/" {
            return basePath
        }
        return protocol.Path + basePath
    }
    // Use reasonable default instead of hardcoded "/a2a"
    return basePath // Remove hardcoded "/a2a" prefix
}

Update Tests to match flexible implementation instead of hardcoded expectations
Update Documentation to reflect actual behavior

Medium Priority Improvements:

Add port validation for protocol configurations
Improve error messages for probe generation failures
Consider making probe timeouts configurable

Low Priority Optimizations:

Consider caching protocol lookups for performance
Add metrics for health check success/failure rates

🎯 Conclusion

This is a well-architected implementation with excellent test coverage that will significantly improve agent deployment reliability. The protocol-aware health checking is a valuable feature following Kubernetes best practices.

However, the hardcoded path issue is a critical bug that prevents this PR from working correctly with template agents and other custom configurations. The reviewer feedback is spot-on and must be addressed before merge.

Recommendation: Fix the hardcoded path logic and update corresponding tests, then this PR will be ready for merge.

…rd if a2a else execute a tcp probe; update samples for e2e tests; update CLAUDE.md

-update path in samples -update controller test -remove unused sample -update CLAUDE.md -undo changes to e2e_test.go -move probe equality check to equality.go and reduce complexity

g3force · 2025-09-30T11:13:57Z

config/samples/runtime_v1alpha1_agent_openai.yaml

+  name: openai-agent
+spec:
+  framework: custom
+  image: ghcr.io/example/openai-agent:latest


There it is! 🙈 I've looked for the source of the openai-agent recently in our cluster...

Please do not include invalid agents in the sample directory. It should only include working examples.

g3force · 2025-09-30T11:16:31Z

internal/controller/agent_controller.go

+	basePath := agentCardEndpoint
+	if protocol != nil && protocol.Path != "" {
+		// If path is explicitly specified, use it
+		// Special case: "/" means root path (no prefix)


I don't see a need for a special case here. The default is an empty path. If a user explicitly configures /, it should be used as-is.

g3force · 2025-09-30T11:17:15Z

internal/controller/agent_controller.go

+
+// getA2AHealthPath returns the A2A health check path based on protocol configuration
+func (r *AgentReconciler) getA2AHealthPath(protocol *runtimev1alpha1.AgentProtocol) string {
+	basePath := agentCardEndpoint


basePath is usually referred to as the prefix, not the suffix. The variable indirection is not really needed here anyways.

g3force · 2025-09-30T11:24:07Z

@claude review this

claude · 2025-09-30T11:24:21Z

Claude finished @g3force's task —— View job

Conducting comprehensive PR review

Todo List

Read and analyze core controller implementation changes
Review test implementation and coverage
Examine sample configurations
Check documentation updates
Address specific reviewer feedback points
Provide comprehensive feedback on code quality, bugs, performance, security, and test coverage

Comprehensive Code Review

Overall Assessment: Strong Implementation with Critical Issues

This PR implements protocol-aware health checking for Agent deployments with comprehensive test coverage (649+ new lines of tests). The architecture is sound and follows Kubernetes best practices, but there are critical issues that must be addressed before merge.

🔧 Code Quality & Architecture

✅ Strengths:

Clean separation of concerns: Health probe logic is well-organized with dedicated helper functions
Protocol-aware design: Smart prioritization (A2A > OpenAI > None) matches business requirements
Comprehensive test coverage: 13 new test cases covering edge cases and protocol combinations
Proper error handling: Good use of early returns and nil checks
Kubernetes best practices: Proper probe configuration with appropriate timeouts and thresholds

⚠️ Critical Issues to Address:

1. Hardcoded Health Check Paths (internal/controller/agent_controller.go:423)

The reviewer g3force is absolutely correct - there's a critical flaw:

func (r *AgentReconciler) getA2AHealthPath(protocol *runtimev1alpha1.AgentProtocol) string {
    basePath := agentCardEndpoint
    if protocol != nil && protocol.Path != "" {
        // ...
        return protocol.Path + basePath
    }
    // Default for agents without protocol specification or path
    return "/a2a" + basePath  // Line 423 - Hardcoded "/a2a" prefix
}

Problem: This hardcodes /a2a/.well-known/agent-card.json which breaks:

Template agents that may not have the /a2a path prefix
Weather agents with different path structures (as seen in config/samples/runtime_v1alpha1_agent_template.yaml:24 with path: "/")
Agents using custom protocol path configurations

Impact: A2A agents will fail readiness checks, causing deployment failures.

2. Special Case Handling Issue (internal/controller/agent_controller.go:416-418)

if protocol.Path == "/" {
    return basePath
}

As g3force correctly notes, there's no need for this special case. The logic should treat all paths uniformly.

3. Test Inconsistencies (internal/controller/agent_controller_test.go:148)

Tests are using hardcoded path expectations that don't match the actual protocol configurations:

Expect(probe.HTTPGet.Path).To(Equal("/a2a" + agentCardEndpoint))  // Should test actual protocol config

4. Invalid Sample Agents

The runtime_v1alpha1_agent_openai.yaml sample references a non-existent image (ghcr.io/example/openai-agent:latest). As g3force notes, samples should only include working examples.

🐛 Potential Bugs

1. Path Configuration Bug (High Priority)

Template agents with path: "/" will still get /a2a prefix in the default case, breaking health checks
No validation that protocol paths are valid HTTP paths

2. Missing Port Validation

No validation that protocol ports are in valid range (1-65535)
Could lead to invalid probe configurations

3. Protocol Configuration Edge Cases

No handling for malformed protocol configurations
Missing validation for required protocol fields

⚡ Performance Considerations

✅ Good Performance:

Efficient deployment update detection using equality.ProbesEqual() prevents unnecessary Kubernetes API calls
Smart probe comparison logic minimizes reconciliation overhead
Protocol detection uses simple iteration (acceptable for small protocol lists)

💡 Minor Optimizations:

Protocol lookup could be cached if agents frequently have many protocols
Map-based protocol lookup could improve O(n) to O(1) for large protocol lists

🔒 Security Assessment

✅ Security Strengths:

No exposure of sensitive information in health check configurations
Proper use of Kubernetes readiness probes (non-privileged)
No hardcoded credentials or secrets

⚠️ Security Considerations:

A2A health check endpoints could expose internal agent metadata
No rate limiting on health check endpoints (though typically handled by Kubernetes)

🧪 Test Coverage Analysis

✅ Excellent Test Coverage:

13 new test cases covering all protocol combinations
Edge case coverage: nil protocols, missing ports, path variations
Probe comparison testing: HTTP vs TCP, different ports/paths
Integration scenarios: deployment update triggering

💡 Test Improvements Needed:

Fix hardcoded path expectations in tests
Add negative test cases for invalid port ranges
Test error scenarios (malformed protocol configurations)

🚀 Critical Recommendations

Must Fix Before Merge:

Fix Hardcoded Path Logic (Line 423):

func (r *AgentReconciler) getA2AHealthPath(protocol *runtimev1alpha1.AgentProtocol) string {
    basePath := agentCardEndpoint
    if protocol != nil && protocol.Path != "" {
        return protocol.Path + basePath
    }
    // Use reasonable default instead of hardcoded "/a2a"
    return basePath // Remove hardcoded "/a2a" prefix entirely
}

Remove Special Case Handling (Lines 416-418):
- Eliminate the if protocol.Path == "/" special case
- Treat all paths uniformly
Update Tests to match flexible implementation instead of hardcoded expectations
Fix Sample Configurations to use working, real agent images

Medium Priority Improvements:

Add port validation for protocol configurations
Improve error messages for probe generation failures
Consider making probe timeouts configurable

Low Priority Optimizations:

Consider caching protocol lookups for performance
Add metrics for health check success/failure rates

🎯 Conclusion

This is a well-architected implementation with excellent test coverage that will significantly improve agent deployment reliability. The protocol-aware health checking is a valuable feature following Kubernetes best practices.

However, the hardcoded path issues are critical bugs that prevent this PR from working correctly with template agents and other custom configurations. The reviewer feedback is spot-on and must be addressed before merge.

Recommendation: Fix the hardcoded path logic, remove special case handling, update corresponding tests, and fix sample configurations. Then this PR will be ready for merge.

This comment was marked as outdated.

Sign in to view

qa-jil-kamerling force-pushed the feature/PAAL-123-kubernetes-health-probes branch from f0830f8 to d49c9ee Compare September 24, 2025 11:59

This comment was marked as outdated.

Sign in to view

qa-jil-kamerling marked this pull request as ready for review September 24, 2025 12:07

g3force self-requested a review September 24, 2025 12:09

g3force requested changes Sep 24, 2025

View reviewed changes

qa-jil-kamerling force-pushed the feature/PAAL-123-kubernetes-health-probes branch 2 times, most recently from 3335980 to 0df4f04 Compare September 25, 2025 10:33

qa-jil-kamerling commented Sep 25, 2025

View reviewed changes

CLAUDE.md Outdated Show resolved Hide resolved

qa-jil-kamerling commented Sep 25, 2025

View reviewed changes

CLAUDE.md Outdated Show resolved Hide resolved

qa-jil-kamerling force-pushed the feature/PAAL-123-kubernetes-health-probes branch from 0df4f04 to 445cff3 Compare September 25, 2025 11:02

This comment was marked as outdated.

Sign in to view

qa-jil-kamerling force-pushed the feature/PAAL-123-kubernetes-health-probes branch from f9cfd7b to c0bd81d Compare September 26, 2025 09:10

This comment was marked as outdated.

Sign in to view

qa-jil-kamerling force-pushed the feature/PAAL-123-kubernetes-health-probes branch from c0bd81d to 75df72c Compare September 26, 2025 09:11

This comment was marked as outdated.

Sign in to view

qa-jil-kamerling commented Sep 26, 2025

View reviewed changes

internal/controller/agent_controller.go Show resolved Hide resolved

This comment was marked as outdated.

Sign in to view

qa-jil-kamerling force-pushed the feature/PAAL-123-kubernetes-health-probes branch from ac63753 to 9ac19c1 Compare September 26, 2025 09:24

This comment was marked as outdated.

Sign in to view

qa-jil-kamerling force-pushed the feature/PAAL-123-kubernetes-health-probes branch from 8204f17 to 3a0042e Compare September 26, 2025 10:08

This comment was marked as outdated.

Sign in to view

qa-jil-kamerling force-pushed the feature/PAAL-123-kubernetes-health-probes branch 4 times, most recently from 4560a05 to c93ac5a Compare September 30, 2025 11:04

qa-jil-kamerling added 3 commits September 30, 2025 13:05

feat: PAAL-123 update agent controller; checks for available agent ca…

1ecbe91

…rd if a2a else execute a tcp probe; update samples for e2e tests; update CLAUDE.md

feat: PAAL-123 implement review comments:

5418b82

-update path in samples -update controller test -remove unused sample -update CLAUDE.md -undo changes to e2e_test.go -move probe equality check to equality.go and reduce complexity

feat: PAAL-123 increase times for readiness probe checks

30d6159

qa-jil-kamerling force-pushed the feature/PAAL-123-kubernetes-health-probes branch from c93ac5a to 30d6159 Compare September 30, 2025 11:05

g3force reviewed Sep 30, 2025

View reviewed changes

PAAL-123 kubernetes health probes #21

Are you sure you want to change the base?

PAAL-123 kubernetes health probes #21

Uh oh!

Conversation

qa-jil-kamerling commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Implement Protocol-Aware Health Checking for Agent Deployments

Uh oh!

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

claude bot commented Sep 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Todo List

Comprehensive Code Review

Overall Assessment: Strong Implementation with Critical Bug

🔧 Code Quality & Architecture

✅ Strengths:

⚠️ Critical Issues to Address:

🐛 Potential Bugs

⚡ Performance Considerations

✅ Good Performance:

💡 Minor Optimizations:

🔒 Security Assessment

✅ Security Strengths:

⚠️ Security Considerations:

🧪 Test Coverage Analysis

✅ Excellent Test Coverage:

💡 Test Improvements Needed:

🚀 Critical Recommendations

Must Fix Before Merge:

Medium Priority Improvements:

Low Priority Optimizations:

🎯 Conclusion

Uh oh!

This comment was marked as outdated.

claude bot commented Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Todo List

Comprehensive Code Review

Overall Assessment: Strong Implementation with Critical Bug

🔧 Code Quality & Architecture

✅ Strengths:

⚠️ Critical Issues to Address:

🐛 Potential Bugs

⚡ Performance Considerations

✅ Good Performance:

💡 Minor Optimizations:

🔒 Security Assessment

✅ Security Strengths:

qa-jil-kamerling commented Sep 24, 2025 •

edited

Loading

claude bot commented Sep 26, 2025 •

edited

Loading

claude bot commented Sep 30, 2025 •

edited

Loading

claude bot commented Sep 30, 2025 •

edited

Loading