Skip to content

Conversation

@musharafmaqbool
Copy link

@musharafmaqbool musharafmaqbool commented Sep 6, 2025

Summary

Replaces verbose logging with structured Kubernetes events for IP allocation failures, providing actionable diagnostics to users and operators.

What type of PR is this?

enhancement

Changes Made

  • Replaced verbose logging with structured Kubernetes events on nodes
  • Added detailed diagnostic information including:
    • Specific failure reasons (subnet exhaustion, ENI limits, fragmentation)
    • Available IP counts and subnet details
    • ENI utilization and limits
    • Actionable guidance for operators
  • Reduced log verbosity while maintaining essential debug information
  • Enhanced user experience - failures now visible via kubectl describe node

Why this change?

Addresses reviewer feedback on PR #3429. Instead of burying diagnostic information in verbose logs, this surfaces the "why" of IP allocation failures through Kubernetes events that users can actually see and act upon.

Testing

  • Unit tests for event emission logic
  • Manual testing of IP allocation failure scenarios
  • Verified events appear correctly in kubectl describe node output

Result

  • Users get clear, actionable information when pods can't get IPs
  • Operators can quickly identify if issues are due to fragmentation, ENI limits, or capacity
  • Reduced log noise while improving diagnostics

Closes #3429 (replaces previous implementation)

…ws#3415

Enhanced error messages for IP address allocation failures to provide more detailed diagnostic information as requested in issue aws#3415.

Changes made:
- Added current IP usage statistics (used/available) to error messages
- Included ENI limit information in allocation failure messages  
- Enhanced subnet configuration context in error logs
- Provided specific failure reasons to assist with troubleshooting

This improvement will help users quickly identify the root cause of IP assignment failures:
- Whether it's due to subnet IP exhaustion
- ENI limits being reached
- VPC/subnet configuration issues
- IP warming delays

The enhanced error messages follow the format:
"failed to allocate IP on ENI %s: %v. Usage: %d/%d IPs in subnet, ENI limit: %d/%d"
Enhance IP assignment error messages with detailed diagnostics  Fixes…
…tation info

Follow-up to #[previous PR number] addressing feedback from @erezzarum

**Changes made:**
- Added ENI count and limit information to allocation failures
- Enhanced IPv4/IPv6 prefix allocation errors with current usage stats
- Included fragmentation detection in error messages
- Added trunk ENI mode and prefix delegation context

**Addresses the following issues:**
1. Better root cause identification for prefix delegation scenarios
2. ENI allocation status information in error messages  
3. More detailed fragmentation context for troubleshooting

**Testing:**
Enhanced error messages now provide actionable diagnostic information for common IP allocation failure scenarios.
Enhance IP allocation error diagnostics with detailed ENI and fragmen…
…ailures

@jaydeokar Thank you for the excellent feedback! I've updated the implementation to address your concerns:

🔄 **Changes Made:**
- **Replaced verbose logging** with structured Kubernetes events
- **Events are emitted on nodes** when IP allocation fails - users can see them via `kubectl describe node`
- **Detailed diagnostic information** including:
  - Specific failure reason (subnet exhaustion, ENI limits, fragmentation)
  - Available IP counts and subnet details
  - Actionable guidance for operators
- **Reduced log verbosity** - keeping only essential debug information

🎯 **Result:**
- Users now get clear, actionable information about why their pods can't get IPs
- The "why" is surfaced through Kubernetes events instead of buried in logs
- Operators can quickly identify if it's fragmentation, ENI limits, or genuine capacity issues

This addresses the core issue you raised about providing meaningful diagnostics to users rather than just verbose logs for debugging. The gRPC handler now gets structured error information while users get events they can actually act on.

Ready for re-review! 🚀
@musharafmaqbool musharafmaqbool requested a review from a team as a code owner September 6, 2025 06:24
@musharafmaqbool
Copy link
Author

@labria hey can you please review my pr?

@larhauga
Copy link

larhauga commented Nov 3, 2025

@labria sorry for pinging you here, but we are trying to debug high latency when allocating ENIs and would really like this PR to extend our debugging options. Would you be able to review this?

@musharafmaqbool
Copy link
Author

@labria sure i will loook it into it

rcv1alpha1 "github.com/aws/amazon-vpc-resource-controller-k8s/apis/vpcresources/v1alpha1"
)

// Add these type definitions right after the imports, before the package comment:
Copy link
Contributor

@yash97 yash97 Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove these comments which looks like added by LLM

Comment on lines +78 to +81
// The package ipamd is a long running daemon which manages a warm pool of available IP addresses.
// It also monitors the size of the pool, dynamically allocates more ENIs when the pool size goes below
// the minimum threshold and frees them back when the pool size goes above max threshold.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate comment here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants