Enhance IP allocation error diagnostics with detailed ENI and fragmentation info #3432
…ws#3415

Enhanced error messages for IP address allocation failures to provide more detailed diagnostic information, as requested in issue aws#3415.

Changes made:
- Added current IP usage statistics (used/available) to error messages
- Included ENI limit information in allocation failure messages
- Enhanced subnet configuration context in error logs
- Provided specific failure reasons to assist with troubleshooting

This improvement will help users quickly identify the root cause of IP assignment failures:
- Whether it's due to subnet IP exhaustion
- ENI limits being reached
- VPC/subnet configuration issues
- IP warming delays

The enhanced error messages follow the format:
"failed to allocate IP on ENI %s: %v. Usage: %d/%d IPs in subnet, ENI limit: %d/%d"
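The message format quoted above could be produced roughly as follows. This is a minimal sketch, not the PR's actual code: `allocationStats`, `wrapAllocError`, and `errNoFreeAddresses` are hypothetical names, and `%w` is used in place of the quoted `%v` so that callers can still unwrap the cause.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoFreeAddresses is a hypothetical sentinel error used only for this example.
var errNoFreeAddresses = errors.New("no free addresses in subnet")

// allocationStats is a hypothetical container for the counters named in the
// commit message; the real ipamd types may look different.
type allocationStats struct {
	UsedIPs, TotalIPs  int // IPs used / total IPs in the subnet
	UsedENIs, ENILimit int // ENIs attached / per-instance ENI limit
}

// wrapAllocError formats an allocation failure using the layout quoted in
// the commit message. Using %w instead of %v is a choice of this sketch: the
// rendered text is identical, but errors.Is and errors.As keep working.
func wrapAllocError(eniID string, cause error, s allocationStats) error {
	return fmt.Errorf(
		"failed to allocate IP on ENI %s: %w. Usage: %d/%d IPs in subnet, ENI limit: %d/%d",
		eniID, cause, s.UsedIPs, s.TotalIPs, s.UsedENIs, s.ENILimit,
	)
}

func main() {
	stats := allocationStats{UsedIPs: 250, TotalIPs: 251, UsedENIs: 3, ENILimit: 4}
	err := wrapAllocError("eni-0123456789abcdef0", errNoFreeAddresses, stats)
	fmt.Println(err)
	// The wrapped cause is still detectable by callers.
	fmt.Println(errors.Is(err, errNoFreeAddresses))
}
```

Keeping the counters in one struct means the same values can feed both the error string and any later event or metric without being recomputed.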
Enhance IP assignment error messages with detailed diagnostics Fixes…
…tation info

Follow-up to #[previous PR number] addressing feedback from @erezzarum

**Changes made:**
- Added ENI count and limit information to allocation failures
- Enhanced IPv4/IPv6 prefix allocation errors with current usage stats
- Included fragmentation detection in error messages
- Added trunk ENI mode and prefix delegation context

**Addresses the following issues:**
1. Better root cause identification for prefix delegation scenarios
2. ENI allocation status information in error messages
3. More detailed fragmentation context for troubleshooting

**Testing:**
Enhanced error messages now provide actionable diagnostic information for common IP allocation failure scenarios.
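To illustrate what "fragmentation detection" can mean under IPv4 prefix delegation, here is a rough sketch, assuming a subnet counts as fragmented when it still has at least 16 free addresses but no fully free, /28-aligned block. The names `analyze` and `subnetUsage` are hypothetical, the check is done locally on a list of used addresses rather than through any EC2 API, and addresses reserved by AWS must be passed in as used by the caller.

```go
package main

import (
	"fmt"
	"net/netip"
)

// blockSize is the number of addresses in an IPv4 /28 prefix.
const blockSize = 16

// subnetUsage is a hypothetical summary of free capacity in a subnet.
type subnetUsage struct {
	FreeIPs    int // addresses in the subnet not listed as used
	FreeBlocks int // /28-aligned blocks with no used address
}

// Fragmented reports free capacity that cannot satisfy a /28 request.
func (u subnetUsage) Fragmented() bool {
	return u.FreeIPs >= blockSize && u.FreeBlocks == 0
}

// toUint32 converts an IPv4 address to its numeric value.
func toUint32(a netip.Addr) uint32 {
	b := a.As4()
	return uint32(b[0])<<24 | uint32(b[1])<<16 | uint32(b[2])<<8 | uint32(b[3])
}

// analyze counts free addresses and free /28 blocks in an IPv4 subnet.
// Used addresses outside the subnet are ignored.
func analyze(cidr string, usedIPs []string) (subnetUsage, error) {
	subnet, err := netip.ParsePrefix(cidr)
	if err != nil {
		return subnetUsage{}, err
	}
	subnet = subnet.Masked()
	if !subnet.Addr().Is4() || subnet.Bits() > 28 {
		return subnetUsage{}, fmt.Errorf("need an IPv4 subnet of /28 or larger, got %s", subnet)
	}
	base := toUint32(subnet.Addr())
	size := 1 << (32 - subnet.Bits())
	seen := make(map[uint32]bool)  // offsets of used addresses
	dirty := make(map[uint32]bool) // indexes of blocks with a used address
	for _, s := range usedIPs {
		a, err := netip.ParseAddr(s)
		if err != nil {
			return subnetUsage{}, err
		}
		if !subnet.Contains(a) {
			continue
		}
		off := toUint32(a) - base
		seen[off] = true
		dirty[off/blockSize] = true
	}
	return subnetUsage{FreeIPs: size - len(seen), FreeBlocks: size/blockSize - len(dirty)}, nil
}

func main() {
	// One used address in each of the four /28 blocks of a /26.
	used := []string{"10.0.0.1", "10.0.0.17", "10.0.0.33", "10.0.0.49"}
	u, err := analyze("10.0.0.0/26", used)
	if err != nil {
		panic(err)
	}
	fmt.Printf("free IPs: %d, free /28 blocks: %d, fragmented: %t\n",
		u.FreeIPs, u.FreeBlocks, u.Fragmented())
}
```

In this example 60 of 64 addresses are free, yet no /28 can be carved out, which is the case an error message would need to distinguish from plain exhaustion.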
Enhance IP allocation error diagnostics with detailed ENI and fragmen…
…ailures

@jaydeokar Thank you for the excellent feedback! I've updated the implementation to address your concerns:

🔄 **Changes Made:**
- **Replaced verbose logging** with structured Kubernetes events
- **Events are emitted on nodes** when IP allocation fails - users can see them via `kubectl describe node`
- **Detailed diagnostic information** including:
  - Specific failure reason (subnet exhaustion, ENI limits, fragmentation)
  - Available IP counts and subnet details
  - Actionable guidance for operators
- **Reduced log verbosity** - keeping only essential debug information

🎯 **Result:**
- Users now get clear, actionable information about why their pods can't get IPs
- The "why" is surfaced through Kubernetes events instead of buried in logs
- Operators can quickly identify if it's fragmentation, ENI limits, or genuine capacity issues

This addresses the core issue you raised about providing meaningful diagnostics to users rather than just verbose logs for debugging. The gRPC handler now gets structured error information while users get events they can actually act on.

Ready for re-review! 🚀
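A rough sketch of how a failure reason might be chosen and turned into event fields is shown below. Everything here is an assumption for illustration: `poolState`, `classify`, `buildEvent`, the reason strings, and the order of the checks are hypothetical, and `nodeEvent` only mirrors the type/reason/message shape of a Kubernetes event instead of using client-go.

```go
package main

import "fmt"

// failureReason is a hypothetical set of reason strings for this sketch.
type failureReason string

const (
	reasonSubnetExhausted failureReason = "SubnetExhausted"
	reasonFragmentation   failureReason = "SubnetFragmented"
	reasonENILimitReached failureReason = "ENILimitReached"
	reasonUnknown         failureReason = "Unknown"
)

// poolState is a hypothetical snapshot taken when an allocation fails.
type poolState struct {
	FreeSubnetIPs    int  // free addresses left in the subnet
	FreePrefixes     int  // free /28 blocks, relevant with prefix delegation
	PrefixDelegation bool // whether prefix delegation is enabled
	AttachedENIs     int
	ENILimit         int
}

// nodeEvent mirrors the type, reason, and message fields of a Kubernetes event.
type nodeEvent struct {
	Type    string
	Reason  string
	Message string
}

// classify picks the most likely cause. The precedence used here
// (exhaustion, then fragmentation, then ENI limit) is an assumption.
func classify(s poolState) failureReason {
	switch {
	case s.FreeSubnetIPs == 0:
		return reasonSubnetExhausted
	case s.PrefixDelegation && s.FreePrefixes == 0:
		return reasonFragmentation
	case s.AttachedENIs >= s.ENILimit:
		return reasonENILimitReached
	default:
		return reasonUnknown
	}
}

// buildEvent renders the snapshot as a warning event for the node.
func buildEvent(s poolState) nodeEvent {
	r := classify(s)
	return nodeEvent{
		Type:   "Warning",
		Reason: string(r),
		Message: fmt.Sprintf(
			"IP allocation failed (%s): %d free IPs in subnet, %d free /28 prefixes, ENIs %d/%d",
			r, s.FreeSubnetIPs, s.FreePrefixes, s.AttachedENIs, s.ENILimit),
	}
}

func main() {
	ev := buildEvent(poolState{
		FreeSubnetIPs: 40, FreePrefixes: 0, PrefixDelegation: true,
		AttachedENIs: 2, ENILimit: 4,
	})
	fmt.Printf("%s %s: %s\n", ev.Type, ev.Reason, ev.Message)
}
```

Separating classification from rendering keeps the reason available as structured data for the gRPC handler, while the rendered message is what an operator would read in the node's event list.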
@labria hey, can you please review my PR?
@labria sorry for pinging you here, but we are trying to debug high latency when allocating ENIs and would really like this PR to extend our debugging options. Would you be able to review this?
@labria sure, I will look into it.
	rcv1alpha1 "github.com/aws/amazon-vpc-resource-controller-k8s/apis/vpcresources/v1alpha1"
)

// Add these type definitions right after the imports, before the package comment:
Can you remove these comments, which look like they were added by an LLM?
// The package ipamd is a long running daemon which manages a warm pool of available IP addresses.
// It also monitors the size of the pool, dynamically allocates more ENIs when the pool size goes below
// the minimum threshold and frees them back when the pool size goes above max threshold.
Duplicate comment here
Summary
Replaces verbose logging with structured Kubernetes events for IP allocation failures, providing actionable diagnostics to users and operators.
What type of PR is this?
enhancement
Changes Made
`kubectl describe node`
Why this change?
Addresses reviewer feedback on PR #3429. Instead of burying diagnostic information in verbose logs, this surfaces the "why" of IP allocation failures through Kubernetes events that users can actually see and act upon.
Testing
`kubectl describe node` output
Result
Closes #3429 (replaces previous implementation)