feat(config): ipv6 compatibility for otel collector pods #3949
Conversation
IPv6 compatibility fixes and UTs for the same.
```diff
@@ -125,7 +129,7 @@ receivers:
       regex: (.+)
     - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
       action: replace
-      regex: ([^:]+)(?::\d+)?;(\d+)
+      regex: (\[.*?\]|[^:;\[]+)(?::\d+)?;(\d+)
```
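For illustration, here's a small runnable Go sketch of why the old pattern fails on bracketed IPv6 addresses (the sample inputs are made up; Prometheus fully anchors relabel regexes, which is emulated here with `^(?:...)$`):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Prometheus joins source label values with ";", so each input stands
	// for: <__address__>;<prometheus.io/port annotation>.
	oldRe := regexp.MustCompile(`^(?:([^:]+)(?::\d+)?;(\d+))$`)
	newRe := regexp.MustCompile(`^(?:(\[.*?\]|[^:;\[]+)(?::\d+)?;(\d+))$`)

	for _, in := range []string{
		"10.0.0.1:8080;9100",     // IPv4 address
		"[fd00:10::5]:8080;9100", // bracketed IPv6 address
	} {
		fmt.Printf("%-25s old=%-5v new=%v\n", in,
			oldRe.MatchString(in), // old regex only matches the IPv4 case
			newRe.MatchString(in)) // new regex matches both
	}
}
```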
Might make sense to have an integration test just for metrics collection because of this change.
We already have a test for checking metrics scraped by the pod-annotations job here.
This is included in the Helm_Default_OT test, where default metrics are ensured by setting up an nginx pod with prometheus.io/scrape: true, which the pod-annotations job then scrapes. So the existing functionality of the regex scraping pods with IPv4 addresses is already validated in the existing ITs.
For IPv6 mode, we need to set up the cluster itself in IPv6 single-stack mode, which would be a major change to the ITs. Once that is done, all tests can be run in IPv6 mode too, without introducing new tests specific to IPv6.
@rnishtala-sumo
Kind supports IPv6 clusters via its configuration file. Here's where the current kind config is defined - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/tests/integration/yamls/cluster.yaml
> For IPv6 mode, we need to set up the cluster itself in IPv6 single-stack mode, which would be a major change to the ITs.

We could define a cluster config for IPv6 and use it conditionally for one test.
```
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ Error: Failed to query available provider packages
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ Could not retrieve the list of available versions for provider
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ sumologic/sumologic: could not connect to registry.terraform.io: failed to
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ request discovery document: Get
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ "https://registry.terraform.io/.well-known/terraform.json": dial tcp:
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ lookup registry.terraform.io on [fd00:10:96::a]:53: server misbehaving
```
Tried creating the IPv6 cluster. It's created fine, but we need to set up the test-running machine to be able to access the internet over IPv6, as mentioned here.
It's not simple to set up a proper IPv6 cluster with networking in the existing ITs. I think setting up IPv6-based ITs should be a proper, separate task in itself.
- If we want to be safe, we can use the new regex only when sumologic.ipv6mode is enabled; otherwise we can keep the existing regex.
IMO we need either ITs or E2Es before rolling this feature out. We're not using AWS for running the ITs.
Might be worth looking at this - https://github.com/kubernetes-sigs/nat64
> Creating the IPv6 test setup itself requires a lot of changes and, IMO, should be a separate task by itself.

Then I recommend adding Vagrant documentation with test instructions to set up an IPv6 cluster for local testing.
Added an IT for IPv6. @rnishtala-sumo
I've reviewed the changes and appreciate the effort. I think we should consider the limitation we're running into more carefully before rolling this out.
Here's what we could do:
- Remove the IT from this PR and the changes around it.
- Add test instructions to our Vagrant doc for setting up an IPv6 cluster and deploying the Helm chart, so someone from the team could use those instructions to test.
@rnishtala-sumo
ITs with setup disabled are helping us test all the features with IPv6.
Even when we add test steps in Vagrant, they would involve disabling the setup job and then installing the Helm chart, since configuring NAT64 would require substantial effort and analysis to implement.
This disableSetupJob is just in the ITs, and it's better than not having ITs at all, right? I don't understand the issue with disabling the setup job for the IPv6 tests alone, as long as it gives us the ability to test functionality.
All tests install/deploy the Helm chart. I'd rather be sure that there aren't issues with the installation, including setup, on an IPv6 cluster than turn it off. This is a prerequisite for collecting any signal.
My recommendation is to ensure that there are no installation issues (including setup), even if this is manual (documented) to start with.
We need a changelog for this. The approach makes sense to me. Let's ensure that the integration tests pass. It might also make sense to write an integration test for metrics collection.
https://github.com/SumoLogic/sumologic-kubernetes-collection/actions/runs/15817837079/job/44580109359?pr=3949 - I've checked in other PRs too; this test is failing/passing randomly.
Requesting ITs for this change because of its large footprint.
```diff
@@ -8,5 +8,6 @@
   "MD024": false, // no-duplicate-heading/no-duplicate-header
   "MD033": false, // no-inline-html
   "MD034": false, // no-bare-urls
-  "MD040": false // fenced-code-language
+  "MD040": false, // fenced-code-language
+  "MD059": false // disable checks for descriptive link text
```
This exclusion was added since we have changes in the README. MD059 seems to be a newer validation added in recent lint checks that requires descriptive link text, and we use "link" as the text in many existing places. So, skipping this rule.
```diff
@@ -28,6 +28,7 @@ The following table lists the configurable parameters of the Sumo Logic chart an
 | `nameOverride` | Used to override the Chart name. | `Nil` |
 | `fullnameOverride` | Used to override the chart's full name. Names longer than 22 characters will be truncated. | `Nil` |
 | `namespaceOverride` | Used to override the chart's default target namespace. | `Nil` |
+| `sumologic.ipv6mode` | If enabled, configures the chart to be compatible with IPv6 single-stack clusters. | `false` |
```
All the changes made here are behind this feature flag only.
```yaml
        regex: (.*);(.*)
        replacement: "[$1]:$2"
        target_label: __address__
{{- else }}
```
Constructing the address from the pod IP and the Prometheus port, since the old flow, which we use below for IPv4, gives different address values on IPv6-only clusters. So it's a better approach to construct the address from podIP (see the sketch below).
Example formats observed in the `__address__` value:

- ipv6:port
- ipv6
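As a runnable illustration of the rule above (assuming its source labels are the pod IP and the prometheus.io/port annotation; the IP value is made up):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Prometheus joins the source label values with ";" before matching,
	// so this input stands for: <pod IP>;<prometheus.io/port annotation>.
	joined := "fd00:10:244::5;9100"

	// Same pattern and replacement as in the relabel rule above.
	relabel := regexp.MustCompile(`(.*);(.*)`)
	addr := relabel.ReplaceAllString(joined, "[$1]:$2")

	fmt.Println(addr) // [fd00:10:244::5]:9100 -- a valid IPv6 host:port
}
```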
```diff
@@ -116,13 +117,15 @@ func GetMetricsK8sattributes(expectedMetrics []string, metricsCollector MetricsC
 	Feature()
 }

-func GetMetricsFeature(expectedMetrics []string, metricsCollector MetricsCollector) features.Feature {
+// addKubeVersionSpecificMetrics - adds metrics that are specific to the Kubernetes version to expectedMetrics
```
The existing WaitUntilExpectedMetricsPresent() method internally adds some CoreDNS metrics to expectedMetrics based on the k8s version. This flag is used to check only the values passed in the expectedMetrics list and to skip adding any additional metrics for validation.
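A minimal, self-contained sketch of that gating; the helper and metric names below are illustrative stubs, not the actual test-framework API:

```go
package main

import "fmt"

// getVersionDependentMetrics stands in for the helper that returns
// k8s-version-specific metrics (e.g. CoreDNS ones).
func getVersionDependentMetrics() []string {
	return []string{"coredns_cache_hits_total"}
}

// expected returns the final list of metrics a test waits for.
func expected(base []string, addKubeVersionSpecificMetrics bool) []string {
	if addKubeVersionSpecificMetrics {
		// Only append version-dependent metrics when asked to, so an
		// IPv6 test can validate exactly the list it passed in.
		base = append(base, getVersionDependentMetrics()...)
	}
	return base
}

func main() {
	fmt.Println(expected([]string{"up"}, true))  // [up coredns_cache_hits_total]
	fmt.Println(expected([]string{"up"}, false)) // [up]
}
```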
Let's not do this. Instead, there is a way to add coredns_cache_hits_total to an optional list. The list of expected metrics is maintained in this file - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/v4.13.0/tests/integration/internal/constants.go
```go
	featTraces := GetTracesFeature()

	testenv.Test(t, featInstall, featMetrics, featTelegrafMetrics, featLogs, featMultilineLogs, featEvents, featTraces)
}
```
Test with all the default functionalities, run against IPv6.
```go
// HelmInstallTestOpt picks a values file from `values` directory
	}
	if isIPv6Test {
		beforeFuncs = append(beforeFuncs, stepfuncs.KubectlCreateSumologicSecretOpt())
```
The secret is created manually in IPv6 clusters since we needed to disable the sumologic.setupEnabled flag to prevent installation failure due to the Terraform registry not being reachable. Just "dummy" values are added as secrets; refer to sumologic-secrets.yaml.
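Roughly, the step amounts to something like the following sketch (the file path and namespace are assumptions, not the actual stepfuncs implementation):

```go
package main

import (
	"fmt"
	"os/exec"
)

// createSumologicSecret applies the dummy-credentials secret manifest that
// the setup job would otherwise create.
func createSumologicSecret(namespace string) error {
	out, err := exec.Command("kubectl", "apply",
		"-n", namespace,
		"-f", "yamls/sumologic-secrets.yaml").CombinedOutput()
	fmt.Print(string(out))
	return err
}

func main() {
	if err := createSumologicSecret("sumologic"); err != nil {
		fmt.Println("apply failed:", err)
	}
}
```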
The customer is going to deploy helm charts with the setup job enabled. Have you tried using https://github.com/kubernetes-sigs/nat64? If this isn't working I suggest we add manual setup/test instructions in the vagrant readme to test this locally.
Tried nat64, but it didn't work.
Disabling the setup job to test ipv6 is not preferred. The setup job does more than managing secrets. Instead of implementing this workaround, we should consider adding manual test instructions to our vagrant docs if using dns64 or nat64 is complex.
"github.com/SumoLogic/sumologic-kubernetes-collection/tests/integration/internal/stepfuncs" | ||
) | ||
|
||
func FilterOutIPv6ExcludedMetrics(metrics []string) []string { |
Remove this filter in favor of maintaining the list of expected metrics here - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/v4.13.0/tests/integration/internal/constants.go
```diff
@@ -43,7 +43,7 @@ func Test_Helm_Default_OT_NamespaceOverride(t *testing.T) {

 	featInstall := GetInstallFeature(installChecks)

-	featMetrics := GetMetricsFeature(expectedMetrics, Prometheus)
+	featMetrics := GetMetricsFeature(expectedMetrics, Prometheus, true)
```
Same here, please undo.
```go
	// We can't do it earlier, because we run the tests for different k8s versions
	// and we can't fetch current version earlier
	expectedMetrics = append(expectedMetrics, internal.GetVersionDependentMetrics(t)...)
	if addKubeVersionSpecificMetrics {
```
Shouldn't need this anymore, based on the above comment.
```go
// KubectlCreateSumologicSecretOpt returns a features.Func that will create a secret called "sumologic"
// in the namespace where the chart is installed using the YAML file.
func KubectlCreateSumologicSecretOpt() features.Func {
```
I don't think we should be doing this outside the helm chart installation to test ipv6 support.
```diff
@@ -0,0 +1,24 @@
+# This secret is used for IPv6 clusters, where we disabled the setup job that is responsible for creating the secret.
+# The reason is that GitHub runners are IPv4-only, so setting up NAT64 would be very tricky and tedious, and we also don't need backend URLs for ITs.
```
Disabling the setup job to test ipv6 is not preferred. The setup job does more than managing secrets. Instead of implementing this workaround, we should consider adding manual test instructions to our vagrant docs if using dns64 or nat64 is complex.
Agreed that the setup job does multiple things, and there are multiple tests covering setup job features. But this test focuses specifically on IPv6 communication, and it even helped me catch the issue with the previously proposed regex for the scrape job (pod-annotations) and forced me to fix it. Similarly, I think this test will help us catch many of the issues that come up in IPv6 setups.
In the future, if we succeed in setting up NAT64, we can just remove the secret creation and enable the setup job. Till then, IMO, this internal IPv6 test will help us to a good extent.
Requesting that the IT be removed for now, until we have a solution for the setup job. This can live on a branch. Recommend that our vagrant (developer) docs be updated with instructions on testing this.
Even the Vagrant test steps will be based on disabling the setup job and testing the Helm chart from there. Hope this would be fine for local test steps. @rnishtala-sumo
Collector pods were crashing in IPv6 environments due to the invalid IP address format used for exposing endpoints such as the health check, liveness probe, and OTLP HTTP/gRPC server endpoints.
We construct these endpoints using the podIP env variable, which needs to be enclosed in square brackets. A generic flag was added to configure this for all endpoints (see the sketch below).
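A minimal Go sketch of the underlying formatting issue (the values are made up, and the chart itself does this in Helm templates rather than Go):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	podIP := "fd00:10:244::7" // e.g. the POD_IP downward-API value
	port := "4318"            // e.g. an OTLP/HTTP receiver port

	// Naive concatenation is ambiguous/invalid for IPv6 addresses:
	fmt.Println(podIP + ":" + port) // fd00:10:244::7:4318

	// Bracketing the host is the valid form the collector needs:
	fmt.Println(net.JoinHostPort(podIP, port)) // [fd00:10:244::7]:4318
}
```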
The regex used for extracting the pod IP and port in the pod-annotations scrape config does not support parsing IPv6 addresses. Introduced a method to construct the scrape address from the pod IP and Prometheus port.
Both of the above changes are behind a feature flag: sumologic.ipv6mode.
Detailed summary of the changes: https://docs.google.com/document/d/1aBfne-cN6k9p_Lw3zZ2ZqY8Ho0AHYTzQB-nIpYx--GA
These are the issues faced by a customer when setting up the Helm chart in an IPv6 cluster. These fixes were deployed in the customer's setup using manual overrides via config merge.
Jira: https://sumologic.atlassian.net/browse/OSC-1043
Checklist