Skip to content

feat(config): ipv6 compatibiilty for otel collector pods #3949

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 33 commits into
base: main
Choose a base branch
from

Conversation

jagan2221
Copy link

@jagan2221 jagan2221 commented Jun 20, 2025

  1. Collector pods were crashing in ipv6 env due to invalid ip address format used for exposing endpoints such as health check, liveliness and otlp/http/grpc server endpoints etc.
    We are constructing these endpoints using podIP env variable , need to enclose it with square braces. Added a generic flag to configure this for all endpoints.

  2. Regex used for extracting pod ip and port in pod-annotations scrape config does not support ipv6 parsing.
    Introduced a method to construct scrape address pod ip and prometheus port

Both above changes are behind a feature flag - sumologic.ipv6mode

Detailed summary of the changes: https://docs.google.com/document/d/1aBfne-cN6k9p_Lw3zZ2ZqY8Ho0AHYTzQB-nIpYx--GA
These are the issue faced by a customer when setting up helm chart in ipv6 cluster. These fixes are deployed in customer setup using manual overrides using config merge.
Jira: https://sumologic.atlassian.net/browse/OSC-1043

Checklist

  • Changelog updated or skip changelog label added
  • Documentation updated
  • Template tests added for new features
  • Integration tests added or modified for major features

ipv6 compatibility fixes and UT's for the same
@jagan2221 jagan2221 requested a review from a team as a code owner June 20, 2025 14:30
@@ -125,7 +129,7 @@ receivers:
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
regex: (\[.*?\]|[^:;\[]+)(?::\d+)?;(\d+)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might make sense to have an integration test just for metrics collection because of this change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a test for checking metrics scraped by pod-annotations job here

This is included in Helm_Default_OT test where default metrics are ensured by setting up an nginx pod with prometheus.io/scrape: true and pod-annotations job scrape this pod. We can say existing functionality of the regex scraping pods with ipv4 address in already validated in existing IT.

For ipv6 mode, we need to setup the cluster itself in ipv6 single-stack mode which will be a major change to IT's. Once this is done, all tests can be run in ipv6 mode too without introducing new tests specific for ipv6.
@rnishtala-sumo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind supports IPv6 clusters via its configuration file

Here's where the current kind config is defined - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/tests/integration/yamls/cluster.yaml

For ipv6 mode, we need to setup the cluster itself in ipv6 single-stack mode which will be a major change to IT's

We could define a cluster config for ipv6 and use it conditionally for one test.

Copy link
Author

@jagan2221 jagan2221 Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ Error: Failed to query available provider packages
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ Could not retrieve the list of available versions for provider
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ sumologic/sumologic: could not connect to registry.terraform.io: failed to
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ request discovery document: Get
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ "https://registry.terraform.io/.well-known/terraform.json": dial tcp:
Test_Helm_Default_OT_NamespaceOverride 2025-06-23T17:38:23Z logger.go:67: │ lookup registry.terraform.io on [fd00:10:96::a]:53: server misbehaving

Tried creating the ipv6 cluster. It's created fine, but we need to setup the test running machine to be able to access internet using ipv6 config as mentioned here

It's not very simple to setup a proper ipv6 cluster with networking in existing IT's. I think it should be separate proper task itself to setup ipv6 based IT's.

  • If we want to be safe, we can have new regex only when sumologic.ipv6mode enabled, else we can use the existing old regex.

@rnishtala-sumo

Copy link
Contributor

@rnishtala-sumo rnishtala-sumo Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we need either ITs or E2Es before rolling this feature out. We're not using AWS for running the ITs.

Might be worth looking at this - https://github.com/kubernetes-sigs/nat64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creating of IPv6 test setup itself is taking lot of changes and IMO, it should be separate task by itself

Then I recommend adding vagrant documentation with test instructions to setup an ipv6 cluster for local testing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added IT for ipv6. @rnishtala-sumo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the changes and appreciate the effort. I think we should think about the limitation we're running into more carefully before rolling this out.

Here's what we could do -

  • Remove the IT from this PR and changes around that.
  • Add test instructions to our vagrant doc to setup an ipv6 cluster and deploy the helm chart, so someone from the team could use those instructions to test.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rnishtala-sumo
IT's with setup disabled is helping us test out all the features with ipv6.
Even when we add test steps in vagrant, it would be disabling the setup job and install the helm chart since configuring NAT64 will require substantial effort and analysis to implement.

This disableSetupJob is just in IT's and better than not having IT's at all right? I don't understand the issue with disabling the setupjob for ipv6 tests alone as long as it gives us feasibility to test functionalities.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All tests install/deploy the helm chart. I'd rather be sure that there aren't issues with the installation including setup on an ipv6 cluster than turn it off. This is a pre-requisite for collecting any signal.

My recommendation is to ensure that there are no installation issues (including setup), even if this is manual (documented) to start with.

@rnishtala-sumo
Copy link
Contributor

We need a changelog for this. The approach makes sense to me. Lets ensure that the integration tests pass. It might also make sense to write an integration test for metrics collection.

@jagan2221
Copy link
Author

jagan2221 commented Jun 23, 2025

https://github.com/SumoLogic/sumologic-kubernetes-collection/actions/runs/15817837079/job/44580109359?pr=3949
Something wrong with Helm_Routing_OT test - it was successful in above run for the same PR, but failing in other runs.

I've checked in other PR's too, this test is failing/passing randomly.

@jagan2221 jagan2221 requested a review from rnishtala-sumo June 23, 2025 14:49
Copy link
Contributor

@rnishtala-sumo rnishtala-sumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting ITs for this change because of its large footprint.

@@ -8,5 +8,6 @@
"MD024": false, // no-duplicate-heading/no-duplicate-header
"MD033": false, // no-inline-html
"MD034": false, // no-bare-urls
"MD040": false // fenced-code-language
"MD040": false, // fenced-code-language
"MD059": false // disable checks for descriptive link text
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exclusion added since we have changes in readme, seems a newer validation added in recent lint checks to have only descriptive link texts, we have the name as "link" in many existing places. So, skipping this rule

@@ -28,6 +28,7 @@ The following table lists the configurable parameters of the Sumo Logic chart an
| `nameOverride` | Used to override the Chart name. | `Nil` |
| `fullnameOverride` | Used to override the chart's full name. Names longer than 22 characters will be truncated. | `Nil` |
| `namespaceOverride` | Used to override the chart's default target namepace. | `Nil` |
| `sumologic.ipv6mode` | If enabled, configures the chart to be compatible with ipv6 single stack clusters | `false` |
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes done here are behind this feature flag only

regex: (.*);(.*)
replacement: "[$1]:$2"
target_label: __address__
{{- else }}
Copy link
Author

@jagan2221 jagan2221 Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constructing address from podip and prometheus port, since old flow which we use below for ipv4 gives different address values for ipv6 only clusters.
So, it's a better approach to construct this from podIP.
Ex. formats observed in address value

  1. ipv6:port
  2. ipv6

@@ -116,13 +117,15 @@ func GetMetricsK8sattributes(expectedMetrics []string, metricsCollector MetricsC
Feature()
}

func GetMetricsFeature(expectedMetrics []string, metricsCollector MetricsCollector) features.Feature {
// addKubeVersionSpecificMetrics - adds metrics that are specific to the Kubernetes version to expectedMetrics
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing WaitUntilExpectedMetricsPresent() method is adding some core-dns metrics to expectedMetrics internally based on k8's version. Using this flag to check only the values passed in expectedMetrics list and skip adding any additional metrics for validation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not do this. Instead, there is a way to add coredns_cache_hits_total to an optional list. The list of expected metrics is maintained in this file - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/v4.13.0/tests/integration/internal/constants.go

featTraces := GetTracesFeature()

testenv.Test(t, featInstall, featMetrics, featTelegrafMetrics, featLogs, featMultilineLogs, featEvents, featTraces)
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test will all default functionalities tested for IPv6

// HelmInstallTestOpt picks a values file from `values` directory
}
if isIPv6Test {
beforeFuncs = append(beforeFuncs, stepfuncs.KubectlCreateSumologicSecretOpt())
Copy link
Author

@jagan2221 jagan2221 Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Secret created manually in ipv6 clusters since we needed to disabled sumologic.setupEnabled flag to prevent installation failure due to terraform registry not reachable. Just "dummy" values added as secrets, refer sumologic-secrets.yaml

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The customer is going to deploy helm charts with the setup job enabled. Have you tried using https://github.com/kubernetes-sigs/nat64? If this isn't working I suggest we add manual setup/test instructions in the vagrant readme to test this locally.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried nat64, but didn't work

Copy link
Contributor

@rnishtala-sumo rnishtala-sumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling the setup job to test ipv6 is not preferred. The setup job does more than managing secrets. Instead of implementing this workaround, we should consider adding manual test instructions to our vagrant docs if using dns64 or nat64 is complex.

@@ -116,13 +117,15 @@ func GetMetricsK8sattributes(expectedMetrics []string, metricsCollector MetricsC
Feature()
}

func GetMetricsFeature(expectedMetrics []string, metricsCollector MetricsCollector) features.Feature {
// addKubeVersionSpecificMetrics - adds metrics that are specific to the Kubernetes version to expectedMetrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not do this. Instead, there is a way to add coredns_cache_hits_total to an optional list. The list of expected metrics is maintained in this file - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/v4.13.0/tests/integration/internal/constants.go

"github.com/SumoLogic/sumologic-kubernetes-collection/tests/integration/internal/stepfuncs"
)

func FilterOutIPv6ExcludedMetrics(metrics []string) []string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this filter in favor of maintaining the list of expected metrics here - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/v4.13.0/tests/integration/internal/constants.go

@@ -43,7 +43,7 @@ func Test_Helm_Default_OT_NamespaceOverride(t *testing.T) {

featInstall := GetInstallFeature(installChecks)

featMetrics := GetMetricsFeature(expectedMetrics, Prometheus)
featMetrics := GetMetricsFeature(expectedMetrics, Prometheus, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, please undo

// We can't do it earlier, because we run the tests for different k8s versions
// and we can't fetch current version earlier
expectedMetrics = append(expectedMetrics, internal.GetVersionDependentMetrics(t)...)
if addKubeVersionSpecificMetrics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't need this anymore based on the above comment


// KubectlCreateSumologicSecretOpt returns a features.Func that will create a secret called "sumologic"
// in the namespace where the chart is installed using the YAML file.
func KubectlCreateSumologicSecretOpt() features.Func {
Copy link
Contributor

@rnishtala-sumo rnishtala-sumo Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be doing this outside the helm chart installation to test ipv6 support.

// HelmInstallTestOpt picks a values file from `values` directory
}
if isIPv6Test {
beforeFuncs = append(beforeFuncs, stepfuncs.KubectlCreateSumologicSecretOpt())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The customer is going to deploy helm charts with the setup job enabled. Have you tried using https://github.com/kubernetes-sigs/nat64? If this isn't working I suggest we add manual setup/test instructions in the vagrant readme to test this locally.

@@ -0,0 +1,24 @@
# This secret is used for ipv6 clusters where we disabled setup job which is responsible for creating the secret.
# Reason is the limitation of Github runners are ipv4 only. So setting up NAT64 would be very tricky and tedious and we also don't need backend url's for IT's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling the setup job to test ipv6 is not preferred. The setup job does more than managing secrets. Instead of implementing this workaround, we should consider adding manual test instructions to our vagrant docs if using dns64 or nat64 is complex.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that setupjob could do multiple things. But there are multiple tests to test setup job features, but this test focuses on specifically IPv6 communication and helped me even catch the issue with previously proposed regex for scrape job(pod-annotations) and to forced me to fix it. Similarly i think this test will help us capture much issues coming in ipv6 setups.

In future , if we succeed settingup nat64, we can just remove secretcreation and enable setup job. Till then, IMO, this internal ipv6 test will help us to a good extent.

@jagan2221 jagan2221 requested a review from rnishtala-sumo July 3, 2025 16:06
Copy link
Contributor

@rnishtala-sumo rnishtala-sumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting that the IT be removed for now, until we have a solution for the setup job. This can live on a branch. Recommend that our vagrant (developer) docs be updated with instructions on testing this.

@jagan2221
Copy link
Author

Even vagrant test steps will be based on disabling setupJob and test helm chart further. Hope this would be fine for local test steps @rnishtala-sumo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants