
Conversation

@zacharycmontoya
Contributor

Motivation

The Test_Config_RuntimeMetrics_Enabled test cases have been flaky

Changes

Add retries to the get_runtime_metrics helper totaling over 10s of waiting, which is the standard runtime metrics reporting interval. This means the "runtime metrics disabled" tests will run for at least 10s, but it should make the "runtime metrics enabled" tests more reliable.
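
For context, the retried helper is roughly shaped like the sketch below. Only the get_runtime_metrics name, the retry loop, and interfaces.agent.get_metrics() appear in the change; the gauge filter, the attempt count/delay, and the return value are assumptions rather than the exact implementation:

import time

from utils import interfaces  # system-tests interface helpers (assumed import path)


def get_runtime_metrics(retry: int = 10, delay_s: float = 1.0) -> list:
    # Poll the metrics captured from the agent, retrying until runtime metrics
    # gauges show up or ~10s (one reporting interval) has elapsed.
    for attempt in range(retry):
        runtime_metrics_gauges = [
            metric
            for _, metric in interfaces.agent.get_metrics()
            if metric.get("type") == "gauge"  # placeholder filter, not the real predicate
        ]
        if runtime_metrics_gauges:
            return runtime_metrics_gauges
        if attempt < retry - 1:
            time.sleep(delay_s)
    return []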

Workflow

  1. ⚠️ Create your PR as draft ⚠️
  2. Work on your PR until the CI passes
  3. Mark it as ready for review
    • Test logic is modified? -> Get a review from the RFC owner.
    • Framework is modified, or used in a non-obvious way? -> Get a review from the R&P team

🚀 Once your PR is reviewed and the CI is green, you can merge it!

🛟 #apm-shared-testing 🛟

Reviewer checklist

  • If PR title starts with [<language>], double-check that only <language> is impacted by the change
  • No system-tests internals are modified. Otherwise, I have approval from the R&P team
  • A docker base image is modified?
    • the relevant build-XXX-image label is present
  • A scenario is added (or removed)?

@github-actions
Contributor

CODEOWNERS have been resolved as:

tests/test_config_consistency.py                                        @DataDog/system-tests-core

@zacharycmontoya zacharycmontoya marked this pull request as ready for review November 18, 2025 21:29
@zacharycmontoya zacharycmontoya requested a review from a team as a code owner November 18, 2025 21:29
for _ in range(retry):
    runtime_metrics_gauges = [
        metric
        for _, metric in interfaces.agent.get_metrics()
Collaborator

Unless I've missed a point, when this code is executed the agent container is already down and the data are loaded from files on the filesystem, so adding a retry here should not change anything.

The proper way to harden this test is to:

  1. get from the agent team the timeout beyond which missing data is considered a bug
  2. then add a call to interfaces.agent.wait_for() in the setup function, using this timeout as an argument.

wait_for takes two arguments:

  1. a callable that takes a data object (a payload sent by the agent) and returns true to stop waiting
  2. a timeout (in seconds)

wait_for does not return anything, nor does it fail. It's the responsibility of the callable to store whatever is needed for the assertion.

A typical pattern would be:

def setup(self):
    self.metric_found = False

    def is_metric(data):
        self.metric_found = data["type"] == "metric"
        return self.metric_found

    interfaces.agent.wait_for(is_metric, 10)

def test(self):
    assert self.metric_found
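
Adapted to this runtime-metrics test, the pattern could look roughly like the following sketch; wait_for(callable, timeout) is the only piece taken from the framework as described above, while the method names, the payload keys, and the 10s timeout are assumptions:

def setup_runtime_metrics_enabled(self):
    self.runtime_metrics_gauges = []

    def save_runtime_metrics(data):
        # Store whatever the assertion needs; the "type"/"metrics" keys are
        # assumptions about the payload shape, not the real agent schema.
        if data.get("type") == "metric":
            self.runtime_metrics_gauges.extend(data.get("metrics", []))
        return bool(self.runtime_metrics_gauges)

    # Placeholder 10s timeout, pending the value agreed with the agent team.
    interfaces.agent.wait_for(save_runtime_metrics, 10)

def test_runtime_metrics_enabled(self):
    assert self.runtime_metrics_gauges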

Feel free to DM me if anything is not clear enough :)
