
Conversation

@alexandraoberaigner (Contributor) commented Dec 1, 2025

This PR

adds configurable retry backoff options to the flagd in-process provider, allowing users to configure the retry policy of gRPC connections.

aims to make the implementation consistent with the Java provider. See this change, which prevents tight loops when the retry policy doesn't take effect.

Related Issues

Details

Configuration and API Enhancements:

  • Added RetryBackoffMs and RetryBackoffMaxMs to ProviderConfiguration, with corresponding default values, environment variable support, and provider options (WithRetryBackoffMs, WithRetryBackoffMaxMs). These allow users to configure retry backoff durations for connection retries (see the sketch after this list). [1] [2] [3] [4] [5] [6]
  • Updated the in-process sync and service layers to accept and use the new retry backoff configuration values, passing them through all relevant constructors and methods. [1] [2] [3] [4]
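
As a concrete illustration, here is a minimal sketch of the functional-option pattern the list above describes. The option names and the WithRetryBackoffMs signature follow the PR, but the package name, the ProviderOption/ProviderConfiguration shapes, and the field layout are assumptions for illustration, not the provider's actual code.

```go
// Illustrative sketch only; not the provider's real package layout.
package flagdsketch

// ProviderConfiguration holds the (assumed) retry backoff settings.
type ProviderConfiguration struct {
	RetryBackoffMs    int // initial backoff between connection retries, in milliseconds
	RetryBackoffMaxMs int // upper bound for the retry backoff, in milliseconds
}

// ProviderOption mutates the configuration; the real provider may use a
// different underlying type.
type ProviderOption func(*ProviderConfiguration)

// WithRetryBackoffMs sets the initial backoff duration (in milliseconds)
// for retrying failed connections.
func WithRetryBackoffMs(retryBackoffMs int) ProviderOption {
	return func(cfg *ProviderConfiguration) {
		cfg.RetryBackoffMs = retryBackoffMs
	}
}

// WithRetryBackoffMaxMs sets the maximum backoff duration (in milliseconds)
// for retrying failed connections.
func WithRetryBackoffMaxMs(retryBackoffMaxMs int) ProviderOption {
	return func(cfg *ProviderConfiguration) {
		cfg.RetryBackoffMaxMs = retryBackoffMaxMs
	}
}
```

An application would then pass these options alongside the existing provider options when constructing the in-process provider, e.g. WithRetryBackoffMs(500) and WithRetryBackoffMaxMs(2000).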

Retry Logic and Policy:

  • Modified the gRPC retry policy builder to use the configured backoff values for InitialBackoff and MaxBackoff instead of hardcoded durations, making retry behavior fully configurable (a sketch follows this list).
  • Updated the sync retry logic to log and sleep for the configured maximum backoff duration after a failed sync cycle, improving observability and control.
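
To make the first bullet concrete, below is a hedged sketch of how a retry policy builder could assemble the gRPC service-config JSON from the configured values. The initialBackoff/maxBackoff fields and the overall service-config shape are standard gRPC; the service name, attempt count, multiplier, and status codes here are placeholders, not necessarily what the provider uses.

```go
package flagdsketch

import "fmt"

// buildRetryPolicy renders a gRPC service-config JSON string using the
// configured backoff values (given in milliseconds). Everything except
// the initialBackoff/maxBackoff wiring is a placeholder for illustration.
func buildRetryPolicy(retryBackoffMs, retryBackoffMaxMs int) string {
	return fmt.Sprintf(`{
  "methodConfig": [{
    "name": [{"service": "flagd.sync.v1.FlagSyncService"}],
    "retryPolicy": {
      "maxAttempts": 3,
      "initialBackoff": "%.1fs",
      "maxBackoff": "%.1fs",
      "backoffMultiplier": 2,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}`, float64(retryBackoffMs)/1000, float64(retryBackoffMaxMs)/1000)
}
```

Such a policy would typically be applied to the connection via the grpc.WithDefaultServiceConfig dial option.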

Testing and Validation:

  • Added a new unit test (TestBuildRetryPolicy) for the retry policy builder, verifying that the generated JSON policy reflects the configured backoff values and other retry parameters (see the sketch after this list).
  • Updated integration and unit tests to include the new retry backoff options, ensuring end-to-end coverage. [1] [2] [3] [4] [5]
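
For reference, a test along these lines could look like the sketch below. It assumes the illustrative buildRetryPolicy helper from the previous sketch; the actual test in the PR may structure its assertions differently.

```go
package flagdsketch

import (
	"encoding/json"
	"strings"
	"testing"
)

// TestBuildRetryPolicy checks that the generated policy is valid JSON and
// reflects the configured backoff values (sketch only).
func TestBuildRetryPolicy(t *testing.T) {
	policy := buildRetryPolicy(500, 2000)

	// gRPC rejects service configs that are not valid JSON.
	var parsed map[string]any
	if err := json.Unmarshal([]byte(policy), &parsed); err != nil {
		t.Fatalf("retry policy is not valid JSON: %v", err)
	}

	// The configured values must show up as gRPC duration strings.
	if !strings.Contains(policy, `"initialBackoff": "0.5s"`) {
		t.Errorf("policy does not contain configured initial backoff: %s", policy)
	}
	if !strings.Contains(policy, `"maxBackoff": "2.0s"`) {
		t.Errorf("policy does not contain configured max backoff: %s", policy)
	}
}
```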

@alexandraoberaigner alexandraoberaigner marked this pull request as ready for review December 9, 2025 14:17
@alexandraoberaigner alexandraoberaigner requested review from a team as code owners December 9, 2025 14:17

@aepfli (Member) left a comment

I am still not the biggest fan of a sleep, as this will block everything. Even if we do some kind of context cancellation and want to stop execution, we are stopped by a pretty long sleep. Hence I think we should really try to use a different construct than a sleep.

The rest of the pull request looks good to me, and is ready to merge.

Disclaimer: For me the sleep is a reason not to approve this, but not a big enough reason to block this pull request from getting merged, if the rest of the approvers are fine with it.

}

// WithRetryBackoffMs sets the initial backoff duration (in milliseconds) for retrying failed connections
func WithRetryBackoffMs(retryBackoffMs int) ProviderOption {
A contributor commented:

A parameter that represents a duration should be of type time.Duration. The name of the parameter should be changed to remove any reference to a specific unit like milliseconds.
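
A minimal sketch of what that suggestion could look like; the option name, the RetryBackoff field, and the stand-in types are illustrative assumptions, not code from this PR:

```go
package optionsketch

import "time"

// Illustrative stand-ins for the provider's configuration and option types.
type ProviderConfiguration struct {
	RetryBackoff time.Duration // initial backoff between connection retries
}

type ProviderOption func(*ProviderConfiguration)

// WithRetryBackoff sets the initial backoff for retrying failed connections
// as a time.Duration, with no unit baked into the name.
func WithRetryBackoff(retryBackoff time.Duration) ProviderOption {
	return func(cfg *ProviderConfiguration) {
		cfg.RetryBackoff = retryBackoff
	}
}
```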

}
}

cfg.RetryGracePeriod = getIntFromEnvVarOrDefault(flagdGracePeriodVariableName, defaultGracePeriod, cfg.log)
A contributor commented:

Use time.ParseDuration to parse durations.
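
A small sketch of how that could look, using time.ParseDuration with a fallback to a default; the helper name and error handling are illustrative, not the provider's actual code:

```go
package envsketch

import (
	"os"
	"time"
)

// durationFromEnvOrDefault reads an environment variable as a Go duration
// string (e.g. "500ms", "5s") and falls back to the default when the
// variable is unset or malformed.
func durationFromEnvOrDefault(name string, fallback time.Duration) time.Duration {
	raw, ok := os.LookupEnv(name)
	if !ok {
		return fallback
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return fallback
	}
	return d
}
```

With this shape, the grace-period line above would become something like cfg.RetryGracePeriod = durationFromEnvOrDefault(flagdGracePeriodVariableName, defaultGracePeriod), with RetryGracePeriod typed as time.Duration (again, an assumed shape).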

@toddbaert (Member) commented Dec 10, 2025

> I am still not the biggest fan of a sleep, as this will block everything. Even if we do some kind of context cancellation and want to stop execution, we are stopped by a pretty long sleep. Hence I think we should really try to use a different construct than a sleep.
>
> The rest of the pull request looks good to me, and is ready to merge.
>
> Disclaimer: For me the sleep is a reason not to approve this, but not a big enough reason to block this pull request from getting merged, if the rest of the approvers are fine with it.

Ya, personally I'm fine with it. We already have the same pattern in Java, so I would go with it for now. This sleep only happens during what's already an unusual error scenario (the connection is healthy, but the stream keeps returning errors immediately), within a retry loop, so I think it's acceptable... but it looks like @alexandraoberaigner already used a timer.
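
For reference, a cancellable wait is typically built from a timer and a select on the context rather than time.Sleep; a minimal sketch (not the PR's exact code):

```go
package backoffsketch

import (
	"context"
	"time"
)

// waitWithBackoff blocks for the given backoff, but returns early with the
// context's error as soon as the context is cancelled, so shutdown is not
// held up for the full backoff duration.
func waitWithBackoff(ctx context.Context, backoff time.Duration) error {
	timer := time.NewTimer(backoff)
	defer timer.Stop()

	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		return nil
	}
}
```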

@toddbaert (Member) left a comment

I added a recommended comment explaining why we are doing our own simple backoff for unusual situations. I also agree with @sahidvelji's duration comment.

Otherwise LGTM.

Co-authored-by: Todd Baert <[email protected]>
Signed-off-by: alexandraoberaigner <[email protected]>