Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider the ability to opt-out from automatic functionality such as automatic provider ready events. #265

Open
kinyoklion opened this issue Jun 13, 2024 · 16 comments

Comments

@kinyoklion
Copy link
Member

For providers which support changing their provider status at runtime the automatic events introduce additional complexity.

The provider needs to defer its handling of the status until after the automatic status provided by the OpenFeature API. Otherwise it would emit duplicate provider ready or error events. The handling can be somewhat complex for multi-threaded providers and SDKs.

It would be nice if the provider had meta-data, or other fields, which could allow for opting out of the automatic behavior. This may then be extended to other automatic functionality in the future.

Related discussion: open-feature/dotnet-sdk#276 (comment)

@toddbaert
Copy link
Member

I feel conflicted about this.

On the one hand, I see the value and the simplicity that can be gained here for underlying SDKs that emit events. On the other hand we're adding more complexity to the SDK, and cognitive load to provider authors.

One other question is I have is if this setting would also apply to if the init throws. Would we have some other indicator for that to prevent the same issue for duplicate error events?

@kinyoklion
Copy link
Member Author

cognitive load to provider authors

I am not sure if this is true or not. I recently updated the LD node, python, java, and dotnet providers to handle this behavior and handling the mixed requirements of runtime events, initialization events, and the blocking behavior is by far the largest cognitive load in any of these provider implementations (aside from the node one, where we happen to have an underlying mechanism that matches well), and also very easy to get incorrect.

This optionality may or may not be the correct way to reduce the complexity. Is optimizing for the simple case the correct approach, or is it likely that this increases complexity more often than it decreases it?

Ultimately here the event channel is shared. The provider can produce events and the OF SDK can also produce events. There isn't a delineation of the types of events provided by both. So you have to encode into the provider that specific events are emitted automatically at specific points, and that those events also need to be emitted by the provider at different points.

@toddbaert
Copy link
Member

I think I'm nearly convinced. What's your take on this?

One other question is I have is if this setting would also apply to if the init throws. Would we have some other indicator for that to prevent the same issue for duplicate error events?

Do you think we should have 2 settings? 1 for preventing emission of error and 1 for emission of ready? Or can we handle both with one?

Do you think this should be a field on ProviderMetadata? Or should we have a dedicated field on the provider interface itself for it?

cc @lukas-reining since I think some of his recent work with a config-cat provider runs into this exact issue.

@lukas-reining
Copy link
Member

lukas-reining commented Jun 25, 2024

Having this data on the metadata feels pretty complicated to me @toddbaert @kinyoklion.

The initialization function is always async in JS and in other languages often too, or it is considered sync and blocking, so could we just prohibit the Ready event from being emitted by the init method?
In this case we would document that a resolving init function always triggers the automatic dispatch of a ready event.
Emitting it yourself during the init could the be considered a bug. And we could relatively simple (hope to not oversee something) even filter these events out during provider initialization and log that this should not be done.

I think this way we could still have the automatic event and prevent the double events. But I might be overseeing something.

@kinyoklion
Copy link
Member Author

Having this data on the metadata feels pretty complicated to me @toddbaert @kinyoklion.

The initialization function is always async in JS and in other languages often too, or it is considered sync and blocking, so could we just prohibit the Ready event from being emitted by the init method? In this case we would document that a resolving init function always triggers the automatic dispatch of a ready event. Emitting it yourself during the init could the be considered a bug. And we could relatively simple (hope to not oversee something) even filter these events out during provider initialization and log that this should not be done.

I think this way we could still have the automatic event and prevent the double events. But I might be overseeing something.

The problem is that the underlying means of emitting the events is typically independent of knowing when initialization is complete for the underlying SDK, and that many languages have not had good async support for as long as things like JavaScript (or at all).

So you have blocking or non-blocking init (maybe by just setting a short timeout), and then you have a means to listen to events. In which case you have to use those events to orchestrate a return from the init method using threading events or other means (sometimes it can just be a task that gets resolved).

The python one I was using in the original example: https://github.com/launchdarkly/openfeature-python-server/blob/243a8dc515bcd16fe8c7d5a0913fbdfe6e073aa0/ld_openfeature/provider.py#L46

Another, but Java:
https://github.com/launchdarkly/openfeature-java-server/blob/c6b2edff3f1b33cda0117aadde7aa52c67b61dbc/src/main/java/com/launchdarkly/openfeature/serverprovider/Provider.java#L140

It isn't just this behavior, but a number of things combine to make the SDK initialization much more complex than seems reasonable.

Emitting it yourself during the init could the be considered a bug.

When the events being emitted by the underlying SDK are threaded it is a bit tricky to say what happens during init. Which is demonstrated in the approach I took in the Java version.

Some of these things can be addressed by extending the functionality of the underlying vendor SDK. This will probably be what happens with the LaunchDarkly SDKs if the interface remains this way, but it does create additional barriers to adoption.

@lukas-reining
Copy link
Member

lukas-reining commented Jun 27, 2024

So you have blocking or non-blocking init (maybe by just setting a short timeout), and then you have a means to listen to events. In which case you have to use those events to orchestrate a return from the init method using threading events or other means (sometimes it can just be a task that gets resolved).

Sure, but isn't this the typical way of handling it?
Many SDKs have an async or blocking init function. For these, implementing an init function that only resolves when ready or blocks until it is ready, is trivial.
If they only provide events, handlers can be attached before initialization and then the first "READY like" event can be used to resolve/unblock the init function.
Which I would say is what you show in the examples.

Emitting it yourself during the init could the be considered a bug.

Then the events being emitted by the underlying SDK are threaded it is a bit tricky to say what happens during init. Which is demonstrated in the approach I took in the Java version.

If I am not overseeing anything this can be simply handled by only listening to the first READY event of the SDK.
Every event after that could be replayed to the OpenFeatureEventEmitter of the respective provider.

I see the point about errors, but for most languages this might be handled by the OF SDK handling any error returned/thrown by the provider init method. This error can then be added to the ERROR event that a the SDK emits.
Probably I am overseeing something but I think this is all relatively easy to handle without the complexity of more options and "behaviours".

Do you have an example where these do not work @kinyoklion?

@kinyoklion
Copy link
Member Author

kinyoklion commented Jun 27, 2024

Everything can be made to work. So I have no examples where it doesn't work. My problem is with the level of complexity required. If generally this is considered an acceptable level of complexity, then there isn't much to add. (Basically I updated 3 of these in a row, and each one was very fiddly.)

I am having some trouble finding things to compare to because the majority of providers seem to not support init and shutdown or not support events or not support either, or they have not been updated to the latest SDKs. (Or they are JS SDKs which is much less of an issue, though I still need to update our node provider because I have not added runtime provider status, but it is very straightforward in a single-threaded environment.)

I do feel though that there are going to be a number of bugs for providers that do implement each of these functionalities. Because the most intuitive way to implement them, which is hook up an event handler and wait for initialization, isn't correct. In that it works fine, but does result in duplicate events.

@lukas-reining
Copy link
Member

lukas-reining commented Jun 28, 2024

Everything can be made to work. So I have no examples where it doesn't work. My problem is with the level of complexity required. If generally this is considered an acceptable level of complexity, then there isn't much to add. (Basically I updated 3 of these in a row, and each one was very fiddly.)

I have seen in the examples you showed that these are also having the internal provider state still.
In my view, having the "client managed" state and only this reacting to the init function and events after it is relatively low effort.
The providers you showed are doing basically what I have seen in many applications using the SDKs directly.

Or they are JS SDKs which is much less of an issue, though I still need to update our node provider because I have not added runtime provider status, but it is very straightforward in a single-threaded environment.

I am having a hard time with this. JS is still concurrent and depending on what happens in the background, events might be emitted out of order at any time. If it really is a multi-threaded lang and things are happening at the same time we would have to protect it with a mutex or so, but I do not see a real difference between the concurrent and parallel case for this.
But I could totally oversee anything so please correct me if I am overseeing something.

Because the most intuitive way to implement them, which is hook up an event handler and wait for initialization, isn't correct. In that it works fine, but does result in duplicate events.

READY events emitted by the provider before initialization has finished could simply be ignored by the SDK. I think this would be pretty solid.

I fear that if we have those options for opting out of automatic events, there will be many errors because it is just so easy to oversee.

@kinyoklion
Copy link
Member Author

I am having a hard time with this. JS is still concurrent and depending on what happens in the background, events might be emitted out of order at any time.

In a concurrent language you can check a state and register a handler and there is no race condition as long as you do not await as a micro-task cannot be executed between those. In a multi-threaded system they can, and the state that needs locked is internal to the underlying SDK. The event dispatch itself may be many threads.

Which is why you have to introduce additional synchronization.

In regards to de-duplicating the event may be in a thread that has not gotten scheduled. So it actually emits after init. (Assuming the provider didn't use the event to complete the init function.)

So it may be a more general de-duplicate ready when already ready.

@kinyoklion
Copy link
Member Author

The providers you showed are doing basically what I have seen in many applications using the SDKs directly.

The correct way to use the LD Java SDK directly is this:

 final LDClient client = new LDClient(SDK_KEY, config);
boolean flagValue = client.boolVariation(FEATURE_FLAG_KEY, context, false);

It has a blocking constructor that waits up to 10 seconds by default. You can make it longer if you want it to be longer. You can also independently track the current status, in case you need to know when it encounters problems. There is no need to use the status to wait for initialization. If you want non-blocking initialization you can set the time to wait to 0, and then handle things on flag changes or status changes.

I think I am going to change the LaunchDarkly providers back to just using the blocking constructor. I don't think it truly matches the spirit of the setProvider/setProviderAndAwait methods, but I know it is safe and predictable and keeps provider simplicity low.

@lukas-reining
Copy link
Member

lukas-reining commented Jun 29, 2024

In a concurrent language you can check a state and register a handler and there is no race condition as long as you do not await as a micro-task cannot be executed between those.

But the point is that it is most likely that there will be an await happening so race conditions also have to be expected here.

In a multi-threaded system they can, and the state that needs locked is internal to the underlying SDK. The event dispatch itself may be many threads. Which is why you have to introduce additional synchronization.

Yes, but to me it would be fine if the SDK implementation had a critical path between receiving the first READY event and setting the provider state in the client, which can be synced with a mutex or so.
I think shifting the responsibility to the SDK mitigates the problem in this case.

In regards to de-duplicating the event may be in a thread that has not gotten scheduled. So it actually emits after init. (Assuming the provider didn't use the event to complete the init function.)

This is a really good point. Maybe this can not really be solved elegantly.
So we really have a problem if an SDK has a blocking/awaited async init method that is simply called in the provider init, but also provides events.

One potentially could in these cases always filter out READY events that happened before the init finished AND if none happened until the provider init finished, the first ready event of the provider regardless of the provider status.
But maybe there is also an edge case I am overseeing.
What do you think @kinyoklion?

If we can not solve it in i good way, I guess having the metadata options seems okay.
I just don't like having flags changing provider and SDK providers that will definitely be confused or set wrong and also cause bugs.
Maybe in this case we could even consider not having automatic events, but this brings whole new challenges.

I think I am going to change the LaunchDarkly providers back to just using the blocking constructor. I don't think it truly matches the spirit of the setProvider/setProviderAndAwait methods, but I know it is safe and predictable and keeps provider simplicity low.

Because listening to events is not reliable or what is the reasoning?

@kinyoklion
Copy link
Member Author

Because listening to events is not reliable or what is the reasoning?

Regardless what the correct solution is here I don't think that the current solution has the correct level of complexity and safety. So I am considering using what I know is the safest path for someone consuming the LD SDK, which is also the simplest way of using it. (Safety here mainly being, initialization occurs within some known bounded time. Secondarily has logical events.)

Though I don't know that I can actually make this change upon review. I recently released 1.0 versions of our providers after adding init and shutdown. I would consider changing this a breaking behavior (In most cases the SDK is initialized by the time it hits the initialize method, but sometimes it wouldn't be, and in that case the behavior would be different).

To re-iterate, I think that having an option is one solution to this. But the problem I think is multiple owners of the same data and responsibilities. To which there could be other solutions.

The option allows it to be either fully controlled by the provider or fully controlled by the OF SDK (for simple providers). If it was always fully controlled by the OF SDK always, even for runtime changes, that would be fine as well. In that case the provider would be informing the SDK of changes in state and the SDK would decide which ones to emit based on the current state. Which, as it is also tracking the state, it has something to lock and compare. It also requires an ordering guarantee of the underlying SDK events, which most SDKs should have or it would make them hard to use.

@lukas-reining
Copy link
Member

lukas-reining commented Jul 4, 2024

The option allows it to be either fully controlled by the provider or fully controlled by the OF SDK (for simple providers). If it was always fully controlled by the OF SDK always, even for runtime changes, that would be fine as well. In that case the provider would be informing the SDK of changes in state and the SDK would decide which ones to emit based on the current state. Which, as it is also tracking the state, it has something to lock and compare. It also requires an ordering guarantee of the underlying SDK events, which most SDKs should have or it would make them hard to use.

Definitely see the point.

Regardless what the correct solution is here I don't think that the current solution has the correct level of complexity and safety.

I see the point and totally support that.
Regarding the solution I would love to know what others think besides you and me.
Maybe @toddbaert and @thomaspoignant as the rest of the TC could give their opinion.
cc @Kavindu-Dodan from the GO perspective could also help.

@toddbaert
Copy link
Member

toddbaert commented Jul 4, 2024

OK I'm going to try to summarize so far.

  • The OpenFeature SDK automatically emits READY/ERROR events according to the termination of the initialize function
  • Providers support the emission of events, which frequently, are "wired up" to the native event emission mechanisms of the underlying vendor SDK (so that when the underlying SDK emits an event, the provider does as well).
  • In some cases, vendor SDKs must be initialized, and when this completes, they emit their own equivalent of a READY event.
  • This leads to a situation where a provider author wires up the underlying SDK's events, and initializes that SDK, resulting in 2 READY event emissions, one when the underlying event fires from the vendor SDK, and one when the initialize function returns.

This is a problem. We can approach this in one of two ways:

  • Require that provider authors writing providers for SDKs such as this block the initialize method until the first underlying READY event is fired, and prevent this event from being propagated (ie: refrain from "wiring up" events until the first READY from the underlying SDK is received).
    • This is what we're doing at the moment. We've done basically this for our Java and .NET flagd providers which use gRPC streams that in many ways feel similar to using an underlying SDK event stream
  • Somehow allow providers to declare to the SDK that they "wired up" events, and so there's no need for the SDK to emit any sort of READY when initialize returns
    • this removes the need for provider authors to "prevent propagation" of events, but I don't think it greatly reduces the complexity. In blocking paradigms (like Java) it's still expected that the provider is in a usable state when the initialize method returns. The SDK doesn't "wait" for an event, so authors would still have to block the init until the first event is received from the underlying SDK.

I'm open to either direction here, but I think I slightly favor keeping things as they are. @Kavindu-Dodan @thomaspoignant please weigh in.

Keep in mind, one of the values of the project is to keep things simple and consistent for application authors, even if it comes at the cost of complexity for provider authors.

@Kavindu-Dodan
Copy link
Contributor

Kavindu-Dodan commented Jul 4, 2024

Interesting discussion. I think the underlying issue is there even in flagd provider. we have some special logic to avoid double ready events. For example, in Java [1] we use custom state change and in Go [2], the provider internally omits the initial ready event.

Somehow allow providers to declare to the SDK that they "wired up" events

I think this could solve the provider implementation complexity and simplify the initialization ready (or error) events.

@Kavindu-Dodan from the GO perspective could also help.

In go, we have an event channel that needs to be exposed by the provider if it implements eventing. This has some advantages. But still, like in flagd Go provider [2], special logic is needed to filter events for startup.

Keep in mind, one of the values of the project is to keep things simple and consistent for application authors, even if it comes at the cost of complexity for provider authors.

If we provide an option to skip startup events, then complexity is at SDK maintainers. But I think this is fine. Eventing implementations are already somewhat complex :)

[1] - https://github.com/open-feature/java-sdk-contrib/blob/main/providers/flagd/src/main/java/dev/openfeature/contrib/providers/flagd/FlagdProvider.java#L160-L163
[2] - https://github.com/open-feature/go-sdk-contrib/blob/main/providers/flagd/pkg/provider.go#L101-L110

@kinyoklion
Copy link
Member Author

Keep in mind, one of the values of the project is to keep things simple and consistent for application authors, even if it comes at the cost of complexity for provider authors.

As a value proposition I think that complexity in the OF SDK is better than provider complexity which is better than application author complexity.

Solving the problem fewer times will generally mean fewer, or at least more easily remedied, problems for application authors. It will also mean fewer footguns for provider authors. In this case the proposal here doesn't remove a footgun, it basically lets you toggle the footgun off after you get shot with it. Which is why it is still not ideal, but I am not sure of a better solution.

The SDK doesn't "wait" for an event, so authors would still have to block the init until the first event is received from the underlying SDK.

The change this makes in the LaunchDarkly provider is that it would remove shared state and just allow setting a thread monitor (or completeable future) in a single event handler unconditionally and also emitting the event. It removes the decision about also emitting that event based on state. So it appreciably decreases complexity.

It also provides better behavior in the more simplified case (which I will probably switch to still). The blocking constructor could timeout, we attach up event handlers, and then throw. If the SDK does eventually initialize, then it can emit a ready event. The first ready/error event doesn't have any special behavior to consider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants