"safety mode" #7952
brianehlert started this conversation in Ideas
Replies: 1 comment
It is not uncommon for a configuration change to leave an object marked invalid because the resulting configuration could not be applied.
This happens in spite of the layers of schema, business logic, and NGINX validation present within the system.
Ideally, errors are caught by schema validation or business logic validation, but this is not always possible. The last chance for validation is NGINX itself, when the configuration defined through the K8s API is rendered as nginx.conf and NGINX attempts to validate it.
A number of cases can only be validated by NGINX itself: snippets, performance tuning directives that depend on other directives (directive settings can clash), DNS resolvers that are valid only at a point in time and can grow stale, and so on.
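To make that last step concrete, here is a minimal Python sketch of asking NGINX to test a rendered configuration file. It assumes an `nginx` binary on the PATH and relies only on its `-t` (test configuration) and `-c` (configuration file) options. The function name `validate_nginx_conf` and the injectable `run` parameter are hypothetical, and this is not a description of how the controller itself is implemented.

```python
import subprocess
from typing import Callable, Tuple


def validate_nginx_conf(
    conf_path: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Tuple[bool, str]:
    """Ask NGINX to test a rendered configuration file.

    Hypothetical helper: returns (ok, message). `run` is injectable so the
    function can be exercised without an nginx binary being installed.
    """
    # `nginx -t` tests the configuration and exits non-zero on failure;
    # `-c` selects the configuration file to test.
    result = run(
        ["nginx", "-t", "-c", conf_path],
        capture_output=True,
        text=True,
    )
    # Prefer stderr for the report, fall back to stdout if it is empty.
    message = (result.stderr or result.stdout or "").strip()
    return result.returncode == 0, message
```

Something like a clashing directive or a bad snippet would only surface here, as a non-zero exit code and a message, after the schema and business logic checks have already passed.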
The other side of this equation is that NGINX Ingress Controller lives in a dynamic environment. Its pods can come and go and move about the cluster. The pods of backend services can also come, go, and scale.
Configuration changes are expected at intervals on the order of a second or less, and the need for a valid startup configuration is very high, because new pods have only the K8s API as their source of truth for the configuration.
Operational actions, such as cycling an ingress controller deployment after a configuration apply, can produce unexpected outcomes that would not have occurred without that action.
Where this all becomes exceedingly complex is when the system marks a configuration object as invalid. What should the system do?
The system relies on the K8s API and the objects it is watching to be the source of truth for the configuration.
When all of the defined objects are good, the end result is functional NGINX configuration, and everyone is happy.
When an object change results in an invalid configuration, the system attempts to isolate the error to the individual object and writes an error back to that object.
(This cannot always be done well when multiple individual object changes happen in rapid succession. In that case the system optimizes for speed and can batch many changes into a single configuration change. No one wants their ingress controller to take minutes to start a pod or to process hundreds or thousands of configuration objects. This feedback was addressed years ago.)
An invalid object must be corrected by whoever made the object change, because that object is the source of truth for new pods. Currently running pods keep the previous configuration in memory and continue to serve traffic.
We have had suggestions such as not applying invalid objects, which is technically what the system does. What the system does not do is maintain a previous state or attempt to infer that any single moment of configuration is more "correct" than any other. The system assumes that every pod is ephemeral and that the desired configuration state is defined in the K8s API, so pod startup without any knowledge of a previous configuration must be assumed at all times.
It is possible to run a UAT deployment of NGINX Ingress Controller with 1 replica, either in a separate cluster or in the live cluster, for no other purpose than to validate object changes. We would recommend running it in a separate namespace.
However, the system provides no automagic promotion of an object change from this namespace to the production namespace. It is expected that this process would be pipeline driven and include a task that queries the changed object and inspects it for errors.
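A pipeline task of that kind could look roughly like the Python sketch below. It assumes the changed object reports its validation result in `status.state`, `status.reason`, and `status.message`, with `Valid` as the passing state. Those field names are an assumption to check against your own resources, and objects that do not expose such a field would need a different check, for example their events. The helper names `fetch_object` and `inspect_status` are hypothetical; `fetch_object` shells out to `kubectl get ... -o json`.

```python
import json
import subprocess
from typing import Any, Dict, Tuple


def fetch_object(kind: str, name: str, namespace: str) -> Dict[str, Any]:
    """Read one object from the cluster as JSON using kubectl."""
    out = subprocess.run(
        ["kubectl", "get", kind, name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


def inspect_status(obj: Dict[str, Any]) -> Tuple[bool, str]:
    """Return (ok, detail) based on the status reported on the object.

    Assumption: the controller writes status.state / status.reason /
    status.message; anything other than "Valid" is treated as a failure.
    """
    status = obj.get("status") or {}
    state = status.get("state")
    if state is None:
        # The controller may not have processed the change yet.
        return False, "no status reported yet"
    parts = [p for p in (status.get("reason"), status.get("message")) if p]
    detail = state if not parts else f"{state}: {' - '.join(parts)}"
    return state == "Valid", detail
```

In a pipeline, the change would be applied to the validation namespace first, then `inspect_status(fetch_object(kind, name, namespace))` would gate promotion, failing the job when `ok` is false. Because status is written asynchronously, a real task would probably retry for a short time before giving up.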
We have received requests for a "validation webhook", which is fundamentally no different from what could be achieved with the above suggestion. It is still a behavior change to the pipeline or operator.
We don't currently have a better answer to this concern, and we are open to suggestions that: