fix: make service creation timeouts in a cluster more robust (CloudNetService#1085)
### Motivation
The current handling of service starts in a cluster has a small flaw in
how creation timeouts are handled:
```mermaid
flowchart TB
A["Head Node"] -->|Start Request| B["Node to start on"]
B -- Create Result --> A
B -->|Publish| C["All other components"]
```
The head node waits 5 seconds for the target node to start the service.
If no result arrives in that time, it either tries to re-create the
service or continues with normal ticking (which might trigger another
service start attempt). However, the remote node is unaware that the
service creation timed out; it still registers the service locally and
publishes its info to the cluster, which can lead to duplicate service
creations.
### Modification
The new system accounts for delays and cleans up after failed or
timed-out service creations:
```mermaid
flowchart TB
A1["Head Node"] -->Z1["Start request with timeout (20 seconds)"]
Z1-->|To target node| A2
C1["Head Node Register Try"]
C1 -->|"Success (Sent by Head Node to Target Node)"| D2
C1 -->|Failure| G2
Z1 -->|Timeout| G2
C2-.->|"TTL exceeded"| G2
A2["Request received"]-->B2["Service created"]-->C2["Register to waiting services (TTL: 1 Minute)"]-->|Responds with Create Result| C1
D2["Removal from unaccepted services"]
D2-->|Success| E2["Register as local Service"]
D2-->|TTL exceeded| H2["Unregister from Head Node"]
G2["Auto remove"]
E2-->|Publish of Service Info| F2["All other components"]
```
This way the head node takes full control over service creations, and a
node can no longer act independently of the head node. This ensures that
each service registration in the cluster happens once, and only once,
without side effects for later service starts requested by the head
node.
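
The target-node side of the flow above (waiting services with a TTL,
acceptance by the head node, auto removal) could look roughly like the
following. This is a minimal sketch, not the code from this change:
`PendingServiceRegistry`, `AcceptResult` and all method names are
hypothetical, and the injected clock exists only to make the TTL
testable.

```java
import java.time.Duration;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch; none of these names are taken from the CloudNet code base.
final class PendingServiceRegistry {

  enum AcceptResult { REGISTERED, EXPIRED_OR_UNKNOWN }

  // service id -> deadline (clock millis) until which the head node may accept it
  private final Map<UUID, Long> pendingDeadlines = new ConcurrentHashMap<>();
  // services the head node accepted; only these would be published to the cluster
  private final Set<UUID> localServices = ConcurrentHashMap.newKeySet();

  private final long ttlMillis;
  private final LongSupplier clock;

  PendingServiceRegistry(Duration ttl, LongSupplier clock) {
    this.ttlMillis = ttl.toMillis();
    this.clock = clock;
  }

  // Target node: the service was created but is not yet a local service.
  void registerPending(UUID serviceId) {
    this.pendingDeadlines.put(serviceId, this.clock.getAsLong() + this.ttlMillis);
  }

  // Target node: the head node reported a successful registration.
  AcceptResult accept(UUID serviceId) {
    Long deadline = this.pendingDeadlines.remove(serviceId);
    if (deadline == null || this.clock.getAsLong() > deadline) {
      // TTL exceeded or never requested: the caller should unregister
      // the service from the head node instead of publishing it.
      return AcceptResult.EXPIRED_OR_UNKNOWN;
    }
    this.localServices.add(serviceId);
    return AcceptResult.REGISTERED;
  }

  // Periodic cleanup ("auto remove"): drops entries the head node never accepted.
  int removeExpired() {
    long now = this.clock.getAsLong();
    int removed = 0;
    for (Map.Entry<UUID, Long> entry : this.pendingDeadlines.entrySet()) {
      if (now > entry.getValue() && this.pendingDeadlines.remove(entry.getKey(), entry.getValue())) {
        removed++;
      }
    }
    return removed;
  }

  boolean isPending(UUID serviceId) {
    return this.pendingDeadlines.containsKey(serviceId);
  }

  boolean isLocalService(UUID serviceId) {
    return this.localServices.contains(serviceId);
  }
}
```

In this sketch a service only becomes a local service through
`accept`, so a creation that the head node already gave up on can never
be published by the target node on its own.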
### Result
Services should no longer get registered as "ghosts". Registration is
fully controlled by the head node, and services are removed properly if
a service creation timeout occurs.
##### Other context
Fixes CloudNetService#994