Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add non-exiting methods and implement SIGHUP reload #405

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

samurai00
Copy link
Contributor

Hey team! I've tried to make a few small tweaks to our server to hopefully improve its flexibility and add some reload functionality. Here's what's new:

  • Added pub fn try_bootstrap(&mut self) -> Result<bool> that doesn't call exit
  • Refactored pub fn bootstrap(&mut self) to use the new try_bootstrap method
  • Added pub fn run_server(mut self, enable_daemon: bool) -> Result<bool> that doesn't call exit
  • Kept pub fn run_forever(mut self) -> ! the same, but now it calls run_server under the hood
  • Implemented SIGHUP catching for reloading
  • Added an example server_reload to show how to reload within a tokio runtime

Related issues:

@eaufavor
Copy link
Member

I think it is nice to break down the functions so that it is more flexible.

On the other hand, the reload in the example has a downtime (the old service shuts down before the new service starts). Do you have a plan to address that?

To me, the way of graceful reload is the following
when the reload signal arrives:

  1. start the new service and wait for the listeners to be passed to it
  2. signal the old service to gracefully exit
  3. wait for grace_period_seconds before fully stopping/killing the old service

The current run_server() does 2) + 3) before returning for the example to do 1). So there is a window that the service appear to be offline.

@samurai00
Copy link
Contributor Author

I agree that the new service should be started before shutting down the old one, and I thought I had attempted to do this in the example. After conducting some simple tests, it seemed to work as expected.

However, more thorough testing might be necessary to ensure everything is functioning correctly. I'll conduct some additional tests to verify this.

@johnhurt johnhurt self-assigned this Oct 4, 2024
@samurai00
Copy link
Contributor Author

I think it is nice to break down the functions so that it is more flexible.

On the other hand, the reload in the example has a downtime (the old service shuts down before the new service starts). Do you have a plan to address that?

To me, the way of graceful reload is the following when the reload signal arrives:

  1. start the new service and wait for the listeners to be passed to it
  2. signal the old service to gracefully exit
  3. wait for grace_period_seconds before fully stopping/killing the old service

The current run_server() does 2) + 3) before returning for the example to do 1). So there is a window that the service appear to be offline.

After trying, I found that in the example, the actual behavior is:

  1. signal the old service to gracefully exit
  2. start the new service and wait for the listeners to be passed to it

These two steps can be considered to occur simultaneously.

If the old service doesn't start to gracefully exit, the call to bootstrap() during the upgrade will fail, preventing the new service from actually starting.
It seems there wouldn't be much difference if we notify the old service to start gracefully exiting before calling bootstrap().

From analysis and logs, it appears that in non-high concurrency situations, from the time the old service begins to gracefully exit until the new service successfully starts, requests can continue to work without downtime.

Logs:

[2024-10-08T03:17:27Z INFO  pingora_core::server] SIGHUP received, sending socks and gracefully reloading
[2024-10-08T03:17:27Z INFO  pingora_core::server] Trying to send socks
[2024-10-08T03:17:27Z WARN  pingora_core::server::transfer_fd] server not ready, will try again in 1s
[2024-10-08T03:17:27Z INFO  pingora_core::server] Bootstrap starting
[2024-10-08T03:17:27Z ERROR pingora_core::server::transfer_fd] No incoming socket transfer, sleep 1s and try again
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 109
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 110
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 111
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 112
[2024-10-08T03:17:28Z INFO  pingora_core::server] listener sockets sent
[2024-10-08T03:17:28Z INFO  pingora_core::server] Bootstrap done
[2024-10-08T03:17:28Z INFO  pingora_core::server] Server starting
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 113
... similar debug logs omitted ...
[2024-10-08T03:17:33Z DEBUG server_reload::app::echo] request count: 136
[2024-10-08T03:17:33Z INFO  pingora_core::server] Broadcasting graceful shutdown
[2024-10-08T03:17:33Z INFO  pingora_core::server] Graceful shutdown started!
[2024-10-08T03:17:33Z INFO  pingora_core::server] Broadcast graceful shutdown complete
[2024-10-08T03:17:33Z INFO  pingora_core::server] Graceful shutdown: grace period 5s starts
[2024-10-08T03:17:33Z INFO  pingora_core::services::listening] Shutting down 0.0.0.0:6145
[2024-10-08T03:17:33Z INFO  pingora_core::server] service exited.
[2024-10-08T03:17:33Z DEBUG server_reload::app::echo] request count: 137
... similar debug logs omitted ...
[2024-10-08T03:17:38Z DEBUG server_reload::app::echo] request count: 160
[2024-10-08T03:17:38Z INFO  pingora_core::server] Graceful shutdown: grace period ends
[2024-10-08T03:17:38Z INFO  pingora_core::server] Waiting for runtimes to exit!
[2024-10-08T03:17:38Z DEBUG server_reload::app::echo] request count: 161
... similar debug logs omitted ...
[2024-10-08T03:17:43Z DEBUG server_reload::app::echo] request count: 184
[2024-10-08T03:17:43Z INFO  pingora_core::server] All runtimes exited, exiting now
[2024-10-08T03:17:43Z DEBUG server_reload::app::echo] request count: 185
[2024-10-08T03:17:43Z INFO  server_reload] Reload: true
...

@samurai00 samurai00 force-pushed the reload-signal branch 2 times, most recently from 951fbfc to b9a6827 Compare November 4, 2024 03:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants