Implement graceful shutdown #236
base: master
Conversation
When WANT_GRACEFUL_SHUTDOWN is defined, the following events will happen on SIGTERM:
1. Drain the backlog on the listening socket by accepting as many connections as it can. Previously they were likely to get a connection reset error.
2. Close the listening socket so no more clients can connect.
3. Check every 100ms whether there are any active connections. An active connection is one that is part-way through handling a request. By not counting keep-alive connections, this prevents bjoern from living longer than necessary.
4. If after 30 seconds there are still active connections, shut down anyway.
Also, after receiving SIGTERM, the server will stop keeping connections alive.
- graceful-shutdown-closes-socket tests that the socket is closed upon a SIGTERM being received.
- graceful-shutdown-accept-backlog tests that any pending connections are accepted.
- graceful-shutdown-timeout shows that active connections will be dropped after 30 seconds.
- graceful-shutdown-keep-alive shows that inactive-but-open connections will not block shutdown.
Terrific work! Thanks so much for working on this!
bjoern/server.c
Outdated
ev_signal_stop(mainloop, watcher);
#ifdef WANT_SIGNAL_HANDLING
ev_timer_stop(mainloop, &timeout_watcher);
#endif
#ifdef WANT_GRACEFUL_SHUTDOWN
ev_signal_stop(mainloop, &thread_info->sigterm_watcher);
ev_timer_stop(mainloop, &timeout_watcher);
Is it safe to stop the same timer twice?
Btw, do you know if it is possible to receive a timeout_watcher callback during ev_signal_on_sigxxx? If so, is it a problem? Shall we move stopping the timeout watcher to the very top?
Good point! I'll move the timeout watcher up to the top and shuffle the code around so that it only gets stopped once.
The documentation for ev_TYPE_stop (http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod) says "Stops the given watcher if active, and clears the pending status (whether the watcher was active or not)." So it looks like it's safe to stop it more than once. Because it also clears the pending status, we shouldn't receive a timeout_watcher callback during this handler either.
@@ -11,6 +11,10 @@ static PyObject*
run(PyObject* self, PyObject* args)
{
  ServerInfo info;
#ifdef WANT_GRACEFUL_SHUTDOWN
  info.active_connections = 0;
Is there a built in way in libev to handle connection draining so that we don't have to keep a counter? I'm worried about counting bugs.
Unfortunately no, going through the documentation I can't see any way to do that. In case it helps, I went through a couple of iterations of the counting code and this was by far the simplest one I came up with - counting connections as "active" once http-parser tells us a message has begun, and then not counting them once we're done sending the response.
I contemplated doing something like storing all of the requests in an array/linked list/hash map on the ThreadInfo struct, and then watching for that to either drop to zero entries, or for none of the existing entries to be handling a request. That would eliminate counting bugs, but it would also be significantly more complex, as we'd need to handle growing the array occasionally if there's an influx of requests. I think we'd also need to store some additional state on the Request struct for whether or not it's currently handling a request. Given the complexity, that's why I decided to go for the simpler approach :)
That said, I think it's worth doing some testing on dropping connections halfway through requests at various points in the lifecycle to see how the counting holds up. I'll go back and do that, and report on my findings :)
My testing found something - aborted connections weren't being accounted for. I've fixed that up and added some tests for it as well.
Why not implement counting as follows:
- A successful accept increments the counter
- A close decrements the counter

So, move it to ev_io_on_requests and close_connection?
We could do that! That would certainly be simpler. The downside to doing that is that any persistent connections that aren't being actively used will hold the event loop open until the timeout is reached. Is that acceptable?
ev_signal_stop(mainloop, watcher);
ev_timer_stop(mainloop, &timeout_watcher);

// don't shutdown immediately, but start a timer to check if we can shutdown in 100ms.
Do we still need this if we change the counting to what I suggested above? It feels like this timeout stuff is unnecessarily complicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would still need it, yes - without this there's nothing checking whether there are any active connections, so it would always take 30 seconds to shut down (i.e. the other timeout would be reached).
Thinking a little bit more on this, I could remove the ev_shutdown_timeout_ontick function and its associated timer if I did the following:
- When SIGTERM is received, store ev_now(loop) + 30 in an ev_tstamp on the server
- Make ev_shutdown_ontick check if shutdown_time <= ev_now(loop), and use that to stop the loop

That feels a bit simpler to me than the timeout juggling it's currently doing.
@nhoad are you planning to work on this some more?
I'm happy with the branch as it is, but if you'd like me to make any changes just let me know 😄
Hi! This PR attempts to implement graceful shutdown per the recommendation in this comment: #91 (comment). The commit message provides more detail on exactly what happens.
Note that one distinct difference from the suggested implementation is that the listener socket is closed after the backlog is drained. For my use case (running many instances of my HTTP server using SO_REUSEPORT) this works better, as it ensures that any new connections go to other instances ASAP and that this instance can shut down relatively quickly. Based on further discussion on the issue, I think this would be the most beneficial behavior for others too. If you'd like that behavior to be configurable as well, just let me know and I can add another config flag for it.
(hopefully) fixes #91.