Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lost sockets #353

Open
pontscho opened this issue Jan 11, 2023 · 13 comments
Open

Lost sockets #353

pontscho opened this issue Jan 11, 2023 · 13 comments

Comments

@pontscho
Copy link
Contributor

pontscho commented Jan 11, 2023

Hi,

I think I found two edge cases when lwan loose sockets. As usual I test my web server with that tool I mentioned in #348.

  1. In that case when I start just one non-keepalive HTTP call and that socket is closing non-gracefully (for example the client loosing its network connection or binary is aborted, etc), in that case in this snippet:
        int timeout = turn_timer_wheel(&tq, t, epoll_fd);
        int n_fds = epoll_wait(epoll_fd, events, max_events, timeout);

the turn_timer_wheel() will return -1 and epoll_wait() will waiting "forever" (practically until the next call). This isn't a big bug, but if the client code (for ex. a Lua script) have to react (I've modified the unref function in lwan-mod-lua.c to call a function when coro aborts) to this event can't able to do this until something doesn't happening with that listening socket.

  1. In that case the connection went offline immediately after it accepted by lwan (e.g. client lost its network connectivity, or it's aborted, etc), lwan wasn't able to create a coro for that new connection and lwan won't close it properly because the timeout queue doesn't knows anything about this "live" connection.

So, something like this:

        int fd = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);

accepts the connection, and the connection died immediately the

            if (!conn->coro) {
                if (UNLIKELY(!spawn_coro(conn, &switcher, &tq))) {
                  ...
                }
            }

won't start a new corutin and the timeout_queue_insert() call in the spawn_coro() never adds this connection to the timeout queue.

Just for testing and debugging I've added two workarounds:

  1. turn_timer_wheel() returns with 1000 instead of minus 1
  2. I hacked the socket into the timeout queue after it accepted
    Both workarounds solves those problems above.

Unfortunately both cases are appearing in the wild after we started using our lwan based media server in production.

Thanks!
pontscho

@lpereira
Copy link
Owner

lpereira commented Jan 11, 2023 via email

@pontscho
Copy link
Contributor Author

pontscho commented Jan 11, 2023

I feared this would happen when I removed the eventfd that nudges the epoll back when I put multiple listener sockets. Can you confirm that the thread that is accepting the connection is the same that would handle that socket? If they're not the same, then the scenario you described is plausible.

Not sure yet, but I saw clues about it in my log. I planned to extend those printfs with thread id tomorrow but this multi-thread-access scenario explains some behaviours in the logs.

Also, I had no clue you were running Lwan in production! Is that something public?

Not yet, but if you can give me another access (email, etc) you will be surprised. :)

@lpereira
Copy link
Owner

lpereira commented Jan 11, 2023 via email

@lpereira
Copy link
Owner

Could you please try applying this (crude) patch and see if it fixes the issue? If so, then this should confirm my hypothesis:

diff --git a/src/lib/lwan-thread.c b/src/lib/lwan-thread.c
index e6d2352b..b154f9ca 100644
--- a/src/lib/lwan-thread.c
+++ b/src/lib/lwan-thread.c
@@ -820,6 +820,10 @@ static bool accept_waiting_clients(const struct lwan_thread *t,
                 send_last_response_without_coro(t->lwan, conn, HTTP_UNAVAILABLE);
                 conn->flags = 0;
             }
+
+            if (conn->thread != t) {
+                pthread_kill(conn->thread->self, SIGPOLL);
+            }
 
             continue;
         }
@@ -999,6 +1004,12 @@ static void create_thread(struct lwan *l, struct lwan_thread *thread)
 
     if (pthread_attr_destroy(&attr))
         lwan_status_critical_perror("pthread_attr_destroy");
+
+    sigset_t set;
+    sigemptyset(&set);
+    sigaddset(&set, SIGPOLL);
+    if (pthread_sigmask(SIG_BLOCK, &set, NULL) != 0)
+        lwan_status_critical("could not block SIGPOLL");
 }
 
 #if defined(__linux__) && defined(__x86_64__)

@pontscho
Copy link
Contributor Author

pontscho commented Jan 12, 2023

Hi,

I've added a printf() before pthread_kill() just for fun:

die alive
die alive
die alive
die alive
die alive
I/O possible

It dies immediately after I start my test tool.

@pontscho
Copy link
Contributor Author

Defenetelly timeout handler "losing" fds I think.

With the original code life cycle of the fd was:

0x7f8a17e6d700 ACCEPT: 560
0x7f8a1866e700 REMOVE: 560
0x7f8a1866e700 CLOSE: 560
0x7f8a17e6d700 ACCEPT: 560

(pthread_self(), event, fd)
events:

  • ACCEPT: accept_waiting_clients()
  • INSERT: timeout_queue_insert()
  • BUMP: timeout_queue_move_to_last()
  • REMOVE+CLOSE: timeout_queue_expire()

And this socket was never closed by timeout_queue_expire_waiting() or timeout_queue_expire().

After this probe I added timeout_queue struct to lwan_thread struct and accept_waiting_clients () inserts the new fd into this thread specific timeout queue instead of after the coro was spawned.

After I append the new socket to the timeout queue:

0x7f218a0a9700 ACCEPT: 999
0x7f218a0a9700 INSERT: 999
0x7f218e471700 BUMP: 999
0x7f218e471700 BUMP: 999
0x7f218e471700 BUMP: 999
0x7f218e471700 BUMP: 999
0x7f218e471700 BUMP: 999
0x7f218e471700 BUMP: 999
0x7f218e471700 REMOVE: 999
0x7f218e471700 CLOSE: 999
0x7f2191c7a700 ACCEPT: 999
0x7f2191c7a700 INSERT: 999

Much better, but that socket still stuck. I'm afraid you know better how is this timeout wheel thing works.

@pontscho
Copy link
Contributor Author

pontscho commented Jan 12, 2023

This very ugly patch solves not just these two issues but #348 too. So, the timeout wheel / queue loosing somewhere sockets, not just where lwan accepts new incoming connections.

@pontscho
Copy link
Contributor Author

pontscho commented Jan 13, 2023

I've changed that list based approach a more conservative version. Instead of list I used an old fashion array without locks based on that idea the fd array is a static array.

Only weak points are speed of clock_gettime() (because needs just second precision possible to change this call to a "clock service" (for example a thread just increment a counter)) and timeout loop. But I'm not really afraid about that function because needs only 0.5-1 millisecond with 4+ workers and 2-3 milliseconds with only one worker for more than million sockets. This patch kept the original time wheel infrastructure for scheduling timeouts. And it also solves these two problems and that in #348.

diff --git a/lwan-tq.c b/lwan-tq.c

/*
 * lwan - web server
 * Copyright (c) 2019 L. A. F. Pereira <[email protected]>
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301,
 * USA.
 */

#include <unistd.h>
#include <errno.h>

#include "lwan-private.h"
#include "lwan-tq.h"

inline void timeout_queue_insert(struct timeout_queue *tq, struct lwan_connection *new_node)
{
    struct timespec now;

    if (UNLIKELY(clock_gettime(CLOCK_MONOTONIC, &now) < 0))
        lwan_status_perror("clock_gettime");

    new_node->time_to_expire = now.tv_sec + tq->move_to_last_bump;
}

inline void timeout_queue_move_to_last(struct timeout_queue *tq, struct lwan_connection *conn)
{
    timeout_queue_insert(tq, conn);
}

void timeout_queue_init(struct timeout_queue *tq, const struct lwan *lwan)
{
    struct lwan_connection **conns = calloc(lwan->thread.max_fd, sizeof(struct lwan_connection *));

    if (UNLIKELY(!conns))
        lwan_status_perror("calloc");

    int offs = 0;
    const int fds = lwan->thread.max_fd * lwan->thread.count;
    for (struct lwan_connection *conn = lwan->conns; conn - lwan->conns < fds;conn++) {
        if (conn->thread->tq == tq) {
            conns[offs++] = conn;
        }
    }

    *tq = (struct timeout_queue){
        .lwan = lwan,
        .conns = conns,
        .move_to_last_bump = lwan->config.keep_alive_timeout,
        .timeout = (struct timeout){},
    };
}

inline void timeout_queue_expire(struct timeout_queue *tq, struct lwan_connection *conn)
{
    if (LIKELY(conn->coro)) {
        coro_free(conn->coro), conn->coro = NULL;
    }

    conn->time_to_expire = 0;
    int err = close(lwan_connection_get_fd(tq->lwan, conn));
    if (err < 0)
        lwan_status_error("lwan-tq close error: %d, %s", lwan_connection_get_fd(tq->lwan, conn), strerror(errno));
}

void timeout_queue_expire_waiting(struct timeout_queue *tq)
{
    struct timespec now;

    if (UNLIKELY(clock_gettime(CLOCK_MONOTONIC, &now) < 0))
        lwan_status_perror("clock_gettime");

    const int fds = tq->lwan->thread.max_fd;
    for (struct lwan_connection **conn = tq->conns;conn - tq->conns < fds;conn++) {
        if ((*conn)->time_to_expire && now.tv_sec >= (*conn)->time_to_expire) {
            timeout_queue_expire(tq, *conn);
        }
    }
}

void timeout_queue_expire_all(struct timeout_queue *tq) {}

--- lwan-thread.c
+++ lwan-thread.c
@@ -698,11 +698,10 @@
     *conn = (struct lwan_connection){
         .coro = coro_new(switcher, process_request_coro, conn),
         .flags = CONN_EVENTS_READ | flags_to_keep,
-        .time_to_expire = tq->current_time + tq->move_to_last_bump,
+        .time_to_expire = conn->time_to_expire,
         .thread = t,
     };
     if (LIKELY(conn->coro)) {
-        timeout_queue_insert(tq, conn);
         return true;
     }

@@ -742,12 +741,8 @@
          * update the date cache at this point as well.  */
         update_date_cache(t);

-        if (!timeout_queue_empty(tq)) {
-            timeouts_add(t->wheel, &tq->timeout, 1000);
-            return true;
-        }
-
-        timeouts_del(t->wheel, &tq->timeout);
+        timeouts_add(t->wheel, &tq->timeout, 1000);
+        return true;
     }

     return false;
@@ -811,6 +806,7 @@
             int r;

             conn->flags = new_conn_flags;
+            timeout_queue_insert(conn->thread->tq, conn);

             r = epoll_ctl(conn->thread->epoll_fd, EPOLL_CTL_ADD, fd, &ev);
             if (UNLIKELY(r < 0)) {
@@ -910,6 +906,7 @@

     update_date_cache(t);

+    t->tq = &tq;
     timeout_queue_init(&tq, lwan);

     lwan_random_seed_prng_for_thread(t);
@@ -1138,6 +1135,9 @@
     if (pthread_setaffinity_np(thread->self, sizeof(set), &set))
         lwan_status_warning("Could not set thread affinity");
 }
+#else
+static void
+adjust_thread_affinity(const struct lwan_thread __attribute__((unused)) *thread) {}
 #endif

 #if defined(LWAN_HAVE_MBEDTLS)
diff --git a/lwan-tq.h b/lwan-tq.h

 struct timeout_queue {
     const struct lwan *lwan;
-    struct lwan_connection *conns;
+    struct lwan_connection **conns;
     struct lwan_connection head;
     struct timeout timeout;
     unsigned int current_time;
     unsigned int move_to_last_bump;
 };

 void timeout_queue_init(struct timeout_queue *tq, const struct lwan *l);

diff --git a/lwan.h b/lwan.h

struct lwan_thread {
    struct lwan *lwan;
    struct {
        char date[30];
        char expires[30];
    } date;
    int epoll_fd;
    struct timeouts *wheel;
    int listen_fd;
    int tls_listen_fd;
    unsigned int cpu;
    pthread_t self;
+   struct timeout_queue *tq;
};

@lpereira
Copy link
Owner

lpereira commented Jan 13, 2023 via email

@pontscho
Copy link
Contributor Author

I was sure about it, that's why this is here, not PR.

Locks aren't needed because the tq is per-thread by design (and should never be exposed outside lwan-thread or have one thread access another's tq), and if adding locks helped, the bug is somewhere else.

Agree.

I drop that lock, it isn't necessary because I dropped the list, it was necessary in the first rude patch because most of the time threads were different that accept and maintain the connection and list handling in some cases caused segfaults. It’s good enough for my purposes.

Using an array like this is very inneficient, both in memory usage and run time costs. Even if the array were of tq* rather than conn*, it would still be too expensive memory-wise. I'm not sure if a thread sleeps if there's no connection in a tq either; it's been a while I've been in that part of the code.

I don’t afraid about memory consumpsion, it needs just more 8MB ram for one million socket (usually we use that socket limit in production), timing a little bit trickie. Because that patch above (not in github messages, please check the latest one on github page) uses the original timeout wheel thing, the thread is sleeping in idle mode. I removed that timeout hack before epoll() was called.

Either might help prove that there's some bug in the current code but neither are the solution to these issues.

It do, I’ve tested it all day, worked very well with my last modifications.

I don't say this is the only way for fixing this issue, just an example which good enough for me until real solution fix this. I have no idea yet where and what breaks the original timeout queue implementation.

@pontscho
Copy link
Contributor Author

I have no idea yet where and what breaks the original timeout queue implementation.

Not exactly true because

  • different threads accept and handle the connection because the pre scheduling before lwan initialise the workers and the sender doesn't send anything after accept lwan that connection loose socket because it isn't in tq and never closed
  • from some reasons closed connection stuck in tq and tq closes it multiple times (I think answer is in git log)
  • the timeout wheel loose tq and in specific cases the connection stuck in tq until next connection arrive

This three cases cause this and #348.

@lpereira
Copy link
Owner

lpereira commented Jan 19, 2023

I think I know what's happening here. I have an idea of how to fix this by adding a per-thread queue of pending file descriptors, and using something like a pipe or an eventfd, to wake the thread up like it used to be done. It should be possible to adapt Vyukov's intrusive MPSC queue to use the pre-allocated lwan_connections, and avoid extra allocations altogether. Might be useful to also revisit the CBPF program to direct the accepted socket to the right thread in the kernel, which should improve performance quite a bit (as we pin threads to CPUs and whatnot).

@pontscho
Copy link
Contributor Author

It’s sounds good. I’m happy about you have better idea. :)

For testing this problem, I run my workaround on few servers since Monday, after few million served requests I can confirm it fixed the original issues. Not optimal but works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants