Segfault running lwan in linux container with number of cores lower than number of physical cores #290
Comments
That's strange. Unfortunately that backtrace doesn't help much. Can you try
building with Debug again, but doing a clean build? (Either from that
directory without CMakeCache.txt, or in an empty directory.)
Also, how many cores are we talking about?
…On Fri, Sep 4, 2020, 10:06 diviaki ***@***.***> wrote:
I moved an lxc container running for years to a new host just to see lwan
crashing in it. I realised on the old host the number of allowed cores and
number of physical cores were same, while the new host got much more cores.
Cleaning the core limit of the container fixed the issue.
However, this is how lwan crashed with the limit on:
==6791== Use of uninitialised value of size 8
==6791== at 0x11A8D0: lwan_thread_add_client (in /usr/local/bin/lwan)
==6791== by 0x1FFEFFF4DF: ???
==6791== by 0xBFFFFFFFF: ???
==6791== by 0x1FFEFFF4DF: ???
==6791== by 0x501B847: ???
==6791== by 0x3F: ???
==6791== by 0x1FFEFFE33F: ???
==6791== by 0x8F9C18F9C18F9C18: ???
==6791== by 0x111392: lwan_main_loop.cold.28 (in /usr/local/bin/lwan)
==6791== by 0x3FF: ???
==6791== by 0xB: ???
==6791==
==6791== Invalid write of size 4
==6791== at 0x11A8DA: lwan_thread_add_client (in /usr/local/bin/lwan)
==6791== by 0x1FFEFFF4DF: ???
==6791== by 0xBFFFFFFFF: ???
==6791== by 0x1FFEFFF4DF: ???
==6791== by 0x501B847: ???
==6791== by 0x3F: ???
==6791== by 0x1FFEFFE33F: ???
==6791== by 0x8F9C18F9C18F9C18: ???
==6791== by 0x111392: lwan_main_loop.cold.28 (in /usr/local/bin/lwan)
==6791== by 0x3FF: ???
==6791== by 0xB: ???
==6791== Address 0x19089ae0 is not stack'd, malloc'd or (recently) free'd
==6791==
==6791==
==6791== Process terminating with default action of signal 11 (SIGSEGV)
This is from a Release build as trying to compile a *Debug* results in
/root/lwan/src/lib/hash.c:163: undefined reference to
'__builtin_ia32_crc32si'
and *RelWithDebInfo* segfaults the compiler at
[ 30%] Building C object
src/lib/CMakeFiles/lwan-static.dir/lwan-mod-serve-files.c.o
Sources just pulled, deb10, gcc8.
|
Success! Deleting the build folder before changing the build type allowed building Debug (RelWithDebInfo still segfaults cc). Config is 2 cores allowed out of 8 (4 HT cores) on an unprivileged lxc node. Test runs:
Release build starts with these warnings:
Debug build starts with (only relevant lines):
An nginx proxy was used for testing, relevant lines:
|
Good! Do you have a backtrace with the debug version? Can you install
address sanitizer too and rebuild?
…On Mon, Sep 7, 2020, 00:59 diviaki ***@***.***> wrote:
Success! Deleting the build folder before changing build type allowed
building debug (RelWithDebInfo still segfaults cc)
Config is 2 cores allowed out of 8 (4 HT cores) on an unprivileged lxc node.
Test runs:
- release: it works
- proxy + release: segfault
- proxy + release + valgrind: segfault (output sent earlier)
- debug: segfault
- debug + valgrind : it works
- proxy + debug: segfault
- proxy + debug + valgrind : it works
Release build starts with these warnings:
Could not set affinity for thread 0
Could not set affinity for thread 1
Debug build starts with (only relevant lines):
7401 lwan.c:723 lwan_init_with_config() Using 2 threads, maximum 262144 sockets per thread
7401 lwan-thread.c:701 lwan_thread_init() Initializing threads
7404 lwan-thread.c:453 thread_io_loop() Worker thread #1 starting
7401 lwan-thread.c:678 adjust_threads_affinity() Could not set affinity for thread 0
7401 lwan-thread.c:678 adjust_threads_affinity() Could not set affinity for thread 1
7405 lwan-thread.c:453 thread_io_loop() Worker thread #2 starting
An nginx proxy was used for testing, relevant lines:
proxy_pass http://192.168.xxx.xxx:8080;
proxy_set_header X-Real-IP $remote_addr;
|
|
Let me know if you want to ssh into the thing |
That's very curious! I'll take a look whenever I'm working on Lwan again. I'm pausing work on most of my personal projects for undetermined time, so while I appreciate your offer to SSH into that machine, I won't be able to do that right now. In the meantime, you can apply this patch here that should make Lwan work in your environment:
diff --git a/src/lib/lwan-thread.c b/src/lib/lwan-thread.c
index b1ee42da..32db393c 100644
--- a/src/lib/lwan-thread.c
+++ b/src/lib/lwan-thread.c
@@ -712,7 +712,7 @@ void lwan_thread_init(struct lwan *l)
create_thread(l, &l->thread.threads[i], n_queue_fds);
const unsigned int total_conns = l->thread.max_fd * l->thread.count;
-#ifdef __x86_64__
+#if 0
static_assert(sizeof(struct lwan_connection) == 32,
"Two connections per cache line");
/* |
Actually, I think I know what's going on. You mentioned the computer has 8 cores, but only 2 threads are being spawned... this makes me think that LXC is limiting the number of cores reported by
diff --git a/src/lib/lwan-thread.c b/src/lib/lwan-thread.c
index b1ee42da..0e844767 100644
--- a/src/lib/lwan-thread.c
+++ b/src/lib/lwan-thread.c
@@ -617,10 +617,27 @@ static bool read_cpu_topology(struct lwan *l, uint32_t siblings[])
__builtin_unreachable();
}
-
fclose(sib);
}
+ /* Some systems may lie about the number of online CPUs (obtainable with
+ * sysconf()), but don't filter out the CPU topology information from
+ * sysfs, which might reference CPU numbers higher than the amount
+ * obtained with sysconf(). */
+ for (unsigned int i = 0; i < l->n_cpus; i++) {
+ if (siblings[i] == 0xbebacafe) {
+ lwan_status_warning("Could not determine sibling for CPU %d", i);
+ return false;
+ }
+
+ if (siblings[i] > l->n_cpus) {
+ lwan_status_warning("CPU topology information says CPU %d exists, "
+ "but max CPUs is %d. Is Lwan running in a "
+ "container?", siblings[i], l->n_cpus);
+ return false;
+ }
+ }
+
return true;
}
@@ -651,6 +668,9 @@ topology_to_schedtbl(struct lwan *l, uint32_t schedtbl[], uint32_t n_threads)
{
uint32_t *siblings = alloca(l->n_cpus * sizeof(uint32_t));
+ for (uint32_t i = 0; i < l->n_cpus; i++)
+ siblings[i] = 0xbebacafe;
+
if (!read_cpu_topology(l, siblings)) {
for (uint32_t i = 0; i < n_threads; i++)
schedtbl[i] = (i / 2) % l->thread.count;
|
I pushed this patch, as it doesn't hurt if this condition is never met. Please confirm if this fixes your issue. |
Thanks for the quick patch. Applied it. The strange report on the number of cores made me think: what if I set the count higher than 4, or to something not even?
Looks like container management uses some magic? /I see this getting muddy, so if you're on a sabbatical or something, feel free to give it a break too - I use lxc for keeping stuff separated, not for limiting resources and I'm totally fine running it with unlimited cores/ |
Found some info.. I use Proxmox on the virtualisation host which in turn builds on lxc. |
Ah, yeah. I figured this could be the case. I do have a sketch of a patch that takes the number of configured (but not online) CPUs into consideration when allocating memory to calculate the affinity mask. I might push it this weekend if you'd like to test it out. |
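To make the configured/online distinction concrete, here is a small sketch (not the patch mentioned above) that reads the three numbers a process on Linux with glibc can ask for: configured CPUs, online CPUs, and the CPUs in its own affinity mask. The struct and function names are hypothetical. What each call returns inside a container depends on the container runtime, so treat the comments as assumptions rather than guarantees.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical container for the three views of "how many CPUs". */
struct cpu_view {
    long configured;        /* sysconf(_SC_NPROCESSORS_CONF) */
    long online;            /* sysconf(_SC_NPROCESSORS_ONLN) */
    int allowed;            /* CPUs in this process's affinity mask */
    int highest_allowed_id; /* largest CPU id in that mask, or -1 */
};

/* Fill *v; returns 0 on success, -1 if the affinity mask cannot be read. */
int cpu_view_query(struct cpu_view *v)
{
    cpu_set_t set;

    assert(v != NULL);
    v->configured = sysconf(_SC_NPROCESSORS_CONF);
    v->online = sysconf(_SC_NPROCESSORS_ONLN);

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;

    v->allowed = CPU_COUNT(&set);
    v->highest_allowed_id = -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set))
            v->highest_allowed_id = cpu;
    }
    return 0;
}
```

If `highest_allowed_id` is greater than or equal to the online count, any table sized by the online count and indexed by CPU id would be too small, which is the kind of mismatch discussed in this thread.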
Hey @diviaki -- if you'd like to try, I pushed a patch that should work in your setup regardless of how many cores are allocated by LXC. |
Still on a 2 out of 8 cores setup, where cores 5 and 6 are assigned out of 0..7
|
Thanks for testing. Yeah, that seems related even though it exploded somewhere else. Let's see if I can carve some time this weekend for this. (Will try testing it locally, too.) |