[Help] The service suddenly **hangs** and becomes unresponsive. #968

Open

rambo-panda opened this issue Dec 27, 2024 · 1 comment

rambo-panda (Contributor) commented Dec 27, 2024

Recently, I deployed a Node.js service in a Docker container, and it occasionally becomes unresponsive. Since I'm not very familiar with Skia, I'm not sure whether this is a Skia issue, so I'm asking for ideas on how to approach it (willing to pay for advice).

  • Phenomenon: The service does not respond to any signals. The process sits in the `Ssl` state, and JavaScript execution inside it appears to be suspended, as if in a stop-the-world pause (for example, the ZooKeeper heartbeat is no longer sent).

  • Suspicious business process

    1. Caching images (most of the images are the same, so the cache is reused) [the biggest suspect, in my opinion]; a safer variant is sketched at the end of this comment.
       const cache = new Map();
       const getImg = async (url) => {
         if (cache.has(url)) {
         const ret = cache.get(url);
           ret._timer();
           return ret;
         }
       
         cache.set(
           url,
           get(url)
             .then(loadImage)
             .then((img) => {
               // This is a unified encapsulation, similar to a debounce algorithm.
               let _timer = () => {};
               img._timer = () => {
                  _timer = clearTimeout.bind(
                     null,
                     setTimeout(cache.delete.bind(cache, url), lts).unref()
                   );
               };
               cache.set(url, img);
               return img;
             }),
         );
       
         return getImg(url);
       };
       I suspect that when the `lts` timeout fires and the cache entry is deleted, the `img` currently passed to `drawImage(img)` gets garbage collected while Skia is still using it, leading to the mutex wait; or that binding a private method (`_timer`) onto `img` itself is the problem.
    2. Because there are many line segments to draw, I borrowed the idea of fibers and force the loop to yield periodically (see the sketch at the end of this comment):
       for (let i = 0; i < dots.length; i++) {
         if (i % 7_000 === 0) {
           await sleep(2);
         }
         ctx.lineTo(...dots[i]);
       }
  • env info

    System:
    OS: Linux 3.10, Ubuntu 22.04.1 LTS (Jammy Jellyfish)
    CPU: (40) x64 Intel Xeon Processor (Cascadelake)
    Memory: 60.21 GB / 78.66 GB
    Container: Yes
    Shell: 5.1.16 - /bin/bash
    Glibc: 2.35 (Ubuntu GLIBC 2.35-0ubuntu3.1)
    Binaries:
    Node: 20.16.0 - /usr/bin/node
    npm: 10.8.1 - /usr/bin/npm
    
     @napi-rs/[email protected]
    
  • Call stack:

    cat /proc/14972/stack
    [<ffffffffaa90cfa6>] futex_wait_queue_me+0xc6/0x130
    [<ffffffffaa90dc8b>] futex_wait+0x17b/0x280
    [<ffffffffaa90f9d6>] do_futex+0x106/0x5a0
    [<ffffffffaa90fef0>] SyS_futex+0x80/0x190
    [<ffffffffaaf75d9b>] system_call_fastpath+0x22/0x27
    [<ffffffffffffffff>] 0xffffffffffffffff
    
  • gdb stack:

    (gdb) bt
    #0  0x00007fe0b14da197 in __pthread_attr_getaffinity_new (attr=0x0, cpusetsize=393, cpuset=0x0) at ./nptl/pthread_attr_getaffinity.c:35
    #1  0x0000000000000000 in ?? ()
    
    
    (gdb) thread apply all bt
    
    Thread 11 (LWP 14982 "node"):
    #0  0x00007fe0b14da197 in __pthread_attr_getaffinity_new (attr=0x0, cpusetsize=393, cpuset=0x0) at ./nptl/pthread_attr_getaffinity.c:35
    #1  0x0000000000000000 in ?? ()
    
    Thread 10 (LWP 14981 "node"):
    #0  0x00007fe0b14da197 in __pthread_attr_getaffinity_new (attr=0x0, cpusetsize=393, cpuset=0x0) at ./nptl/pthread_attr_getaffinity.c:35
    #1  0x0000000000000000 in ?? ()
    
    Thread 9 (LWP 14980 "node"):
    #0  0x00007fe0b14da197 in __pthread_attr_getaffinity_new (attr=0x0, cpusetsize=393, cpuset=0x0) at ./nptl/pthread_attr_getaffinity.c:35
    #1  0x0000000000000000 in ?? ()
    
    Thread 8 (LWP 14979 "node"):
    #0  0x00007fe0b14da197 in __pthread_attr_getaffinity_new (attr=0x0, cpusetsize=393, cpuset=0x0) at ./nptl/pthread_attr_getaffinity.c:35
    #1  0x0000000000000000 in ?? ()
    
    Thread 7 (LWP 14978 "node"):
    #0  0x00007fe0b14da197 in __pthread_attr_getaffinity_new (attr=0x0, cpusetsize=393, cpuset=0x0) at ./nptl/pthread_attr_getaffinity.c:35
    #1  0x0000000000000000 in ?? ()
    
    Thread 6 (LWP 14977 "node"):
    #0  0x00007fe0b14da059 in __GI___pthread_attr_copy (target=0x690a550, source=0x690a548) at ./nptl/pthread_attr_copy.c:47
    #1  0x0000000000000002 in ?? ()
    #2  0x00003d6343705391 in ?? ()
    #3  0x00007fe0aaffbdb0 in ?? ()
    #4  0x00000000013b0dd3 in v8::internal::FindClosestElementsTransition(v8::internal::Isolate*, v8::internal::Map, v8::internal::ElementsKind, v8::internal::ConcurrencyMode) ()
    Backtrace stopped: previous frame inner to this frame (corrupt stack?)
    
    Thread 5 (LWP 14976 "node"):
    #0  0x00007fe0b14da197 in __pthread_attr_getaffinity_new (attr=0x0, cpusetsize=393, cpuset=0x0) at ./nptl/pthread_attr_getaffinity.c:35
    #1  0x0000000000000000 in ?? ()
    
    Thread 4 (LWP 14975 "node"):
    #0  0x00007fe0b14da197 in __pthread_attr_getaffinity_new (attr=0x0, cpusetsize=393, cpuset=0x0) at ./nptl/pthread_attr_getaffinity.c:35
    #1  0x0000000000000000 in ?? ()
    
    Thread 3 (LWP 14974 "node"):
    #0  0x00007fe0b14da059 in __GI___pthread_attr_copy (target=0x690a550, source=0x690a548) at ./nptl/pthread_attr_copy.c:47
    #1  0x0000000000000002 in ?? ()
    #2  0x00003d6343705391 in ?? ()
    #3  0x00007fe0b0c40d80 in ?? ()
    #4  0x00000000013b0dd3 in v8::internal::FindClosestElementsTransition(v8::internal::Isolate*, v8::internal::Map, v8::internal::ElementsKind, v8::internal::ConcurrencyMode) ()
    Backtrace stopped: previous frame inner to this frame (corrupt stack?)
    
    Thread 2 (LWP 14973 "node"):
    #0  0x00007fe0b156ed18 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x0000000ab143efc0 in ?? ()
    #2  0x00007fe0b143fc70 in ?? ()
    #3  0xffffffff00000400 in ?? ()
    #4  0x0000000000000000 in ?? ()
    
    Thread 1 (LWP 14972 "node /var/www/w"):
    #0  0x00007fe0b14da197 in __pthread_attr_getaffinity_new (attr=0x0, cpusetsize=393, cpuset=0x0) at ./nptl/pthread_attr_getaffinity.c:35
    #1  0x0000000000000000 in ?? ()
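
For reference, here is a minimal sketch of a safer variant of the cache in item 1. It stores a wrapper object instead of attaching `_timer` to the image, debounces eviction by clearing the previous timer, and returns the cached Promise directly instead of recursing. The `get` helper, the `lts` value, and the `touch` name are placeholders standing in for my real code:

    const { loadImage } = require("@napi-rs/canvas");

    // Placeholder for the HTTP helper in the snippet above (assumption:
    // it resolves to bytes that loadImage accepts).
    const get = (url) =>
      fetch(url).then((r) => r.arrayBuffer()).then((b) => Buffer.from(b));

    const lts = 60_000; // idle TTL — placeholder value
    const cache = new Map(); // url -> { promise, timer }

    // Debounce eviction: drop the previous timer before arming a new one,
    // so each entry has at most one pending eviction timer.
    const touch = (url, entry) => {
      clearTimeout(entry.timer);
      entry.timer = setTimeout(() => cache.delete(url), lts);
      entry.timer.unref();
    };

    const getImg = (url) => {
      let entry = cache.get(url);
      if (!entry) {
        // Cache a wrapper, not the Image itself: pending and resolved
        // states are handled uniformly, and nothing is ever attached to
        // the native `img` object.
        entry = { promise: get(url).then(loadImage), timer: undefined };
        entry.promise.catch(() => cache.delete(url)); // don't cache failures
        cache.set(url, entry);
      }
      touch(url, entry);
      return entry.promise;
    };

Note that deleting the Map entry only drops one reference: while `drawImage(img)` still holds `img`, V8 cannot collect it, so eviction by itself should not free memory Skia is using. What the original snippet does do, as written, is throw `ret._timer is not a function` whenever `cache.get(url)` returns the still-pending Promise, and every `_timer()` call arms a new eviction timeout without clearing the previous one, so the earliest timeout still deletes the entry even if it was touched recently.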
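
For item 2, a sketch of the same loop that yields with `setImmediate` from `timers/promises` (available on Node 20) instead of `sleep(2)`, so it yields to the event loop without the fixed 2 ms delay; `tracePath` and `yieldToLoop` are illustrative names:

    const { setImmediate: yieldToLoop } = require("timers/promises");

    async function tracePath(ctx, dots) {
      for (let i = 0; i < dots.length; i++) {
        if (i > 0 && i % 7_000 === 0) {
          // Let timers and heartbeats run. The half-built path on `ctx`
          // survives this await, so nothing else must draw on the same
          // context in the meantime.
          await yieldToLoop();
        }
        ctx.lineTo(...dots[i]);
      }
    }

One caveat with either version: awaiting inside the loop leaves the context's current path half-built across event-loop turns, so any other request that touches the same `ctx` while the loop is parked would interleave with it.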
    
valamidev commented

The `getImg` snippet in the issue body looks very suspicious. Separately, if you use this library heavily on a back end, I'd recommend restarting the service from time to time, since it still has multiple memory leaks. A minimal restart guard is sketched below.

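
A minimal sketch of such a periodic-restart guard (the 4 GiB budget and 30 s interval are placeholders): exit once RSS exceeds the budget and let the container runtime or a process manager bring the service back up.

    // Exit when resident memory exceeds a budget; the container's restart
    // policy (or a process manager) starts a fresh instance.
    const RSS_BUDGET = 4 * 1024 ** 3; // 4 GiB — placeholder

    setInterval(() => {
      if (process.memoryUsage().rss > RSS_BUDGET) {
        console.error("RSS over budget, exiting for a clean restart");
        process.exit(1);
      }
    }, 30_000).unref();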
