Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xcluster; forwarder-vpp in nsm v1.8.0 crashes #388

Open
uablrek opened this issue Mar 3, 2023 · 1 comment
Open

Xcluster; forwarder-vpp in nsm v1.8.0 crashes #388

uablrek opened this issue Mar 3, 2023 · 1 comment
Labels
kind/bug Something isn't working

Comments

@uablrek
Copy link
Contributor

uablrek commented Mar 3, 2023

Describe the bug

One trench usually works, but when starting more trenches the forwarder-vpp on the nodes where the Meridio LB is deployed dives into a crash-loop.

I saw OOM-kill once when monitoring with "watch", but I am very unsure that's the root cause

Expected behavior

No crash

Context

  • Kernel: 6.2.0
  • Kubernetes: v1.26.1
  • Spire: 1.5.1
  • Network Service Mesh: v1.8.0
  • Meridio: Local build on master and v1.0.1
    ...

Logs

The last entries before re-start below:

Mar  3 07:44:54.488 [ERRO] [cmd:/bin/forwarder] signal: killed
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=168 context=22020097 data_len=19 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=4fa28a85 msg_id=816 msg_name=sw_interface_set_l2_xconnect seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=169 context=22151169 data_len=24 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=d0678b13 msg_id=818 msg_name=sw_interface_set_l2_bridge seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=170 context=22282241 data_len=74 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=863fa648 msg_id=1302 msg_name=af_packet_delete seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=171 context=22413313 data_len=19 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=4fa28a85 msg_id=816 msg_name=sw_interface_set_l2_xconnect seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=172 context=22544385 data_len=74 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=863fa648 msg_id=1302 msg_name=af_packet_delete seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=173 context=22675457 data_len=19 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=4fa28a85 msg_id=816 msg_name=sw_interface_set_l2_xconnect seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=174 context=22806529 data_len=138 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=2457116d msg_id=1373 msg_name=add_node_next seq_num=1
Mar  3 07:44:54.513 [ERRO] [cmd:/bin/forwarder] <nil>

I interpret the "signal: killed" as a failed probe which also is indicated by;

# kubectl describe pod forwarder-vpp-kksqq
...
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  8m4s                 default-scheduler  Successfully assigned default/forwarder-vpp-kksqq to vm-002
  Normal   Pulling    8m4s                 kubelet            Pulling image "ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.8.0"
  Normal   Pulled     8m1s                 kubelet            Successfully pulled image "ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.8.0" in 3.191323307s (3.191332663s including waiting)
  Warning  Unhealthy  97s (x3 over 7m59s)  kubelet            Readiness probe failed: timeout: failed to connect service "unix:///listen.on.sock" within 1s
  Normal   Created    58s (x4 over 8m)     kubelet            Created container forwarder-vpp
  Normal   Started    58s (x4 over 8m)     kubelet            Started container forwarder-vpp
  Normal   Pulled     58s (x3 over 2m5s)   kubelet            Container image "ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.8.0" already present on machine
  Warning  Unhealthy  57s                  kubelet            Readiness probe failed: command timed out
  Warning  BackOff    44s (x9 over 115s)   kubelet            Back-off restarting failed container forwarder-vpp in pod forwarder-vpp-kksqq_default(afab3247-3397-415d-a4a1-f943c279a981)

Even when everything is OK (one trench) there are error printouts in the forwarder-vpp log. The problem seem to be transient though;

Mar  3 07:39:26.091 [ERRO] [cmd:/bin/forwarder] [component:netNsMonitor] [inodeURL:inode://4/4026532665] inode 4026532665 is not found in /proc
time="2023-03-03T07:39:26Z" level=info msg="No subscription found for the notification message." msg_id=701 msg_size=19
time="2023-03-03T07:39:26Z" level=info msg="No subscription found for the notification message." msg_id=701 msg_size=19
time="2023-03-03T07:39:26Z" level=error msg="Channel ID not known, ignoring the message." channel=132 msg_id=24
time="2023-03-03T07:39:26Z" level=info msg="No subscription found for the notification message." msg_id=701 msg_size=19
time="2023-03-03T07:41:15Z" level=error msg="Channel ID not known, ignoring the message." channel=176 msg_id=24

I also got an enormous stack trace once.

@LionelJouin LionelJouin moved this to 📋 To Do in Meridio Mar 3, 2023
@LionelJouin LionelJouin added the kind/bug Something isn't working label Mar 3, 2023
@uablrek
Copy link
Contributor Author

uablrek commented Mar 3, 2023

Altering the liveness/readiness timeouts did not help. Not even removing the probes. Forwarder-vpp still crashes with "signal: killed". Since no probe fails, I don't know where the signal comes from.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
Status: 📋 To Do
Development

No branches or pull requests

2 participants