EG with two kubernetes replicas and Nginx Operator fail to use the second replicas #1209

pyverdon · 2022-12-05T14:21:20Z

Description

Hello,
My Kubernetes cluster uses an Nginx Operator. The EG ingress annotations are set as shown below via the EG Helm chart.
My goal is to run two active/active replicas, with sharding based on the incoming IP address.

With a single EG pod: the websocket works fine, EG can be reached, and a python_kubernetes kernel is spawned in a namespace.
With two replicas and an nginx hash on remote_addr: the websocket only ever reaches one pod (in my tests). All new kernels are created on the same pod and are not balanced to the second one. Killing the first pod and starting a new session with a new kernel does reach the second pod, but fails with a kernel message indicating that the kernel "appears to have been culled or died unexpectedly".
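
The behaviour described above is roughly what hashing on the client address would be expected to produce when all tests come from one machine. The following is a minimal sketch of that idea only: it is not nginx's actual hashing algorithm, and the `pick_replica` function and the pod names are hypothetical.

```python
import hashlib
import ipaddress
from typing import Sequence


def pick_replica(remote_addr: str, replicas: Sequence[str]) -> str:
    """Deterministically map a client address to one replica (illustrative only)."""
    # Packed bytes of the address, loosely analogous to $binary_remote_addr.
    packed = ipaddress.ip_address(remote_addr).packed
    digest = hashlib.sha256(packed).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["eg-pod-0", "eg-pod-1"]  # hypothetical pod names

# Every request from a single client address lands on the same replica.
choices = {pick_replica("10.0.0.7", replicas) for _ in range(100)}
print(choices)

# Requests from many different client addresses can spread across replicas.
spread = {pick_replica(f"10.0.0.{i}", replicas) for i in range(1, 50)}
print(spread)
```

Under this model, a test driven from a single client IP would only ever exercise one pod, and traffic would move to the other pod only once the first is removed from the pool.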

Please note that this is my first attempt. My next step will be to use persisted sessions.

Reproduce

Deploy EG on Kubernetes with the Helm chart and the ingress annotations below:

ingress:
  enabled: true
  nginx:
    enabled: true
    
  hostName: "enterprise-gateway.mypocurl.com" # whether to expose by setting a host-based ingress rule, default is *
  # The .spec.ingressClassName behavior has precedence over the deprecated kubernetes.io/ingress.class annotation.
  # https://kubernetes.github.io/ingress-nginx/
  ingressClassName: "nginx-admin"
  pathType: "Prefix"

  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$1
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "false"
    nginx.org/websocket-services: "enterprise-gateway"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/upstream-hash-by: "$binary_remote_addr"
  path: /gateway/?(.*)      # URL context used to expose EG
  tls:
  - hosts:
    - enterprise-gateway.mypocurl.com

[W 2022-12-05 14:39:27.163 ServerApp] 500 POST /api/sessions?1670247563419 (127.0.0.1): HTTP 404: Not Found (Session not found: "Kernel 'c9ffae52-cf4e-47d7-a5ac-3b46c4bd48fd' appears to have been culled or died unexpectedly, invalidating session 'f34ee9d4-a105-4cfd-9e69-360152b6d0d7'. The session has been removed.")
[W 2022-12-05 14:39:27.164 ServerApp] wrote error: 'HTTP 404: Not Found (Session not found: "Kernel \'c9ffae52-cf4e-47d7-a5ac-3b46c4bd48fd\' appears to have been culled or died unexpectedly, invalidating session \'f34ee9d4-a105-4cfd-9e69-360152b6d0d7\'. The session has been removed.")'
[E 2022-12-05 14:39:27.164 ServerApp] {
      "Host": "localhost:8888",
      "Accept": "*/*",
      "Referer": "http://localhost:8888/lab/tree/Untitled.ipynb",
      "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0"
    }
[E 2022-12-05 14:39:27.164 ServerApp] 500 POST /api/sessions?1670247563419 (127.0.0.1) 3742.07ms referer=http://localhost:8888/lab/tree/Untitled.ipynb
[W 2022-12-05 14:39:28.049 ServerApp] Kernel af1ef98b-dda5-4e3c-9445-5fe89aed534b no longer active - probably culled on Gateway server.

Context

JEG version 3.0.0
Spark version: 3.3.0
Kubernetes 1.23
Nginx version: 1.19.9


welcome · 2022-12-05T14:21:24Z

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template, as it helps other community members contribute more effectively.

You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

kevin-bates · 2022-12-05T17:16:56Z

Hi @pyverdon - thanks for using EG, especially with a look toward HA.

> With two replica and a nginx hash on remote_addr: the websocket works only to one pod (from my test). All new kernel are created on the same pod and are not balanced to the second one. Killing the first pod and starting a new session with a new kernel will reach the second pod, but will failed with a kernel message indicating that the "kernel have been culled or died unexpectedly".

Do you mean "node" where "pod" is used here?

> My next move will be to use persisted session

Yes, no HA-based functionality will work without persisted sessions. Please see our docs regarding availability modes.

Also note that your clients (i.e., notebook sessions) may need to issue a kernel "reconnect" request to get the kernel re-synced. This, along with potential culling functionality, is also referenced in that section of the docs. The culling timeout will essentially reset when a kernel's management changes between EG pods.
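
To illustrate why the second pod reports the kernel as culled or dead when sessions are not persisted, here is a toy model. It is only a sketch and does not use EG's real classes or configuration: the `Replica` class, the dictionary-based store, and the pod names are all hypothetical stand-ins.

```python
import uuid


class Replica:
    """Toy stand-in for one gateway replica (not EG's actual implementation)."""

    def __init__(self, name, shared_store=None):
        self.name = name
        # Without a shared store, each replica only knows the kernels it started.
        self.sessions = shared_store if shared_store is not None else {}

    def start_kernel(self):
        kernel_id = str(uuid.uuid4())
        self.sessions[kernel_id] = {"started_by": self.name}
        return kernel_id

    def has_kernel(self, kernel_id):
        return kernel_id in self.sessions


# Case 1: no persisted sessions, so each replica keeps private in-memory state.
pod_a, pod_b = Replica("eg-pod-0"), Replica("eg-pod-1")
kernel_id = pod_a.start_kernel()
isolated_lookup = pod_b.has_kernel(kernel_id)

# Case 2: both replicas read and write the same (persisted) session store.
store = {}
pod_c, pod_d = Replica("eg-pod-0", store), Replica("eg-pod-1", store)
shared_kernel_id = pod_c.start_kernel()
shared_lookup = pod_d.has_kernel(shared_kernel_id)

print(isolated_lookup, shared_lookup)  # prints: False True
```

In the first case the surviving replica has no record of the kernel, which is consistent with the "culled or died unexpectedly" message in the log above; in the second case the other replica can find the kernel and take over its management.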

pyverdon added the bug label Dec 5, 2022

kevin-bates added configuration performance & scalability and removed bug labels Dec 5, 2022
