Node 18 master pod restart fix #3477

Merged · 3 commits merged into master from node-18-timeout-fix on Nov 28, 2023

Conversation

sotojn (Contributor) commented Nov 27, 2023

I have fixed a bug that occurs on Node 18 and is reported in #3457.

The issue was that the execution client has a variable called serverShutdown that is set to true when the master pod is told to shut down. When the master pod boots back up and reconnects with the client, that flag is never reset, so the client can no longer send execution analytics to the master pod, causing the following error:

[2023-11-27T17:49:38.468Z] ERROR: teraslice/10 on ts-exc-kafka-to-es-e052b800-fee7-6vwqv: Client ClusterMaster is not ready (assignment=execution_controller, module=messaging:client, worker_id=_Q15YZQ9, ex_id=c3c188b1-4884-4867-bf86-c126f7b24034, job_id=e052b800-fee7-4af6-994b-3f5a4f78fd31)
    Error: Client ClusterMaster is not ready
        at Client.waitForClientReady (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:99:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Socket.<anonymous> (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:74:21)
[2023-11-27T17:49:39.815Z]  WARN: teraslice/10 on ts-exc-kafka-to-es-e052b800-fee7-6vwqv: cluster master did not record the cluster analytics (assignment=execution_controller, module=execution_analytics, worker_id=_Q15YZQ9, ex_id=c3c188b1-4884-4867-bf86-c126f7b24034, job_id=e052b800-fee7-4af6-994b-3f5a4f78fd31)
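
To make the failure mode concrete, here is a minimal TypeScript sketch of the kind of guard involved. The names are illustrative, not the actual teraslice-messaging source:

    // Sketch: the client tracks whether the server announced a shutdown
    // and refuses to send anything while that flag is set.
    class ExecutionClientSketch {
        serverShutdown = false;

        onServerShutdown(): void {
            // Set when the master pod is told to shut down...
            this.serverShutdown = true;
            // ...and, before this fix, never set back to false.
        }

        sendAnalytics(payload: Record<string, unknown>): void {
            if (this.serverShutdown) {
                // Corresponds to the "Client ClusterMaster is not ready"
                // error in the logs above.
                throw new Error('Client ClusterMaster is not ready');
            }
            console.log('analytics sent', payload);
        }
    }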

This also causes a timeout error on the master pod server:

[2023-11-08T17:44:47.677Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution 8d30d3e0-f1de-4bd8-a040-a276ff75f98f is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:44:47.703Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution 28906665-2a5d-4da5-8808-287fc70309dd is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:45:03.577Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution ebb1fc42-e9e3-4631-adba-158bbe35b11a is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:47:33.024Z] ERROR: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=5PZc0y47)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28
[2023-11-08T17:47:34.017Z] ERROR: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=5PZc0y47)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28

This PR sets serverShutdown back to false when the execution client reconnects to the master pod, fixing the issue.
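
Continuing the sketch above (still illustrative, not the literal diff), the change amounts to clearing the flag in the reconnect path:

    class FixedExecutionClientSketch extends ExecutionClientSketch {
        onReconnect(): void {
            // The master pod came back up; allow messages to flow again.
            this.serverShutdown = false;
        }
    }

    const client = new FixedExecutionClientSketch();
    client.onServerShutdown();                // master pod told to shut down
    client.onReconnect();                     // master pod boots back up
    client.sendAnalytics({ processed: 100 }); // now succeeds instead of throwing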

sotojn requested review from godber and jsnoble on Nov 27, 2023 22:36
godber (Member) commented Nov 27, 2023

Since there is no test coverage for this, does your manual testing show controller analytics (/txt/controllers) continuing to work on running jobs after a master restart?

sotojn (Contributor, Author) commented Nov 27, 2023

> Since there is no test coverage for this, does your manual testing show controller analytics (/txt/controllers) continuing to work on running jobs after a master restart?

I curled the /txt/controllers endpoint before and after the master pod shutdown and it works. I also had ts-top running in the background, which did update after the restart.

godber (Member) commented Nov 28, 2023

I've manually confirmed that these changes allow the Teraslice master to be restarted without causing the messaging system to break. I am able to still hit /txt/controllers and there are no errors in the logs.

Here are some hastily gathered notes on my steps:

# bring up the local kind-based dev cluster
yarn k8s
# deploy the asset bundles
earl assets deploy local --bundle terascope/elasticsearch-assets
earl assets deploy local --bundle terascope/standard-assets
# register and start the example job
earl tjm register local examples/jobs/data_generator.json
earl tjm start examples/jobs/data_generator.json
kubectl get namespaces | grep dev1
services-dev1        Active   12m
ts-dev1              Active   5m12s
kubectl -n ts-dev1 get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
teraslice-master-84d4c87c7b-9vhqr                      1/1     Running   0          5m41s
ts-exc-data-generator-bce93c1e-d1db-lhqln              1/1     Running   0          2m16s
ts-wkr-data-generator-bce93c1e-d1db-5d9d8f7bb6-qczz5   1/1     Running   0          2m14s
# follow the master logs, then delete the master pod to force a restart
kubectl -n ts-dev1 logs -f teraslice-master-84d4c87c7b-9vhqr | bunyan
kubectl -n ts-dev1 delete pod teraslice-master-84d4c87c7b-9vhqr
pod "teraslice-master-84d4c87c7b-9vhqr" deleted
# this curl doesn't return
curl localhost:5678/txt/controllers

And we see the errors:

[2023-11-28T00:43:01.792Z] ERROR: teraslice/17 on teraslice-master-84d4c87c7b-n8pb9: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=pYaRZ5mM)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/cluster/services/api.js:418:27
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28

Teardown

kind delete cluster -n k8se2e
git checkout examples/jobs/data_generator.json

Repeat with the modified code, and it works.

godber (Member) left a comment

Please bump the patch level version on the teraslice package and I will merge this.

sotojn (Contributor, Author) commented Nov 28, 2023

> Please bump the patch level version on the teraslice package and I will merge this.

The following packages have been bumped:

  • [8667660] Teraslice from v0.87.0 to v0.87.1
  • [5fa9368] @terascope/teraslice-messaging from v0.34.0 to v0.34.1

godber merged commit 79fa6a6 into master on Nov 28, 2023
39 checks passed
godber deleted the node-18-timeout-fix branch on November 28, 2023 15:41