Node 18 master pod restart fix #3477

Merged · 3 commits merged into master from node-18-timeout-fix on Nov 28, 2023

Conversation

sotojn (Contributor) commented Nov 27, 2023

I have fixed a bug that occurs on Node 18 and is reported in #3457.

The issue was that the execution client has a variable called serverShutdown that is set to true when the master pod is told to shut down. When the master pod boots back up and reconnects with the client, that flag is never reset, so the client can no longer send execution analytics to the master pod, causing the following error:

[2023-11-27T17:49:38.468Z] ERROR: teraslice/10 on ts-exc-kafka-to-es-e052b800-fee7-6vwqv: Client ClusterMaster is not ready (assignment=execution_controller, module=messaging:client, worker_id=_Q15YZQ9, ex_id=c3c188b1-4884-4867-bf86-c126f7b24034, job_id=e052b800-fee7-4af6-994b-3f5a4f78fd31)
    Error: Client ClusterMaster is not ready
        at Client.waitForClientReady (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:99:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Socket.<anonymous> (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:74:21)
[2023-11-27T17:49:39.815Z]  WARN: teraslice/10 on ts-exc-kafka-to-es-e052b800-fee7-6vwqv: cluster master did not record the cluster analytics (assignment=execution_controller, module=execution_analytics, worker_id=_Q15YZQ9, ex_id=c3c188b1-4884-4867-bf86-c126f7b24034, job_id=e052b800-fee7-4af6-994b-3f5a4f78fd31)
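
To make the failure mode concrete, here is a minimal TypeScript sketch of the kind of guard involved. The names are illustrative, not the actual teraslice-messaging source:

    // Sketch: the client tracks whether the server announced a shutdown
    // and refuses to send anything while that flag is set.
    class ExecutionClientSketch {
        serverShutdown = false;

        onServerShutdown(): void {
            // Set when the master pod is told to shut down...
            this.serverShutdown = true;
            // ...and, before this fix, never set back to false.
        }

        sendAnalytics(payload: Record<string, unknown>): void {
            if (this.serverShutdown) {
                // Corresponds to the "Client ClusterMaster is not ready"
                // error in the logs above.
                throw new Error('Client ClusterMaster is not ready');
            }
            console.log('analytics sent', payload);
        }
    }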

This also causes a timeout error on the master pod server:

[2023-11-08T17:44:47.677Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution 8d30d3e0-f1de-4bd8-a040-a276ff75f98f is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:44:47.703Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution 28906665-2a5d-4da5-8808-287fc70309dd is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:45:03.577Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution ebb1fc42-e9e3-4631-adba-158bbe35b11a is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:47:33.024Z] ERROR: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=5PZc0y47)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28
[2023-11-08T17:47:34.017Z] ERROR: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=5PZc0y47)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28

This PR sets serverShutdown back to false when the execution client reconnects to the master pod, fixing the issue.
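
Continuing the sketch above (still illustrative, not the literal diff), the change amounts to clearing the flag in the reconnect path:

    class FixedExecutionClientSketch extends ExecutionClientSketch {
        onReconnect(): void {
            // The master pod came back up; allow messages to flow again.
            this.serverShutdown = false;
        }
    }

    const client = new FixedExecutionClientSketch();
    client.onServerShutdown();                // master pod told to shut down
    client.onReconnect();                     // master pod boots back up
    client.sendAnalytics({ processed: 100 }); // now succeeds instead of throwing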

sotojn requested review from godber and jsnoble on Nov 27, 2023 22:36
godber (Member) commented Nov 27, 2023

Since there is no test coverage for this, does your manual testing show controller analytics (/txt/controllers) continuing to work on running jobs after a master restart?

sotojn (Contributor, Author) commented Nov 27, 2023

> Since there is no test coverage for this, does your manual testing show controller analytics (/txt/controllers) continuing to work on running jobs after a master restart?

I curled the /txt/controllers endpoint before and after the master pod shutdown and it works. I also had ts-top running in the background, which did update after the restart.

godber (Member) commented Nov 28, 2023

I've manually confirmed that these changes allow the Teraslice master to be restarted without causing the messaging system to break. I am able to still hit /txt/controllers and there are no errors in the logs.

Here are some hastily gathered notes on my steps:

# bring up the local kind-based dev cluster
yarn k8s
# deploy the asset bundles
earl assets deploy local --bundle terascope/elasticsearch-assets
earl assets deploy local --bundle terascope/standard-assets
# register and start the example job
earl tjm register local examples/jobs/data_generator.json
earl tjm start examples/jobs/data_generator.json
kubectl get namespaces | grep dev1
services-dev1        Active   12m
ts-dev1              Active   5m12s
kubectl -n ts-dev1 get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
teraslice-master-84d4c87c7b-9vhqr                      1/1     Running   0          5m41s
ts-exc-data-generator-bce93c1e-d1db-lhqln              1/1     Running   0          2m16s
ts-wkr-data-generator-bce93c1e-d1db-5d9d8f7bb6-qczz5   1/1     Running   0          2m14s
# follow the master logs, then delete the master pod to force a restart
kubectl -n ts-dev1 logs -f teraslice-master-84d4c87c7b-9vhqr | bunyan
kubectl -n ts-dev1 delete pod teraslice-master-84d4c87c7b-9vhqr
pod "teraslice-master-84d4c87c7b-9vhqr" deleted
# this curl doesn't return
curl localhost:5678/txt/controllers

And we see the errors:

[2023-11-28T00:43:01.792Z] ERROR: teraslice/17 on teraslice-master-84d4c87c7b-n8pb9: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=pYaRZ5mM)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/cluster/services/api.js:418:27
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28

Teardown

kind delete cluster -n k8se2e
git checkout examples/jobs/data_generator.json

Repeat with the modified code, and it works.

godber (Member) left a comment

Please bump the patch level version on the teraslice package and I will merge this.

sotojn (Contributor, Author) commented Nov 28, 2023

> Please bump the patch level version on the teraslice package and I will merge this.

The following packages have been bumped:

  • [8667660] Teraslice from v0.87.0 to v0.87.1
  • [5fa9368] @terascope/teraslice-messaging from v0.34.0 to v0.34.1

godber merged commit 79fa6a6 into master on Nov 28, 2023
39 checks passed
godber deleted the node-18-timeout-fix branch on November 28, 2023 15:41