Teraslice Master Restart in Node 18/14 Results in execution:analytics Timeout #3457

godber · 2023-11-08T18:28:24Z

With jobs running, when a Node 18 Teraslice master is restarted, it starts logging Error: Timed out after 2m, waiting for message "execution:analytics" for each job. Here is an example of the logs we see:

...
[2023-11-08T17:44:47.677Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution 8d30d3e0-f1de-4bd8-a040-a276ff75f98f is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:44:47.703Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution 28906665-2a5d-4da5-8808-287fc70309dd is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:45:03.577Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution ebb1fc42-e9e3-4631-adba-158bbe35b11a is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:47:33.024Z] ERROR: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=5PZc0y47)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28
[2023-11-08T17:47:34.017Z] ERROR: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=5PZc0y47)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28
...

To reproduce in Kubernetes, launch a job, then delete the master pod. When it comes back up, you'll start to see these errors.

godber · 2023-11-08T18:30:02Z

A little more information:

Note that you can see the execution connected for each running job.
The jobs can be stopped through the Teraslice API.
After restarting the jobs, the errors go away.

So there is SOME communication between the master and execution controllers.

godber · 2023-11-30T01:03:32Z

Closed in linked issue.

godber · 2023-11-30T20:19:33Z

We've seen this issue happen in Node 14 builds of Teraslice too, though it doesn't appear to happen every time. This was on the older code, prior to the fix in #3477.

jeffmontagna · 2023-12-15T18:22:43Z

After deploying 0.89.0 to an engineering instance of Teraslice in k8s, this error is still present after a master restart.

The Teraslice cluster had several long-running jobs.
Deployed 0.89.0 with log_level set to debug. The cluster worked normally.
Updated the master to change log_level to info.
Two minutes after restarting, the master logged the "Timed out after 2m, waiting for message" error.

Error:

17:45:01.085Z ERROR teraslice: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=XnqDBE4c)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at async Promise.all (index 0)
        at async ExecutionService.getControllerStats (/app/source/packages/teraslice/dist/src/lib/cluster/services/execution.js:274:25)
        at async /app/source/packages/teraslice/dist/src/lib/cluster/services/api.js:461:31
        at async /app/source/packages/teraslice/dist/src/lib/utils/api_utils.js:44:28

txt/controllers returned a 500

$ curl -sS ts-eng1/v1/cluster/controllers
{
    "error": 500,
    "message": "Timed out after 2m, waiting for message \"execution:analytics\""
}
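
When checking a cluster after a master restart, it can help to tell this specific failure apart from other 500s. The following is a minimal sketch, not part of Teraslice: `isAnalyticsTimeout` and the `ApiError` shape are hypothetical names, and the shape is inferred only from the response body shown above.

```typescript
// Hypothetical shape, inferred from the curl output above.
interface ApiError {
    error: number;
    message: string;
}

// Returns true when a parsed response body looks like the
// "execution:analytics" timeout reported in this issue.
function isAnalyticsTimeout(body: unknown): boolean {
    if (typeof body !== "object" || body === null) return false;
    const { error, message } = body as Partial<ApiError>;
    return (
        error === 500 &&
        typeof message === "string" &&
        message.includes('waiting for message "execution:analytics"')
    );
}

// Usage with the body returned by the curl command above.
const sampleBody: unknown = JSON.parse(
    '{"error": 500, "message": "Timed out after 2m, waiting for message \\"execution:analytics\\""}'
);
console.log(isAnalyticsTimeout(sampleBody));
```

A check like this only matches on the message text, so it would need updating if the wording of the error changes.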

godber · 2023-12-15T20:51:08Z

Disregard the comment above. The master was upgraded to v0.89.0 but the jobs were pinned to a version that didn't have the fix.

It's worth noting that if you're restarting a master, it might take up to five minutes for execution controllers to reconnect. You'll see them in the master logs like this:

[2023-12-15T20:35:57.590Z]  INFO: teraslice/18 on teraslice-foo-master-69f58cc97-c6n9p: execution 0b871a8a-6d45-4af5-a865-af7c0f690425 is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=1ZEO5xLc)

The execution controller's retries back off until the interval reaches about 5m, or something along those lines.
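
To illustrate why reconnection can take several minutes, here is a rough sketch of a capped exponential backoff schedule. This is not the execution controller's actual reconnect code: `backoffDelays`, the 1s starting delay, and the doubling factor are all assumptions chosen for illustration, and only the roughly 5m ceiling comes from the comment above.

```typescript
// Hypothetical helper: computes the wait before each retry attempt,
// multiplying the delay by `factor` each time and never exceeding `maxMs`.
function backoffDelays(
    initialMs: number,
    factor: number,
    maxMs: number,
    attempts: number
): number[] {
    const delays: number[] = [];
    let delay = initialMs;
    for (let i = 0; i < attempts; i++) {
        delays.push(Math.min(delay, maxMs));
        delay = Math.min(delay * factor, maxMs);
    }
    return delays;
}

// Assumed values: start at 1s, double each time, cap at 5 minutes.
const FIVE_MINUTES_MS = 5 * 60 * 1000;
const delays = backoffDelays(1000, 2, FIVE_MINUTES_MS, 12);
console.log(delays.map((ms) => `${ms / 1000}s`).join(", "));
```

Under these assumed values, a controller that has been retrying for a while sits at the cap, so a master that comes back just after a failed attempt may wait close to the full 5 minutes before that controller reconnects.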

godber added bug pkg/teraslice labels Nov 8, 2023

godber assigned jsnoble, busma13 and sotojn Nov 8, 2023

sotojn mentioned this issue Nov 27, 2023

Node 18 master pod restart fix #3477

Merged

godber closed this as completed Nov 30, 2023

godber changed the title ~~Teraslice Master Restart in Node 18 Results in execution:analytics Timeout~~ Teraslice Master Restart in Node 18/14 Results in execution:analytics Timeout Nov 30, 2023

jeffmontagna reopened this Dec 15, 2023

godber closed this as completed Dec 15, 2023
