Teraslice Master Restart in Node 18/14 Results in execution:analytics Timeout #3457

Closed
godber opened this issue Nov 8, 2023 · 5 comments

@godber
Member

godber commented Nov 8, 2023

With jobs running, when a Node 18 Teraslice master is restarted, it starts logging Error: Timed out after 2m, waiting for message "execution:analytics" for each job. Here is an example of the logs we see:

...
[2023-11-08T17:44:47.677Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution 8d30d3e0-f1de-4bd8-a040-a276ff75f98f is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:44:47.703Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution 28906665-2a5d-4da5-8808-287fc70309dd is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:45:03.577Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution ebb1fc42-e9e3-4631-adba-158bbe35b11a is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:47:33.024Z] ERROR: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=5PZc0y47)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28
[2023-11-08T17:47:34.017Z] ERROR: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=5PZc0y47)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28
...

To reproduce in Kubernetes, launch a job, then delete the master pod. When the master comes back up, you'll start to see these errors.

@godber
Member Author

godber commented Nov 8, 2023

A little more information:

  • Note that the logs show the execution connecting for each running job.
  • The jobs can be stopped through the Teraslice API.
  • After restarting the jobs, the errors go away.

So there is SOME communication between the master and the execution controllers.

@godber
Member Author

godber commented Nov 30, 2023

Closed in linked issue.

@godber godber closed this as completed Nov 30, 2023
@godber godber changed the title from "Teraslice Master Restart in Node 18 Results in execution:analytics Timeout" to "Teraslice Master Restart in Node 18/14 Results in execution:analytics Timeout" Nov 30, 2023
@godber
Member Author

godber commented Nov 30, 2023

We've seen this issue happen in Node 14 builds of Teraslice too, though it doesn't appear to happen every time. This was on the older code, prior to the fix in #3477.

@jeffmontagna
Contributor

jeffmontagna commented Dec 15, 2023

After deploying 0.89.0 to an engineering instance of Teraslice in k8s, this error is still present after a master restart.

  • The Teraslice cluster had several long-running jobs.
  • Deployed 0.89.0 with log_level set to debug. The cluster worked normally.
  • Updated the master to change log_level to info.
  • Two minutes after restarting, the master logged the Timed out after 2m, waiting for message error.

Error:

17:45:01.085Z ERROR teraslice: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=XnqDBE4c)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at async Promise.all (index 0)
        at async ExecutionService.getControllerStats (/app/source/packages/teraslice/dist/src/lib/cluster/services/execution.js:274:25)
        at async /app/source/packages/teraslice/dist/src/lib/cluster/services/api.js:461:31
        at async /app/source/packages/teraslice/dist/src/lib/utils/api_utils.js:44:28
  • txt/controllers returned a 500
$ curl -sS ts-eng1/v1/cluster/controllers
{
    "error": 500,
    "message": "Timed out after 2m, waiting for message \"execution:analytics\""
}
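
The stack trace above shows the timeout surfacing through Promise.all inside getControllerStats, and the endpoint returning a single 500. A minimal sketch of that failure shape follows. It is not Teraslice's actual code: withTimeout, requestAnalytics, the Controller type, and the millisecond values are all hypothetical names and numbers chosen for illustration.

```typescript
// Hypothetical sketch: NOT the Teraslice implementation. It only illustrates
// how one unanswered request can fail an aggregate built with Promise.all.

interface Stats { exId: string; processed: number }

// Hypothetical stand-in for an execution controller.
// respondAfterMs === null simulates a controller that never answers.
interface Controller { exId: string; respondAfterMs: number | null }

// Reject if the wrapped promise has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
    return new Promise<T>((resolve, reject) => {
        const timer = setTimeout(() => {
            reject(new Error(`Timed out after ${ms}ms, waiting for message "${label}"`));
        }, ms);
        promise.then(
            (value) => { clearTimeout(timer); resolve(value); },
            (err) => { clearTimeout(timer); reject(err); }
        );
    });
}

// Simulated request/response round trip to one controller.
function requestAnalytics(controller: Controller): Promise<Stats> {
    return new Promise<Stats>((resolve) => {
        const delay = controller.respondAfterMs;
        if (delay !== null) {
            setTimeout(() => resolve({ exId: controller.exId, processed: 1 }), delay);
        }
        // When delay is null, the promise is intentionally left pending.
    });
}

// Aggregate stats from every controller. Promise.all rejects as soon as
// any single request rejects, so one silent controller fails the whole call.
async function getControllerStats(controllers: Controller[], timeoutMs: number): Promise<Stats[]> {
    return Promise.all(
        controllers.map((c) => withTimeout(requestAnalytics(c), timeoutMs, "execution:analytics"))
    );
}
```

Calling getControllerStats with one controller whose respondAfterMs is null rejects with the timeout error even if every other controller answers promptly. That would be consistent with the endpoint returning one 500 for the whole request.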

@jeffmontagna jeffmontagna reopened this Dec 15, 2023
@godber
Member Author

godber commented Dec 15, 2023

Disregard the comment above. The master was upgraded to v0.89.0 but the jobs were pinned to a version that didn't have the fix.

It's worth noting that if you're restarting a master, it might take up to five minutes for execution controllers to reconnect. You'll see them in the master logs like this:

[2023-12-15T20:35:57.590Z]  INFO: teraslice/18 on teraslice-foo-master-69f58cc97-c6n9p: execution 0b871a8a-6d45-4af5-a865-af7c0f690425 is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=1ZEO5xLc)

The execution controller's retries back off until the delay between attempts hits about 5m, or something along those lines.
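
As a rough illustration of why reconnection can take that long, here is a sketch of a capped exponential backoff schedule. The thread only mentions a cap of about 5m. The 1s base delay, the doubling factor, and the function name backoffDelays are assumptions for illustration, not the execution controller's real settings.

```typescript
// Hypothetical sketch of capped exponential backoff. Only the ~5m cap comes
// from the thread; the base delay and growth factor are assumed values.
function backoffDelays(baseMs: number, factor: number, capMs: number, attempts: number): number[] {
    const delays: number[] = [];
    let delay = baseMs;
    for (let i = 0; i < attempts; i++) {
        // Never wait longer than the cap between two attempts.
        delays.push(Math.min(delay, capMs));
        delay = Math.min(delay * factor, capMs);
    }
    return delays;
}

// Assumed schedule: start at 1s, double each time, cap at 5 minutes.
const FIVE_MINUTES_MS = 5 * 60 * 1000;
const schedule = backoffDelays(1000, 2, FIVE_MINUTES_MS, 12);
console.log(schedule.map((ms) => `${ms / 1000}s`).join(", "));
```

Under these assumed values, a controller that has been retrying for a while sits at the cap, so a master that comes back just after a failed attempt could wait close to the full five minutes before that controller tries again.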

@godber godber closed this as completed Dec 15, 2023