Teraslice Master Restart in Node 18/14 Results in execution:analytics Timeout #3457
Comments
A little more information ....
So there is SOME communication between the master and execution controllers.
Closed in linked issue.
We've seen this issue happen in Node 14 builds of Teraslice too. It doesn't appear to happen all the time, though. This was on the older code, prior to the fix in #3477.
After deploying
Error:
Disregard the comment above. The master was upgraded to
It's worth noting that if you're restarting a master, it might take up to five minutes for execution controllers to reconnect. You'll see them in the master logs like this:
The execution controller's retries back off until the delay hits a cap of about 5m, or something along those lines.
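To illustrate why a reconnect can take up to five minutes, here is a minimal sketch of a capped exponential backoff. It is not taken from the Teraslice source: the function names, the 1s base delay, and the doubling factor are all assumptions, and only the roughly 5m cap comes from the comment above.

```typescript
// Hypothetical values for illustration only; not Teraslice's actual settings.
const BASE_DELAY_MS = 1000;            // assumed first retry delay
const FACTOR = 2;                      // assumed growth factor per attempt
const MAX_DELAY_MS = 5 * 60 * 1000;    // ~5m cap, as described in the thread

// Delay before reconnect attempt number `attempt` (0-based):
// grows exponentially, then stays at the cap.
function reconnectDelayMs(attempt: number): number {
    const uncapped = BASE_DELAY_MS * Math.pow(FACTOR, attempt);
    return Math.min(uncapped, MAX_DELAY_MS);
}

// The first `attempts` delays, in order.
function reconnectSchedule(attempts: number): number[] {
    const delays: number[] = [];
    for (let i = 0; i < attempts; i++) {
        delays.push(reconnectDelayMs(i));
    }
    return delays;
}
```

Under these assumed values the delay reaches the cap by the tenth attempt, so an execution controller that has been retrying for a while could wait close to the full five minutes after the master returns before it tries again.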
With jobs running, when a Node 18 Teraslice master is restarted, it starts experiencing
Error: Timed out after 2m, waiting for message "execution:analytics"
for each job. Here is an example of the logs we see:
To reproduce in Kubernetes, launch a job, then delete the master pod. When it comes back up, you'll start to see these errors.