Node 18 master pod restart fix #3477
Conversation
Since there is no test coverage for this, does your manual testing show controller analytics continuing to work?
I have curled the […]
I've manually confirmed that these changes allow the Teraslice master to be restarted without causing the messaging system to break. I am still able to hit […]. Here's some hastily gathered notes on my steps:

```
yarn k8s
earl assets deploy local --bundle terascope/elasticsearch-assets
earl assets deploy local --bundle terascope/standard-assets
earl tjm register local examples/jobs/data_generator.json
earl tjm start examples/jobs/data_generator.json
```

```
kubectl get namespaces | grep dev1
services-dev1   Active   12m
ts-dev1         Active   5m12s
```

```
kubectl -n ts-dev1 get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
teraslice-master-84d4c87c7b-9vhqr                      1/1     Running   0          5m41s
ts-exc-data-generator-bce93c1e-d1db-lhqln              1/1     Running   0          2m16s
ts-wkr-data-generator-bce93c1e-d1db-5d9d8f7bb6-qczz5   1/1     Running   0          2m14s
```

```
kubectl -n ts-dev1 logs -f teraslice-master-84d4c87c7b-9vhqr | bunyan
kubectl -n ts-dev1 delete pod teraslice-master-84d4c87c7b-9vhqr
pod "teraslice-master-84d4c87c7b-9vhqr" deleted
```

And we see the errors.

Teardown:

```
kind delete cluster -n k8se2e
git checkout examples/jobs/data_generator.json
```

Repeat with the modified code, and it works.
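For anyone repeating this, here is a scriptable version of that curl check — a minimal sketch only, assuming the master API's default port 5678 is port-forwarded to localhost (the service name in the comment is a guess) and that Node 18's global `fetch` is available:

```typescript
// Hypothetical spot-check for controller analytics after a master restart.
// Assumes something like:
//   kubectl -n ts-dev1 port-forward svc/teraslice-master 5678:5678
// so the master API is reachable locally.
const resp = await fetch('http://localhost:5678/v1/cluster/controllers');
if (!resp.ok) {
    throw new Error(`master returned HTTP ${resp.status}`);
}
// Each entry is an execution controller with the analytics it reports to
// the master; stale or missing entries after the restart would indicate
// the messaging breakage this PR fixes.
console.log(await resp.json());
```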
Please bump the patch level version on the teraslice package and I will merge this.
I have fixed a bug that happens on Node 18 and is addressed here: #3457. The issue was that the execution client has a variable called `serverShutdown` that gets set to `true` when the master pod is told to shut down. But when the master pod is booted back up and reconnects with the client, the client can no longer send execution analytics to the master pod, which raises an error on the client and also causes a timeout error on the master pod server. This PR now sets `serverShutdown` back to `false` when the execution client reconnects to the master pod, fixing the issue.
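A minimal sketch of that flag handling (illustrative only — `ExecutionClient`, the event names, and `sendExecutionAnalytics` are assumptions standing in for the real messenger code, not the actual Teraslice source):

```typescript
import { EventEmitter } from 'events';

// Models an execution client that stops sending analytics once the server
// announces shutdown, plus the fix: clearing that flag on reconnect.
class ExecutionClient extends EventEmitter {
    // Set when the master pod tells clients it is shutting down.
    private serverShutdown = false;

    constructor() {
        super();
        this.on('server:shutdown', () => {
            this.serverShutdown = true;
        });
        // The fix: a restarted master pod triggers a reconnect, so the old
        // shutdown state no longer applies and must be cleared.
        this.on('reconnect', () => {
            this.serverShutdown = false;
        });
    }

    sendExecutionAnalytics(stats: Record<string, unknown>): void {
        if (this.serverShutdown) {
            // Before the fix, this branch was taken forever after a master
            // restart, so analytics never reached the server and the
            // server-side request timed out.
            return;
        }
        // ... send the stats over the socket ...
        this.emit('execution:analytics', stats);
    }
}
```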