Make the execution controller restartable on kafka in k8s #1007
I think the question of "How do workers find the restarted execution controller?" is answered by the use of k8s services. The workers already access the execution controller through a service, and this service will route traffic to any execution controller pod that matches specific criteria.
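To illustrate the point, here's a minimal sketch of a worker connecting through a stable Service DNS name using socket.io-client directly (Teraslice wraps this in its messaging layer; the service name and port are made up for the example):

```typescript
import { io } from 'socket.io-client';

// Hypothetical service DNS name: because each execution controller is
// exposed through a k8s Service, this name stays stable across pod restarts.
const controllerUrl = 'http://ts-exc-example-job-abc123:45680';

const socket = io(controllerUrl, {
    reconnection: true,              // keep retrying while the new pod comes up
    reconnectionDelay: 1000,         // ms between attempts
    reconnectionAttempts: Infinity,
});

socket.on('disconnect', () => {
    // old pod died; the Service routes the next connect to its replacement
});

socket.on('connect', () => {
    // resumed against whichever pod now backs the Service
});
```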
Also, this doesn't have to be discussed or resolved now. I just wanted to get the issue posted, since it would be a huge win in kubernetes.
There are a number of things that would need to change in the execution controller, particularly around the execution state and what happens when it restarts. The execution should update its state to “paused,” a new “suspended” state, or something similar. The slicer would have to be able to specify that it stores no critical state and can be restarted. The workers will need to be notified by the execution controller that it is “suspended” so that they don't shut down when the execution controller shuts down.
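For the "slicer declares it is restartable" piece, a minimal sketch of what the contract could look like; the `isRestartable()` name comes from the PR linked at the bottom of this thread, while the surrounding class shape is assumed:

```typescript
// Sketch only: not Teraslice's actual base class. The method name
// `isRestartable()` is from the linked PR; everything else is assumed.
abstract class Slicer {
    // Default to false: assume a slicer keeps critical in-memory state
    // unless it explicitly opts in to being restarted.
    isRestartable(): boolean {
        return false;
    }

    // Produce the next slice request, or null when the slicer is done.
    abstract slice(): Promise<unknown | null>;
}
```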
This basically requires HA support for execution controllers.
At this point we're experimenting with how things work and will come up with a proposed solution. After Joseph does some experimenting and code reading, we should all talk with Jared, since he has a broader and deeper understanding of the existing implementations.
After looking into this and testing some jobs using the kubernetesV2 backend, I feel I have an understanding of what would be lost in memory and the effects of restarting an execution pod. Here is a list of data in memory that the execution controller will lose when restarted:
I next want to list all the slicers and what will be lost in memory. From what I tested, it seems that in the case of an execution restart, the slicer will boot up as if it were its first time starting and begin from the beginning.
Just to confirm and document the behavior of restarting an execution controller, I created a job file:
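The job file itself didn't survive in the comment; a representative persistent kafka job for this kind of test might look like the following, where the topic, group, and op layout are assumptions rather than the original file:

```typescript
// Hypothetical stand-in for the missing job file: a persistent job that
// reads from kafka and discards the records.
const job = {
    name: 'kafka-restart-test',
    lifecycle: 'persistent',
    workers: 1,
    assets: ['kafka', 'standard'],
    operations: [
        {
            _op: 'kafka_reader',
            topic: 'restart-test',        // assumed topic
            group: 'restart-test-group',  // assumed consumer group
        },
        { _op: 'noop' },
    ],
};

export default job;
```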
Midway through running the execution, I force-killed it. The worker logs show this when the ex_controller pod dies:
It will briefly disconnect, reconnect, and continue. I've verified that the consumer group commits its offsets and finishes before stopping the job.
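One way to check the offset-commit claim from outside teraslice is to inspect the consumer group directly. This sketch uses kafkajs purely for illustration (it is not what the kafka asset uses internally), and the broker and group names are placeholders:

```typescript
import { Kafka } from 'kafkajs';

// Placeholders: point at the test cluster and the job's consumer group.
const kafka = new Kafka({ brokers: ['localhost:9092'] });
const admin = kafka.admin();

async function showCommittedOffsets(): Promise<void> {
    await admin.connect();
    // fetchOffsets returns the group's committed offset per partition;
    // comparing these before the kill and after the restart shows whether
    // the job resumed from where the group left off.
    const offsets = await admin.fetchOffsets({
        groupId: 'restart-test-group',
        topics: ['restart-test'],
    });
    console.dir(offsets, { depth: null });
    await admin.disconnect();
}

showCommittedOffsets().catch(console.error);
```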
Note that I killed the ex_controller in a way that would not allow a graceful shutdown.
Yeah, that's an important point.
What I need to do is modify the execution shutdown code to check whether the execution status has been changed to the new “paused”/“suspended” style of state discussed above.
One of the main goals here is to allow the execution controller to restart without bringing down the job. We have two scenarios we want to support for when an execution controller shuts down: because its process/pod is being killed, or because the job itself is being stopped.
Let's concentrate on solving those "process shutdown" problems without confusing them with "job shutdown", for Kafka only, since that's the easy case. Then we can deal with slicers that may need changing. There's also an open question we'll need to come back to.
I have made changes to the execution controller so that it will shut down in one of two ways in the event of a `SIGTERM`:
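The enumeration got cut off above, but the branching is roughly the following; every name and helper here is illustrative, not the actual teraslice internals, so they're declared as stubs:

```typescript
// Illustrative only: the two shutdown paths on SIGTERM. All helpers are
// hypothetical, declared as stubs so the sketch type-checks.
declare const slicer: { isRestartable(): boolean };
declare function wasShutdownRequestedByClusterMaster(): Promise<boolean>;
declare function exitWithoutFinalizing(): Promise<void>;
declare function gracefulShutdown(): Promise<void>;

process.on('SIGTERM', async () => {
    // Did the cluster master ask this job to stop, or is k8s just
    // evicting/restarting the pod?
    const jobInitiated = await wasShutdownRequestedByClusterMaster();

    if (!jobInitiated && slicer.isRestartable()) {
        // Pod-level termination: leave the execution state marked as
        // still running so a replacement pod can pick it back up from
        // the kafka consumer group's committed offsets.
        await exitWithoutFinalizing();
    } else {
        // Real job shutdown: flush state, mark the execution completed
        // or stopped, and let the workers shut down as well.
        await gracefulShutdown();
    }
});
```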
I did some tests with this new logic on a persistent job that reads from kafka.
I've gone through the four primary assets that we maintain and have listed all the slicers:

elasticsearch-assets:

file-assets:

standard-assets:

kafka-assets:
I had a realization that we're kind of confusing process shutdown with job shutdown. I think we will introduce a new execution status to keep the two cases separate.
I have resolved most of the issues related to V2 and have applied all of this logic only to V2. It should not change how V1 operates.
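For reference, gating on the backend can be as simple as checking the cluster manager type from the system config. `cluster_manager_type` is a real teraslice sysconfig key, but the exact access path and the stubs below are assumptions:

```typescript
// Sketch: only take the restartable-shutdown path on the V2 backend.
declare const context: {
    sysconfig: { teraslice: { cluster_manager_type: string } };
};
declare const slicer: { isRestartable(): boolean };

const useRestartableShutdown =
    context.sysconfig.teraslice.cluster_manager_type === 'kubernetesV2'
    && slicer.isRestartable();
```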
I've just realized that you're going to have to make changes outside of the kubernetes backend, so sticking to the `kubernetesV2` backend alone won't be possible.
…3740) This PR makes the following changes:

## New features

- Teraslice running in `KubernetesV2` will have the ability to restart in a new pod automatically under certain conditions.
  - These conditions include receiving a `SIGTERM` signal not associated with a job shutdown, and the slicer being restartable.
- Added a new function, `isRestartable()`, to the base slicer.
  - `isRestartable()` returns a `boolean` that tells whether the slicer is compatible with the restartable feature. This allows for an initial implementation for `kafka` slicers without having to worry about the complexity of other slicers.

refs: #1007
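So per the PR, a kafka slicer opts in with something like the following sketch; only `isRestartable()` is named in the PR notes, and the rest of the class shape is assumed (building on the base-class sketch earlier in the thread):

```typescript
// Sketch of the kafka opt-in the PR describes; not the actual kafka-assets
// slicer implementation.
class KafkaSlicer extends Slicer {
    isRestartable(): boolean {
        // Offsets are tracked by the kafka consumer group on the broker,
        // not in slicer memory, so a replacement pod can resume safely.
        return true;
    }

    async slice(): Promise<unknown | null> {
        // Placeholder: the real kafka slicer hands out lightweight
        // per-worker slices rather than stateful ranges.
        return null;
    }
}
```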
I think it should be possible to make the execution controller restartable in the case of kafka readers in k8s. I guess there's the question of persisting the in-memory state of the execution controller, but it can't be that much in the case of kafka.
@peterdemartini maybe you can point me in the right direction. Or maybe this should be for you to do.