Fix context issue during cleanup of kind clusters #6771
base: main
Conversation
ci/kind/kind-setup.sh
Outdated
creationTimestamp=$(kubectl get nodes --context kind-$kind_cluster_name -o json -l node-role.kubernetes.io/control-plane | \
for context in $(kubectl config get-contexts -o name | grep 'kind-'); do
cluster_name=$(echo $context | sed 's/^kind-//')
if docker ps --format '{{.Names}}' | grep -q "$cluster_name"; then
Does this list all containers for the given cluster name: docker ps --format '{{.Names}}' | grep -q "$cluster_name"?
Suggested change:
if docker ps --format '{{.Names}}' | grep -q "$cluster_name"; then
if docker ps --format '{{.Names}}' | grep '$cluster_name'; then
We don't need to list the cluster names; we just want to check whether they are present or not, so -q is required here.
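(For illustration, a minimal sketch of the point about -q; the cluster name below is hypothetical, not from the PR:)
# grep -q prints nothing and only sets the exit status, which is all the `if` needs.
# "kind-example" is a made-up cluster name.
if docker ps --format '{{.Names}}' | grep -q "kind-example"; then
  echo "a container matching kind-example is running"
else
  echo "no matching container found"
fi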
ci/kind/kind-setup.sh
Outdated
creationTimestamp=$(kubectl get nodes --context kind-$kind_cluster_name -o json -l node-role.kubernetes.io/control-plane | \
for context in $(kubectl config get-contexts -o name | grep 'kind-'); do
cluster_name=$(echo $context | sed 's/^kind-//')
if docker ps --format '{{.Names}}' | grep -q "$cluster_name"; then
Why are we relying on this docker command instead of kind get clusters?
And kind get nodes can also be used to list all node names.
We are not relying on kind get clusters because it also lists clusters whose creation is not yet completed or successful (i.e. still in progress), which creates a problem: when we try to get the contexts of such clusters, it returns an error.
We may encounter this error with aborted jobs and with multiple jobs running in parallel.
In the aborted-job case, suppose we abort the job as soon as cluster creation starts: kind get clusters will still show the cluster from the aborted job in its list, but since its context is not available, the cleanup will panic and the job will fail.
In the parallel-jobs case, suppose two jobs are running in parallel, one has just triggered the cluster creation step, and the other triggers the cleanup function: the cleanup will list the cluster that is not yet created, try to fetch the context for that cluster, and the job will fail because of the panic.
Sometimes the testbed also becomes unhealthy because of these stale clusters left in the environment.
@jainpulkit22 Thanks for the explanation. My concern is whether docker ps --format '{{.Names}}' | grep -q "$cluster_name" is a sufficient basis to determine whether a cluster is ready. Is there something like a status.conditions field in the context that would allow us to accurately determine if the cluster is ready?
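(A sketch of the kind of readiness check being asked about, assuming the context's API server is reachable; the context name is hypothetical:)
# Print each node's Ready condition status ("True" when the node is ready).
# "kind-example" is a made-up context name.
kubectl get nodes --context kind-example \
  -o jsonpath='{range .items[*]}{.metadata.name}={.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'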
I was trying to point out that at this stage we are already iterating over known contexts (for context in $(kubectl config get-contexts -o name | grep 'kind-')), so we know that the context exists. It is not the same as the previous solution, which was first calling kind get clusters and then calling kubectl config get-contexts.
In your comment and in the original issue, you talk about a "panic", but it's not clear to me what failure you are referring to exactly.
Note that even with your proposed solution, races seem possible:
- job 1 calls kubectl config get-contexts and gets context1
- job 2 calls kubectl config get-contexts and gets context1
- job 1 calls docker ps and finds the container it is looking for
- job 2 calls docker ps and finds the container it is looking for
- job 1 calls kubectl get nodes --context context1 ... successfully
- job 1 deletes the kind cluster successfully
- job 2 calls kubectl get nodes --context context1 ... but context1 does not exist anymore!
Either the code should be best effort and tolerate failures, or maybe you should use a lock (see flock) for correctness, and then you don't need to worry about concurrent executions and you can use the most appropriate commands for the task.
In your comment and in the original issue, you talk about a "panic", but it's not clear to me what failure you are referring to exactly.
The failure I talked about in the issue can be explained with the following example.
Suppose there are two jobs running in parallel, JobA and JobB. JobA has just triggered cluster creation, and JobB is in its initial cleanup phase. When JobB runs the kind get clusters command, it will list the cluster whose creation JobA has just started, and then when it runs kubectl get nodes --context ... it will panic because the context does not exist yet (the cluster creation is not yet completed).
Now, coming to the race condition you mentioned: I don't think it will occur in practice, because there will be a few seconds of lag between two parallel runs, so if one job has already deleted the cluster, docker ps will not list that container.
So the race condition you describe seems possible in theory but not in practice.
I still believe that the docker command is unnecessary:
We are not relying on kind get clusters because it also lists clusters whose creation is not yet completed or successful, which creates a problem: when we try to get the contexts of such clusters, it returns an error.
The outer loop is still for context in $(kubectl config get-contexts -o name | grep 'kind-'); do, so we already have the context.
it will panic
Do you mean a Go panic in kubectl?
So the race condition you describe seems possible in theory but not in practice.
2 jobs can be triggered at the same time. We should probably be tolerant of failures. If the kubectl command (kubectl get nodes) fails or the kind command fails (kind delete cluster --name $cluster_name), we don't have to fail the script.
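(A minimal sketch of the tolerant behaviour described above, assuming the existing per-context loop; the exact commands in kind-setup.sh may differ:)
# Skip the context instead of failing the whole script if the API server is unreachable.
if ! kubectl get nodes --context "$context" -o json > /dev/null 2>&1; then
  echo "Could not query nodes for $context, skipping"
  continue
fi
# Tolerate a failed deletion too, e.g. if a concurrent job already deleted the cluster.
kind delete cluster --name "$cluster_name" || echo "Cluster $cluster_name could not be deleted"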
I still believe that the docker command is unnecessary:
The docker command is necessary here because sometimes the contexts are present in kubectl config get-contexts but the cluster has been deleted or is still in the creation step.
The outer loop is still for context in $(kubectl config get-contexts -o name | grep 'kind-'); do, so we already have the context.
If there is any leftover cluster that was not deleted or created successfully because the job was aborted or terminated, its context may still be present even though there is no running cluster for it. So, just to be careful during deletion, I used an extra check with the docker command.
Do you mean a Go panic in kubectl?
Yes.
2 jobs can be triggered at the same time. We should probably be tolerant of failures. If the kubectl command (kubectl get nodes) fails or the kind command fails (kind delete cluster --name $cluster_name), we don't have to fail the script.
Okay, we can have fault tolerance for this.
Updated the code; this ensures fault tolerance. I haven't used flock because I think it would affect parallel job runs: if one job is in cluster creation or is running e2e and the other has acquired flock over the script file, then it will not be able to read further data.
Force-pushed from c7011d3 to 057ba01
ci/kind/kind-setup.sh
Outdated
timeout=$(($UNTIL_TIME_IN_MINS*60))
if [[ $diff -gt $timeout ]]; then
echo "=== Cluster ${cluster_name} is present for more than $UNTIL_TIME_IN_MINS minutes ==="
kind delete cluster --name $cluster_name
Considering the race condition: if one job retrieves the cluster name using kubectl config get-contexts -o name before kind delete cluster deletes the same cluster in another job, will kubectl get nodes --context $context get stuck, or will it immediately return an error because the cluster has already been deleted?
2 jobs can be triggered at the same time. We should probably be tolerant of failures. If the kubectl command (kubectl get nodes) fails or the kind command fails (kind delete cluster --name $cluster_name), we don't have to fail the script.
@antoninbas I feel we still need a lock to fully avoid race conditions and ensure the cluster information remains consistent during timestamp calculations and cluster deletion 🤔️.
I tend to agree. Honestly, I think the core issue is that we try to use kubeconfig and other sources of truth to determine which clusters to delete. We rely on some specific behavior from the tools and we have to tolerate failures. It is not very important in the grand scheme of things, as long as we don't corrupt anything or cause jobs to fail while garbage-collecting clusters. But I do feel like a lot of pain could be avoided by maintaining our own file with a list of clusters, instead of relying on kind or kubeconfig.
Before creating a cluster: write the cluster name and creation timestamp to a file, using a file lock:
: "${CLUSTER_NAME:=foobar}"
flock .clusters.lock --command "echo \"$CLUSTER_NAME $(date +%s)\" >> .clusters"
# kind create cluster --name $CLUSTER_NAME
When doing garbage collection, use the file as the only source of truth, and also use an exclusive file lock:
(
  flock -x 200
  current_timestamp=$(date +%s)
  # Make sure the swap file exists even if no entries end up being kept.
  : > .clusters.swp
  while IFS=' ' read -r name creationTimestamp; do
    if [[ -z "$name" || -z "$creationTimestamp" ]]; then
      continue
    fi
    # Calculate the time difference
    time_difference=$((current_timestamp - creationTimestamp))
    # Check if the creation happened more than 1 hour ago (3600 seconds)
    if (( time_difference > 3600 )); then
      echo "The creation of $name happened more than 1 hour ago."
      kind delete cluster --name "$name" || echo "Cluster could not be deleted"
    else
      echo "The creation of $name happened within the last hour."
      echo "$name $creationTimestamp" >> .clusters.swp
    fi
  done < .clusters
  mv .clusters.swp .clusters
) 200>>.clusters.lock
IMO, correctness is more "obvious" this way, and it's not more brittle than relying on kubeconfig.
The directory where we write .clusters and .clusters.lock needs to persist across Jenkins jobs.
What do you think @XinShuYang @jainpulkit22?
IMO, correctness is more "obvious" this way, and it's not more brittle than relying on kubeconfig. The directory where we write .clusters and .clusters.lock needs to persist across Jenkins jobs. What do you think @XinShuYang @jainpulkit22?
Agree. To ensure persistence, the directory for .clusters and .clusters.lock can be set to WORKDIR, which is consistent across different Jenkins jobs. We can add a new parameter for this path in the script, making it adaptable to different environments.
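(One possible shape for such a parameter, sketched with hypothetical variable names; WORKDIR is assumed to be exported by the Jenkins environment:)
# Allow the state directory to be overridden, defaulting to the Jenkins-provided WORKDIR.
CLUSTERS_DIR="${CLUSTERS_DIR:-${WORKDIR:-$HOME}}"
CLUSTERS_FILE="$CLUSTERS_DIR/.clusters"
CLUSTERS_LOCK="$CLUSTERS_DIR/.clusters.lock"
# Hypothetical invocation: CLUSTERS_DIR=/var/lib/jenkins ./ci/kind/kind-setup.sh ...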
Instead of creating a separate file named ".clusters" in the testbed, I have an alternative: we can acquire the lock on the default kubeconfig file, so that the other process will not be able to fetch the contexts until the lock is released; this can prevent the race condition too.
What's your take on this @antoninbas @XinShuYang
Instead of creating a separate file named ".clusters" in the testbed, I have an alternative: we can acquire the lock on the default kubeconfig file, so that the other process will not be able to fetch the contexts until the lock is released; this can prevent the race condition too. What's your take on this @antoninbas @XinShuYang
I am okay with placing this lock in the .kube folder, but could you verify whether THIS_DIR is the correct path shared by different jobs? Additionally, I recommend using trap to release the lock in case the section it protects exits unexpectedly, to prevent a deadlock issue.
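(A sketch of the trap idea, assuming the lock takes the form of a plain lock file that must be removed; the next comment notes that with flock itself this cleanup is not needed:)
# Remove the lock file on any exit path so an aborted job cannot leave a stale lock behind.
LOCK_FILE="$HOME/.kube/config.lock"   # path assumed for illustration
cleanup_lock() {
  rm -f "$LOCK_FILE"
}
trap cleanup_lock EXIT INT TERM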
okay
This seems to be based on internal knowledge of K8s and Kind (I assume they use config.lock, otherwise this doesn't make sense), which I am not a fan of.
Additionally, I recommend using trap to release the lock in case the section it protects exits unexpectedly, to prevent a deadlock issue.
This should not be required: the file will be closed if the process stops, and the lock will always be released. As a matter of fact, the manual states that using unlock (flock -u) is very rarely needed.
This should not be required: the file will be closed if the process stops, and the lock will always be released. As a matter of fact, the manual states that using unlock (flock -u) is very rarely needed.
@antoninbas Yes, thanks for correcting me. It won't cause any issues once the job exits.
I forgot to mention another concern: the commands inside the lock might run into a long hang or get stuck. Although we do have a global timeout set for the Jenkins job, that timeout is usually quite long. This means that if the locked section stalls, other jobs have to wait for that long timeout period.
Do you think it's beneficial to set a shorter timeout specifically for the locked section, so that if it doesn't complete in a reasonable time, we can release the lock and exit the job earlier to let other jobs proceed?
We could have a timeout (using flock -w), and if we fail to acquire the lock within a reasonable time, we can skip the cleanup? Does that help?
I don't know if any of the commands are likely to stall in the locked section (once we acquire the lock). Rather than make the code too complicated, maybe we could see if any issue occurs in practice?
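(A sketch of the flock -w idea, reusing the .clusters.lock file from the earlier example; the 60-second budget is arbitrary:)
(
  # Give up on the cleanup if the lock cannot be acquired within 60 seconds.
  if ! flock -w 60 200; then
    echo "Could not acquire cleanup lock within 60s, skipping cleanup"
    exit 0
  fi
  # ... existing cleanup commands run here while the lock is held ...
) 200>>.clusters.lock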
Force-pushed from 057ba01 to 8612cb2
Force-pushed from 8612cb2 to e84b983
LOCK_FILE="$THIS_DIR/.kube/config.lock"
(
flock -x 200
@jainpulkit22 I'm a bit confused. At first I thought this would work because .kube/config.lock is also used by kind / kubectl. See https://github.com/kubernetes-sigs/kind/blob/1c5a56b30145d98a61c0538975ca0894ad65e0d0/pkg/cluster/internal/kubeconfig/internal/kubeconfig/lock.go#L27
But then I ran some tests, and it seems to me that flock and the Go code used to acquire the lock (os.OpenFile(lockName(filename), os.O_CREATE|os.O_EXCL, 0)) are not "compatible". For example, a bash script using flock to acquire an exclusive lock on ~/.kube/config.lock can execute concurrently with a kind command that acquires an exclusive lock on the same file using the code I linked to above.
This is the code I used for testing:
test.go
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "time"
)

// these are from
// https://github.com/kubernetes/client-go/blob/611184f7c43ae2d520727f01d49620c7ed33412d/tools/clientcmd/loader.go#L439-L440
func lockFile(filename string) error {
    // Make sure the dir exists before we try to create a lock file.
    dir := filepath.Dir(filename)
    if _, err := os.Stat(dir); os.IsNotExist(err) {
        if err = os.MkdirAll(dir, 0755); err != nil {
            return err
        }
    }
    f, err := os.OpenFile(lockName(filename), os.O_CREATE|os.O_EXCL, 0)
    if err != nil {
        return err
    }
    f.Close()
    return nil
}

func unlockFile(filename string) error {
    return os.Remove(lockName(filename))
}

func lockName(filename string) string {
    return filename + ".lock"
}

func main() {
    if err := lockFile("/tmp/.clusters"); err != nil {
        panic("Failed to lock")
    }
    fmt.Println("Sleeping...")
    time.Sleep(10 * time.Second)
    fmt.Println("Done")
    if err := unlockFile("/tmp/.clusters"); err != nil {
        panic("Failed to unlock")
    }
}
test.sh
(
  flock -x 200
  echo "SLEEPING..."
  sleep 10
  echo "DONE"
) 200>>/tmp/.clusters.lock
Could you confirm that you tested this approach successfully? Maybe there is an issue in my test programs.
If the 2 locking schemes are not compatible, I don't think this patch will have the intended effect, as we can have a concurrent call to kind create cluster which will mutate the kubeconfig file?
It also seems that because flock will create the file and not delete it, kind will fail to acquire the lock later (file already exists). I tried and got this error from kind:
ERROR: failed to create cluster: failed to lock config file: open /Users/abas/.kube/config.lock: file exists
Finally, even if the locking schemes were compatible, I am not sure that the current version would be sufficient, as kind create cluster will only acquire the lock while it's writing the config: therefore, maybe we could still have a context in the config for which the control plane is not ready yet?
I have explicitly added the command to remove the lock file. Other than that, it seems compatible to me, because when I tried both scripts together locally, the create script waited for the lock to be released by the delete script.
Regarding your second question, that we could still have a context in the config for which the control plane is not ready yet: that won't be a problem for us, because the output of the creation timestamp command is piped to true, so the script won't panic and will run as usual.
I have explicitly added the command to remove the lock file. Other than that, it seems compatible to me, because when I tried both scripts together locally, the create script waited for the lock to be released by the delete script.
Please share your steps, this is not what I observe in my case. test.sh and test.go can execute concurrently (test.sh is supposed to represent the cleanup step in kind-setup.sh here, using the same flock command).
I didn't use test.go for my local testing. I did the following:
I created a test.sh script similar to the one above and executed the kind create cluster command in one shell and the test.sh script in another shell, and I didn't get any error.
I did this because in the actual test scenario we are also only making calls to the kind create cluster command.
I created a test.sh script similar to the one above and executed the kind create cluster command in one shell and the test.sh script in another shell, and I didn't get any error.
That doesn't show anything.
You would need the following test perhaps:
- run test.sh with a long sleep while the exclusive lock is held (a few minutes maybe)
- try running kind create cluster in a second terminal
If kind create cluster can succeed before test.sh releases the lock, then you know it's not working.
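(A sketch of that check, assuming the bash side locks the same file as the PR's cleanup section, i.e. ~/.kube/config.lock; the cluster name and sleep length are arbitrary:)
# Terminal 1: hold an exclusive flock for 5 minutes.
(
  flock -x 200
  echo "Holding the lock for 5 minutes..."
  sleep 300
) 200>>"$HOME/.kube/config.lock"
# Terminal 2, while the above is still sleeping:
#   kind create cluster --name locktest
# If the cluster is created before the lock is released, the two locking schemes do not interact.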
Signed-off-by: Pulkit Jain <[email protected]>
Force-pushed from e84b983 to 560fe65
Fix context issue during cleanup of kind clusters.
Fixes #6768.