KIT Guest Clusters ETCD pods unable to become ready when using Karpenter v0.11.0+ #241

Open
njtran opened this issue Jun 30, 2022 · 2 comments
Labels: 0.2 alpha, bug (Something isn't working)

njtran commented Jun 30, 2022

When using Karpenter v0.12.1, a Provisioner with a small ttlSecondsAfterEmpty can result in Karpenter removing the node that an ETCD replica is scheduled to. The associated PVC for the ETCD pod then never binds to its volume.
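
For reference, a minimal sketch of the kind of Provisioner described above (the name, TTL value, and requirements here are illustrative, not taken from this report):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default                  # hypothetical name
spec:
  # A small ttlSecondsAfterEmpty means Karpenter terminates a node soon
  # after it observes no non-daemonset pods running on it.
  ttlSecondsAfterEmpty: 30
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  # Cloud-provider specifics (subnets, security groups, etc.) omitted.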

The initial thought is that, since Karpenter no longer pre-binds pods to nodes as of v0.11.0, this may introduce undesired behavior for the EBS volumes backing the ETCD pods.
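
For context, the kit-gp3 StorageClass backing these claims presumably uses WaitForFirstConsumer volume binding, which is what makes the scheduler record a selected node on the PVC before the volume is provisioned. A sketch of such a class (the provisioner and parameters are assumptions, not copied from the cluster):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kit-gp3
provisioner: ebs.csi.aws.com          # assumed EBS CSI driver
parameters:
  type: gp3
reclaimPolicy: Delete
# Binding is deferred until a pod using the claim is scheduled, so the
# PVC becomes tied to whichever node the scheduler picked at that time.
volumeBindingMode: WaitForFirstConsumer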

The nodes for the PVCs for ETCD 1 and 2 were never deleted. Even though Karpenter brought up a replacement node for ETCD 0, the pod and its PVC never bound:

➜  k get pvc
NAME                                 STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
etcd-data-kit-guest-cluster-etcd-0   Pending                                                                        kit-gp3        23m
etcd-data-kit-guest-cluster-etcd-1   Bound     pvc-f993bc10-327d-4414-a3b2-acc95d956ab0   40Gi       RWO            kit-gp3        23m
etcd-data-kit-guest-cluster-etcd-2   Bound     pvc-f781eb15-4bfb-4395-b2c1-52b53c54a63a   40Gi       RWO            kit-gp3        23m

➜  k get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                             STORAGECLASS   REASON   AGE
pvc-f781eb15-4bfb-4395-b2c1-52b53c54a63a   40Gi       RWO            Delete           Bound    tekton-tests/etcd-data-kit-guest-cluster-etcd-2   kit-gp3                 22m
pvc-f993bc10-327d-4414-a3b2-acc95d956ab0   40Gi       RWO            Delete           Bound    tekton-tests/etcd-data-kit-guest-cluster-etcd-1   kit-gp3                 22m

These are the scheduling events shown when describing the stuck ETCD pod:

  Warning  FailedScheduling  20m                 default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: failed to get node "ip-192-168-86-66.us-west-2.compute.internal": node "ip-192-168-86-66.us-west-2.compute.internal" not found
  Warning  FailedScheduling  16m (x2 over 18m)   default-scheduler  (combined from similar events): 0/5 nodes are available: 1 node(s) didn't find available persistent volumes to bind, 2 node(s) didn't have free ports for the requested pod ports, 2 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  15m (x6 over 19m)   default-scheduler  0/5 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 Too many pods, 2 node(s) didn't have free ports for the requested pod ports, 2 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  15m (x6 over 19m)   default-scheduler  0/5 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 2 node(s) didn't have free ports for the requested pod ports, 2 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  11m (x5 over 19m)   default-scheduler  0/4 nodes are available: 2 node(s) didn't have free ports for the requested pod ports, 2 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  97s (x20 over 18m)  default-scheduler  0/5 nodes are available: 1 node(s) didn't find available persistent volumes to bind, 2 node(s) didn't have free ports for the requested pod ports, 2 node(s) didn't match Pod's node affinity/selector.

tzneal commented Jun 30, 2022

From what I've seen, this is caused by the PVC having a selected-node annotation that points to a node that has since been deleted.

running PreBind plugin "VolumeBinding": binding volumes: failed to get node "ip-192-168-86-66.us-west-2.compute.internal": node "ip-192-168-86-66.us-west-2.compute.internal" not found
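
One way to confirm this, and (if safe for the workload) to work around it, is to inspect the volume.kubernetes.io/selected-node annotation on the stuck PVC and clear it so the scheduler can pick a node that still exists. This is a sketch using the namespace and PVC name from the output above, not a tested fix:

# Show the node the PVC is pinned to (the annotation is set by the
# scheduler for WaitForFirstConsumer volumes).
kubectl -n tekton-tests get pvc etcd-data-kit-guest-cluster-etcd-0 -o yaml | grep selected-node

# Remove the stale annotation; the trailing '-' deletes it.
kubectl -n tekton-tests annotate pvc etcd-data-kit-guest-cluster-etcd-0 \
  volume.kubernetes.io/selected-node-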

prateekgogia (Member) commented

So if the initial node is not deleted by Karpenter, does it work?

prateekgogia added the 0.2 alpha label on Aug 2, 2022