
overlap with https://cluster-api.sigs.k8s.io/ #197

Open
cgwalters opened this issue Apr 22, 2024 · 15 comments

Comments

@cgwalters

There's some logical overlap here with https://cluster-api.sigs.k8s.io/, btw, that seems like it'd be good to at least think through.

As well as https://www.redhat.com/en/blog/learn-about-red-hat-peer-pods-openshift-sandboxed-containers

@cgwalters
Author

I'm coming here, btw, after seeing #194, which I was pointed at after hitting race conditions in my project's jobs.

The Cluster API code is heavily battle-tested against exactly this kind of thing: how and when to retry underlying cloud infra API requests, how to handle auth, etc.

@stuartwdouglas
Contributor

That looks like it is designed more to allocate whole clusters; can it be used to allocate individual VMs? I had a quick poke around, but it was not immediately obvious whether this was possible.

@cgwalters
Author

cgwalters commented Apr 24, 2024 via email

@brianwcook

Trying to resolve this old thread. @arewm, I remember you investigated peer pods and found that we couldn't use them (at that time, anyway). Do you remember why? It would be good to have that recorded here.

@arewm
Member

arewm commented Sep 11, 2024

I looked into both peer pods and CAPI previously. At the time that work started on the multi-platform controller, CAPI did not support provisioning resources on IBM Cloud for s390x.

While peer pods have many similarities to the multi-platform controller's architecture, there were also limitations in their support for IBM Cloud provisioning and for syncing data from PVCs. I feel like most of these issues should be resolved if we use the community version of peer pods, but they might not all be supported if we use the Red Hat version.

@cgwalters
Author

My take, though, is that even if there were missing features in one or two of those other codebases, it would be less overall long-term maintenance burden to carry a fork that adds whatever changes are needed than to maintain a completely new codebase.

Specifically, using either CAPI or peer pods we'd get support for a ton of major public clouds instead of being tied to AWS as this codebase is today.

@arewm
Member

arewm commented Sep 11, 2024

I agree. I think that we should try to reuse upstream projects within Konflux-CI instead of inventing our own solutions.

I didn't continue to look at CAPI after the initial investigation. I have been trying to keep a pulse on the use of cloud-api-adaptor from Kata (i.e. peer pods), as I feel like the approach is consistent with the one that was implemented in the multi-platform controller.

@brianwcook

@arewm I agree, we should not reinvent things when we can avoid it.

@ifireball has taken over as primary maintainer of this repo. Barak, do you want to investigate alternatives for scheduling jobs, using CAPI or peer pods? I am not worried about downstream; attack it for the Konflux community using upstream.

@cgwalters
Author

I forgot to mention this earlier, but I think it's relevant: I didn't just randomly come to this repository and look at the code. At some point I was debugging a CI failure that looked very much like a flake in the "ssh to machine to perform task" flow that is part of what this project does.

By contrast, I think a more Kubernetes-native model would look like scheduling a pod, driving it to completion, and monitoring its status asynchronously, which is what a peer-pod-like model enables.
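A minimal sketch of the model described above: a one-shot pod whose lifecycle is owned by the cluster, with a controller watching its phase rather than holding an SSH session open. All names and the image here are hypothetical placeholders, not anything from this project.

```yaml
# Sketch only: a run-to-completion pod. A controller would watch
# status.phase asynchronously for Succeeded/Failed instead of
# driving the task over SSH.
apiVersion: v1
kind: Pod
metadata:
  name: remote-build-task            # hypothetical name
spec:
  restartPolicy: Never               # run once to completion
  containers:
  - name: task
    image: registry.example.com/task-runner:latest   # placeholder image
    command: ["/bin/run-task.sh"]    # placeholder entrypoint
```
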

@ifireball
Member

ifireball commented Sep 12, 2024

TL;DR: Other solutions are on our radar, but we have things we must do before we can look more seriously at them

AFAIK the Cluster API is merely a pending API standard without solid implementations we could use at this point. Alex looked into it when he was designing our EaaS solution and decided to go with something else for now.

We had a chat with the folks working on Kata peer pods, which is probably the most relevant option for our multi-platform support. As far as I could tell, up until chatting with us they had assumed all the pods in a cluster would be running on the same architecture, and that wouldn't fit multi-arch builds.

I don't know whether they have revised their plans since.

In any case, we have some other priorities to do with stabilizing and building support around our existing solution, since it's already running in production; we can't seriously look at other solutions until we are done with those.

@brianwcook

Barak, you are helping me remember: Cluster API does look good, but in addition to being very new, it was an alpha feature that we would not be able to enable on either of our target platforms (EKS, OpenShift).

@arewm
Member

arewm commented Sep 12, 2024

I created a tracker for using kata containers and the cloud-api-adaptor in the Konflux-ci upstream: https://issues.redhat.com/browse/KONFLUX-4358.

@brianwcook

I would like to make more use of Kata; unfortunately, deployments of Konflux on AWS can't use it because most AWS VM types do not support nested virtualization.

@arewm
Member

arewm commented Sep 12, 2024

@brianwcook, as I understand it, Kata is just about running pods in a VM for further sandboxing. There are multiple modes of operation for Kata (runtime classes). One is with QEMU, which requires either bare metal or nested virtualization; this is not supported on AWS, as you indicated. Another mode of operation is peer pods, which leverages the cloud-api-adaptor to provision new VMs.

While QEMU will not work, we can still leverage Kata within our deployments with the peer-pod approach. This would just spin up new VMs in which pods run their workloads. These VMs can either be on the same architecture, to enable sandboxed environments that need elevated privileges, or on different architectures, to enable multi-architecture builds.
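As a rough illustration of the peer-pod mode described above: on a cluster where the Kata cloud-api-adaptor is deployed, a pod opts in via a RuntimeClass (commonly named `kata-remote`, though the exact name depends on the deployment). The pod name and image below are placeholders, not from this project.

```yaml
# Illustrative sketch only: assumes cloud-api-adaptor (peer pods) is
# installed and has registered a "kata-remote" RuntimeClass.
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-build              # hypothetical name
spec:
  runtimeClassName: kata-remote      # routes the pod to a freshly provisioned cloud VM
  containers:
  - name: build
    image: registry.example.com/builder:latest   # placeholder image
    command: ["make", "build"]
```

The adaptor intercepts the pod at the runtime level and backs it with a VM provisioned in the configured cloud, which is what would allow that VM, and hence the workload, to differ from the worker node's architecture.
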

@brianwcook

That makes sense. Somewhat off topic but related: we could try using peer pods for disk image generation. We are currently using the multi-platform controller for it because we need root access for filesystem operations. It still doesn't solve for multi-arch, though.
