Reduce Infrastructure Spending #2156

Open
afrittoli opened this issue Sep 6, 2024 · 3 comments · Fixed by #2245
@afrittoli
Member

We need to monitor infrastructure spending and make sure we reduce costs where possible.
Currently, our spending breakdown looks like this:

image
@afrittoli
Member Author

The Cloud Storage spending is mostly associated with container registry image download bandwidth. We will have to migrate to Artifact Registry soon, but that will not reduce the cost.

@afrittoli
Member Author

afrittoli commented Sep 6, 2024

Areas where we could reduce spending:

  • Cloud Storage / Container Registry: move released container images to ghcr.io. Over time, this will shift traffic from gcr.io to ghcr.io as Tekton users move to the latest releases.
    • Knowing the source of the egress traffic would help optimise this work. Unfortunately, it is currently not possible to find that out from the Google Cloud monitoring tools.
    • It is possible that some or most of the traffic is generated by the CI systems of Tekton or of other projects that use Tekton. If so, moving only the new releases might already alleviate the cost.
  • Compute Engine: this is a combination of running services and jobs (CI/CD pipelines). The data needs further analysis; some initial considerations:
    • Prow: this is used by most CI jobs today, and it cannot be removed easily. The CPU/memory consumption of the control plane seems contained. The test jobs namespace accounts for most of the CPU requests and consumption. We can review existing CI jobs to see whether any further optimisation is possible. We have optimised these in the past by using kind for e2e tests, with a node pool that scales down to zero when not in use.
    • Tekton (dogfooding): this is used by all CD pipelines, so it cannot be removed easily. The CPU/memory consumption of the control plane seems contained.
    • Messaging (knative + kafka): this is used to transport events from Tekton pipelines to Tekton Event Listeners. Since we only have one consumer (the tekton-events event listener), we could remove knative + kafka and rely on 1:1 delivery of events. Kafka is one of the top consumers of memory among the deployed services, so it may be worth removing it even if that means losing some reliability in event delivery.
    • Nightly builds: these use a fair amount of CPU and memory, even if only for 30 minutes each.
  • Container Registry vulnerability scanning: we could disable this. We use minimal base images which are already scanned elsewhere, and we perform various types of security scanning on the code in each PR, which may be sufficient.
    • This is disabled now
  • Robocat cluster: today, this is used mainly to deploy services to dogfooding via Tekton pipelines. We should find an alternative deployment solution and remove the robocat cluster: Remove the robocat cluster #2255

Other services are negligible and not worth looking into for now.
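
To make the messaging trade-off above more concrete, here is a minimal sketch of what 1:1 delivery without a broker could look like. It assumes the events are CloudEvents posted over HTTP in binary content mode; the sink URL, the event type and source values, and the function names are all hypothetical and not taken from the plumbing repo. Because nothing buffers the event once Kafka is gone, the sender does a few retries itself, which only partly makes up for the lost reliability.

```python
import json
import time
import urllib.error
import urllib.request
import uuid


def build_request(sink_url, event_type, source, data):
    """Build an HTTP POST carrying one CloudEvent in binary content mode."""
    headers = {
        # Required CloudEvents context attributes, sent as ce-* headers.
        "ce-specversion": "1.0",
        "ce-id": str(uuid.uuid4()),
        "ce-type": event_type,
        "ce-source": source,
        "Content-Type": "application/json",
    }
    body = json.dumps(data).encode("utf-8")
    return urllib.request.Request(sink_url, data=body, headers=headers, method="POST")


def deliver(request, send=urllib.request.urlopen, attempts=3,
            backoff_seconds=1.0, sleep=time.sleep):
    """Send the event straight to the single consumer.

    Retries on URLError (which also covers HTTP error responses) with a
    linearly growing pause, and re-raises after the last attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            with send(request) as response:
                return response.status
        except urllib.error.URLError:
            if attempt == attempts:
                raise
            sleep(backoff_seconds * attempt)


if __name__ == "__main__":
    # Hypothetical in-cluster address of the event listener (assumption).
    request = build_request(
        "http://el-tekton-events.example.svc.cluster.local:8080",
        "dev.tekton.event.taskrun.successful.v1",
        "/example/taskrun",
        {"note": "illustrative payload"},
    )
    print(deliver(request))
```

If the listener is down for longer than the retry window, the event is lost, which is the reliability cost mentioned above.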

@afrittoli
Member Author

This PR enables removing tests from nightly builds: tektoncd/pipeline#8252

@afrittoli afrittoli self-assigned this Oct 10, 2024
afrittoli added a commit to afrittoli/plumbing that referenced this issue Oct 10, 2024
Stop sending events to kafka, send them directly to the listener.
Next step will be to deprovision the kafka cluster, to help
reducing infra costs.

Partially-fixes: tektoncd#2156

Signed-off-by: Andrea Frittoli <[email protected]>
tekton-robot pushed a commit that referenced this issue Oct 10, 2024
Stop sending events to kafka, send them directly to the listener.
Next step will be to deprovision the kafka cluster, to help
reducing infra costs.

Partially-fixes: #2156

Signed-off-by: Andrea Frittoli <[email protected]>
@afrittoli afrittoli reopened this Oct 10, 2024