feat: Add some autoreboot on UKI for failures #3041

Open
Tracked by #2128
Itxaka opened this issue Nov 27, 2024 · 6 comments
Assignees
Labels
blocked enhancement New feature or request

Comments

@Itxaka
Member

Itxaka commented Nov 27, 2024

Now that we have boot assessment, we need to add some auto reboot in certain scenarios for UKI so that boot assessment is more useful.

If the boot completes with no changes and systemctl status reports the system as running, the current entry is marked as GOOD and the boot assessment counter is removed from that entry.

There may be occasions in which we want to test or validate something before the entry is marked as good.

It should be documented how to add those services and checks so users can provide their own.

A simple test could be to run k3s and check that it is up and running and that the node is active; if not, we don't mark the system as good and we reboot.

The way to do it is explained in https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT/:

To support additional components that shall succeed before the boot is considered successful, simply place them in units (if they aren’t already) and order them before the generic boot-complete.target target unit, combined with Requires= dependencies from the target, so that the target cannot be reached when any of the units fail. You may add any number of units like this, and only if they all succeed the boot entry is marked as good. Note that the target unit shall pull in these boot checking units, not the other way around.

Depending on the setup, it may be most convenient to pull in such units through normal enablement symlinks, or during early boot using a [generator](https://www.freedesktop.org/software/systemd/man/systemd.generator.html), or even during later boot. In the last case, care must be taken to ensure that the start job is created before boot-complete.target has been reached.
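As a rough illustration of the quoted approach, the sketch below writes a check unit ordered before boot-complete.target and pulls it in from the target through a `boot-complete.target.requires/` symlink. The unit name `example-boot-check.service`, the command path, and the choice of `FailureAction=reboot` for the auto reboot are assumptions made for this example, not something Kairos ships.

```python
from pathlib import Path

# Hypothetical unit name and check command, used only for illustration.
CHECK_UNIT = "example-boot-check.service"
CHECK_COMMAND = "/usr/local/bin/example-boot-check"

UNIT_TEMPLATE = """\
[Unit]
Description=Example boot check (illustrative)
# Ordered before the target so the target waits for this unit's result.
Before=boot-complete.target
# Assumption: reboot when the check fails so the next boot attempt starts.
FailureAction=reboot

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart={command}
"""


def write_boot_check(unit_dir: Path, unit_name: str = CHECK_UNIT,
                     command: str = CHECK_COMMAND) -> Path:
    """Write the check unit and make boot-complete.target require it."""
    unit_dir.mkdir(parents=True, exist_ok=True)
    unit_path = unit_dir / unit_name
    unit_path.write_text(UNIT_TEMPLATE.format(command=command))

    # The target pulls in the check unit, not the other way around.
    requires_dir = unit_dir / "boot-complete.target.requires"
    requires_dir.mkdir(exist_ok=True)
    link = requires_dir / unit_name
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(Path("..") / unit_name)
    return unit_path
```

The same function could be called from a generator with the output directory systemd passes as the first argument, or run once at image build time against a unit directory such as `/etc/systemd/system`.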

requirements:

  • Provide a simple service that checks k3s status and continues if it's working, or fails if it's not.
  • Add an auto reboot service for when the system is marked as bad
  • Document how this was done so users can provide their own
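A minimal sketch of what the k3s check from the first requirement could look like, assuming the `k3s kubectl` wrapper is on the PATH and that "working" means every node reports the `Ready` condition as `True`. The function names and the 60 second timeout are made up for this example.

```python
import json
import subprocess
import sys


def all_nodes_ready(nodes_json: str) -> bool:
    """Return True when at least one node exists and every node is Ready."""
    items = json.loads(nodes_json).get("items", [])
    if not items:
        return False
    for node in items:
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            return False
    return True


def main() -> int:
    """Exit code for the service: 0 lets the boot continue, 1 fails the unit."""
    try:
        result = subprocess.run(
            ["k3s", "kubectl", "get", "nodes", "-o", "json"],
            capture_output=True, text=True, timeout=60, check=True,
        )
        return 0 if all_nodes_ready(result.stdout) else 1
    except (OSError, ValueError, subprocess.SubprocessError) as err:
        # Missing binary, timeout, non-zero exit, or unparsable output.
        print(f"k3s check could not complete: {err}", file=sys.stderr)
        return 1

# When installed as a script, the entry point would be: sys.exit(main())
```

A real check would probably need to retry for a while, since k3s may still be starting when the unit runs; that is left out here to keep the sketch short.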

All these services should not be part of Kairos itself or the framework, but done as an example and added to Kairos testing, if possible, to test that the auto assessment works as expected. Kairos itself should not be opinionated in this case (maybe in others, but not here), as this is mainly an example of how to add checks. Adding some extra checks and auto restarts may come down the line, but it's not part of this ticket.

@Itxaka Itxaka added enhancement New feature or request triage Add this label to issues that should be triaged and prioretized in the next planning call labels Nov 27, 2024
@jimmykarily jimmykarily moved this to Todo 🖊 in 🧙Issue tracking board Dec 2, 2024
@jimmykarily jimmykarily moved this from Todo 🖊 to In Progress 🏃 in 🧙Issue tracking board Dec 2, 2024
@jimmykarily jimmykarily removed the triage Add this label to issues that should be triaged and prioretized in the next planning call label Dec 2, 2024
@jimmykarily jimmykarily moved this from In Progress 🏃 to Todo 🖊 in 🧙Issue tracking board Dec 2, 2024
@Itxaka Itxaka moved this from Todo 🖊 to In Progress 🏃 in 🧙Issue tracking board Dec 3, 2024
@Itxaka Itxaka self-assigned this Dec 3, 2024
@Itxaka
Member Author

Itxaka commented Dec 10, 2024

Putting this on hold.

Seems like it doesn't work as expected, so testing is not possible in its current state.

basically:

  • enable boot assessment
  • select a broken entry (passive+3.conf) as the default entry in loader.conf
  • it boots, counts down, the service fails, it auto-restarts (good)
  • systemd-boot tries to find passive+3.conf, but now it's called passive+2-1.conf
  • that fails, so it selects the next entry (active+3.conf) by default

Which is weird because, according to bootctl, you can select entries by ID, which is the config name minus the boot assessment part, but that doesn't work either. Even when selecting it with bootctl set-default passive.conf, which does not fail and selects the proper entry, on boot it again chooses the active entry by default, even though bootctl reports passive as the selected default entry.
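To make the renaming concrete, here is a small sketch of the boot counting file name convention described above (`+LEFT` becoming `+LEFT-DONE` after an attempt) and of the ID obtained by dropping the counter. The function names are made up for this example, and what happens once no tries are left is a simplification rather than a model of what systemd-boot actually does.

```python
import re

# Matches "<stem>+<left>[-<done>]<suffix>", e.g. "passive+3.conf"
# or "passive+2-1.conf".
_COUNTER = re.compile(
    r"^(?P<stem>.*)\+(?P<left>\d+)(?:-(?P<done>\d+))?(?P<suffix>\.[^.]+)$"
)


def entry_id(filename: str) -> str:
    """Return the entry name with the boot counter stripped."""
    m = _COUNTER.match(filename)
    return filename if m is None else m["stem"] + m["suffix"]


def after_boot_attempt(filename: str) -> str:
    """Return the file name after one more boot attempt is counted."""
    m = _COUNTER.match(filename)
    if m is None:
        return filename  # no counter, nothing to rename
    left = int(m["left"])
    done = int(m["done"] or 0)
    if left == 0:
        # Sketch choice: leave exhausted entries untouched.
        return filename
    return f"{m['stem']}+{left - 1}-{done + 1}{m['suffix']}"
```

With this, `after_boot_attempt("passive+3.conf")` gives `passive+2-1.conf`, while `entry_id()` of either name gives `passive.conf`, which is why a default that refers to the literal file name stops matching after the first attempt.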

I reported and asked about this on the systemd-devel mailing list to shed some light, because I'm baffled by this and not sure if we are doing something wrong, the systemd-boot/bootctl behaviour is wrong somehow, or I'm just not understanding it: https://lists.freedesktop.org/archives/systemd-devel/2024-December/050993.html

@Itxaka Itxaka moved this from In Progress 🏃 to Todo 🖊 in 🧙Issue tracking board Dec 10, 2024
@Itxaka
Member Author

Itxaka commented Dec 10, 2024

Seems to be already reported upstream, and this behaviour is indeed wrong: systemd/systemd#31215

@Itxaka
Member Author

Itxaka commented Dec 10, 2024

Maybe systemd/systemd#35529

@Itxaka
Member Author

Itxaka commented Dec 12, 2024

Gonna try applying the upstream patch directly to our bundled systemd-boot to see if we can fix this ourselves while we wait for upstream to accept it. As it should only affect sd-boot and we control it, it may be possible.

@Itxaka
Member Author

Itxaka commented Dec 12, 2024

no, the patch just breaks booting loool

@Itxaka
Member Author

Itxaka commented Jan 14, 2025

So we are kind of blocked here until the upstream patches get in or we can find a different approach.

@Itxaka Itxaka added the blocked label Jan 14, 2025
Projects
Status: Todo 🖊
Development

No branches or pull requests

2 participants