Secondary manifest not sent in some cases #89
Comments
Glad y'all fixed that. Sorry, the
Yeah, there are some danger zones where the only really safe option is manual recovery. I also deal with this on occasion. What would your ideal recovery mechanism be? Is it a hammer that works around Uptane? In any case, also consider raising the point more directly with the Uptane brain trust. For now I'll tag @JustinCappos and @trishankkarthik in case they have opinions.
I no longer have access to the ticket, unfortunately, but I agree: the comment implies we were aware that we needed to inform the Director somehow so that a decision could then be made on what to do. That would be a good first step, but it still doesn't actually address recovery.
Honestly, we kind of hit a dead end on this in our own discussion before coming here. If a secondary ECU can't send a manifest due to some issue, then it's probably fair to say that the ECU requires recovery outside of Aktualizr. The case where the Director never gets any indication or signal is what we were more interested in. In our discussion we went back and forth on whether it is even correct for Aktualizr to be doing anything in this case, or whether this needs to be handled implicitly server-side. @tkfu and @cajun-rat had their own opinions about this, though, especially @tkfu, who manages the server side of our stack.
Yeah, my thinking on this is that it has to be the back-end's responsibility to deal with it somehow. Fundamentally, we must reckon with the possibility that the primary is lying to us. Our only reliable source of truth for what is installed on a secondary is the signed version report from the secondary itself.
Now, managed secondaries aren't "true" secondaries, of course. There's no realistic scenario where aktualizr-primary is compromised, but a managed secondary is intact and trustable. (For anyone from Uptane reading this who might not be familiar with aktualizr: managed secondaries are "fake" or virtual secondaries. They're run entirely on the Linux-based primary.) So in principle, one could imagine a server-side workaround for this issue where you just decide to trust the installation report from the primary, thus closing out the assignment and allowing the managed secondary to re-register with a new key or whatever. But that just sounds like an awful idea. Even if you could make the argument that it's not a violation of the standard (since managed secondaries aren't part of an Uptane system anyway), it would be a pretty bizarre contortion of the server side just to allow for slightly easier remediation of what ought to be a very rare case.
So I think the approach we'll take on the server side is to clean up our error reporting: if the primary reports success on all secondaries, but can't prove it via a signed version report from each secondary that had an assignment, we should disregard the report from the primary, and warn the repository owner that one of their secondaries isn't reporting in.
I do have some opinions on how aktualizr's reporting behaviour could be improved, but I'm going to separate those out into another comment.
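To make the server-side rule above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `PrimaryReport`, `evaluate_report`, and the field names are hypothetical and do not reflect the real director code or the actual manifest schema. It also assumes signature verification of each version report has already happened elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimaryReport:
    # Hypothetical, simplified view of what the primary sent; not the real schema.
    claims_success: bool
    # Serials of ECUs whose signed version report was present AND verified.
    verified_ecus: frozenset


def evaluate_report(report, assigned_ecus):
    """Return (trust_report, warnings) for one report from a primary.

    A success claim is only trusted if every ECU that had an assignment
    is backed by a verified, signed version report.
    """
    missing = sorted(set(assigned_ecus) - report.verified_ecus)
    if report.claims_success and missing:
        warnings = [
            f"ECU {serial} had an assignment but sent no valid signed version report"
            for serial in missing
        ]
        return False, warnings
    return True, []
```

With this shape, a report claiming success for `["primary-1", "secondary-1"]` while only `primary-1` has a verified version report would be disregarded, and the repository owner would get one warning naming `secondary-1`.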
Aktualizr's behaviour could be better here. We have three information channels relevant to the progress/status of an update assignment:
The problem that arose here is that the secondary was able to send (3) even though it didn't have a key, and aktualizr dutifully forwarded those reports to the server. Subsequently, aktualizr also assembled and signed a manifest that included an installation_report indicating success based on those untrusted events. So, on this basis, the change I'd like to see in aktualizr would be to either:
Both options are complicated in their own way, though, and I'd certainly understand if we just decided it's not a priority to implement either one.
If you start signing events, it means both the secondaries and the server also have to change, to start generating/accepting/validating those signed events.
If you implement more error checking around the installation_report, there are a bunch of annoying cases to properly think through: for example, if the secondary doesn't respond to a request for its version report, it could be for a number of reasons, including temporary unavailability, so we wouldn't want to report a failure until some kind of timeout occurred or whatever, and that's an opinionated choice that has its own pitfalls.
[1] If aktualizr cannot fulfill an update assignment, it sends an installation_report indicating the failure, and it includes a correlation_id that director can use to decide what to do next. Naïvely, we might expect that director should continue to send targets metadata indicating the desired target for all ECUs, but that isn't very useful as a practical matter: we need richer information so that director can decide whether to tell the device to attempt the installation again or not. We don't want the vehicle to get stuck in a loop where it just keeps on trying to install an update, failing every time. You could have all of the retry logic on the client side, and/or attempt to embed a policy engine inside director targets metadata, but both of those options suck in their own way.
And finally, as for how to recover from this, I think it's clear that recovery is not something aktualizr can be responsible for. If an ECU can't sign version reports, it needs to be replaced, and repository owners already have (or at least should already have) a plan for how to replace ECUs, hopefully following our guidance in the deployment best practices.
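On the timeout point above, a rough sketch of what "don't report a failure until some kind of timeout occurred" could look like. This is purely illustrative: `ReportStatus`, `classify_secondary`, and the one-hour default grace period are all made up here, and picking that value is exactly the opinionated choice mentioned above.

```python
from enum import Enum


class ReportStatus(Enum):
    OK = "ok"            # a version report arrived
    WAITING = "waiting"  # silent, but still within the grace period
    FAILED = "failed"    # silent for longer than the grace period


def classify_secondary(got_version_report, seconds_unresponsive, grace_period_s=3600.0):
    """Decide how to treat a secondary when assembling the installation_report.

    The grace period is a hypothetical policy knob, not an aktualizr setting.
    """
    if got_version_report:
        return ReportStatus.OK
    if seconds_unresponsive < grace_period_s:
        return ReportStatus.WAITING
    return ReportStatus.FAILED
```

A secondary that is merely rebooting would sit in `WAITING` for a while instead of immediately producing a failure report, at the cost of delaying the report for a secondary that is genuinely broken.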
Okay, glad to hear the recovery mechanism is out of scope here, or at least less important. As to the reporting:
This is my preferred option, but it is not perfect for the exact reason you specify. I wish the trigger for the installation report was from the Secondary's version report, not the events. Conceptually what we do now doesn't make sense. The events are not part of Uptane, are purely informational, and shouldn't be used for Uptane-related decision-making.
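For illustration, here is roughly what deriving the installation result from version reports rather than events could look like. This is a sketch under assumptions: `installation_result` and its dictionary shapes are hypothetical, not aktualizr's actual data structures, and the version reports are assumed to have had their signatures verified already. Events are deliberately not an input.

```python
def installation_result(assigned_targets, verified_reports):
    """Derive the installation outcome from verified version reports only.

    assigned_targets: dict of ECU serial -> expected image hash
    verified_reports: dict of ECU serial -> installed image hash, taken from
                      a version report whose signature already checked out
    """
    unconfirmed = [
        ecu
        for ecu, expected in sorted(assigned_targets.items())
        if verified_reports.get(ecu) != expected
    ]
    return {"success": not unconfirmed, "unconfirmed_ecus": unconfirmed}
```

A secondary without a usable key would never appear in `verified_reports`, so it would end up in `unconfirmed_ecus` no matter what events it sent.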
This would be great. I wish the server had some sort of mechanism for recognizing these error scenarios and reporting them to the user somehow.
Yeah, agreed, please don't do that. Ignoring errors is never good, even when they "shouldn't happen".
Yes, this is what currently happens in several cases, and it's really annoying and wasteful.
Can you elaborate? We don't generally have that problem, precisely because the installation reports with correlationId allow director to cancel the assignment once aktualizr has reported it as a failure. Is it because you're mostly working with devices that use aktualizr-lite, and thus have no director to talk to?
Yes, this is one such situation. However, I seem to recall from the days using the Director that we'd get situations where certain errors were not sent upstream effectively, and installations would be repeated. Maybe I'm thinking too far back and we'd fixed that in the meantime. I wouldn't trust that every error scenario in the Secondary gets reported correctly, though, as the OP's situation indicates. We have a lot of tests for these things, but I doubt we cover everything.
There was an issue with a customer of ours where the keys for their secondary got zeroed out. As a result, when an update was sent out, it could never complete: without the proper keys, the secondary could not produce a valid signed manifest for itself, so the manifest sent to the server was incomplete. The director rightfully held the status for this device as "update pending", since it never got a complete manifest that marked the update as "done". Manual intervention was required to get the device out of this stuck pending state.
The issue with keys getting zeroed out was solved here: #82
But this still sparked the discussion on what to do in general if a secondary can't produce a manifest for whatever reason, since the device would then end up in the same state and require manual intervention. There is also an in-code comment suggesting this behavior has been brought up before: https://github.com/uptane/aktualizr/blob/master/src/libaktualizr/primary/sotauptaneclient.cc#L353
After discussing with @cajun-rat and @tkfu, we decided to bring this here for additional discussion, in the hope that there is some way to improve this behavior that still aligns with the principles of Uptane.