RSDK-9440 Report machine state through GetMachineStatus #4616
Still a WIP wrt testing; will leave in draft. These are my ideas so far, though.
web/server/entrypoint.go
Outdated
// and immediately start web service. We need the machine to be reachable
// through the web service ASAP, even if some resources take a long time to
// initially configure.
minimalProcessedConfig := &(*fullProcessedConfig)
maybe CopyPublicFields? might look less janky
Good idea.
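A side note on the line under review: in Go, `&(*p)` evaluates to `p` itself, not to a pointer to a copy, so clearing fields through the result would also clear them on the original. The sketch below illustrates this with a hypothetical `Config` type (an assumption for illustration, not the RDK's config type) and contrasts it with a plain struct copy.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for a processed config.
type Config struct {
	Modules   []string
	Processes []string
}

// aliasOf returns &(*c), which is the same pointer as c (no copy is made).
func aliasOf(c *Config) *Config {
	return &(*c)
}

// shallowCopyOf copies the struct value and returns a pointer to the copy.
// Slice fields still share backing arrays, but reassigning a field on the
// copy does not affect the original.
func shallowCopyOf(c *Config) *Config {
	cp := *c
	return &cp
}

func main() {
	full := &Config{Modules: []string{"m1"}, Processes: []string{"p1"}}

	alias := aliasOf(full)
	alias.Modules = nil // also clears full.Modules
	fmt.Println(alias == full, full.Modules == nil)

	full.Modules = []string{"m1"}
	minimal := shallowCopyOf(full)
	minimal.Modules = nil // full.Modules is untouched
	fmt.Println(minimal == full, len(full.Modules))
}
```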
web/server/entrypoint.go
Outdated
if err := web.RunWeb(ctx, myRobot, options, s.logger); err != nil {
	return err
}
myRobot.Reconfigure(ctx, fullProcessedConfig)
is there a risk that the config watcher would call reconfigure before this reconfigure is called?
Yes; good call out. Discussed offline a bit, we should start the config watcher goroutine only after this reconfigure is called.
That does mean that we have the same behavior as today: when the robot is starting up (both in the first robotimpl.New(minimalConfig) and in myRobot.Reconfigure(fullProcessedConfig)) no new config changes will be seen. So, if a user messes up their config and accidentally starts a module that takes forever to start up, they will not be able to quickly remove that module from their config. Instead, they'll have to restart/shutdown their robot if they want to stop the initial construction. Once again, I don't think this is different from what we have currently, and, of course, viam-server is receptive to gRPC requests earlier with the changes in this PR.
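The ordering described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `robot` type that only records the order of calls; none of these names come from the RDK. A config update that arrives during startup waits in the channel and is applied only after the full reconfigure has finished.

```go
package main

import (
	"fmt"
	"sync"
)

// robot is a hypothetical stand-in that records the order of calls it sees.
type robot struct {
	mu     sync.Mutex
	events []string
}

func (r *robot) record(event string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.events = append(r.events, event)
}

// startup performs the blocking startup steps first and only then starts the
// config watcher goroutine. The returned function waits for the watcher to exit.
func startup(r *robot, updates <-chan string) func() {
	r.record("new(minimal)")
	r.record("web service started")
	r.record("reconfigure(full)")

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for u := range updates {
			r.record("reconfigure(" + u + ")")
		}
	}()
	return wg.Wait
}

// runDemo queues an update before startup begins and returns the recorded order.
func runDemo() []string {
	updates := make(chan string, 1)
	updates <- "update-1" // arrives "during" startup and waits in the buffer
	close(updates)

	r := &robot{}
	wait := startup(r, updates)
	wait()
	return r.events
}

func main() {
	for _, e := range runDemo() {
		fmt.Println(e)
	}
}
```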
I broke a lot of tests that I'm presuming are expecting resources to be available as soon as the web service is available. Thinking about it.
cli/client_test.go
Outdated
@@ -374,7 +374,7 @@ func TestTabularDataByFilterAction(t *testing.T) {
var dataRequested bool
//nolint:deprecated,staticcheck
tabularDataByFilterFunc := func(ctx context.Context, in *datapb.TabularDataByFilterRequest, opts ...grpc.CallOption,
//nolint:deprecated
//nolint:deprecated,staticcheck
Lint won't pass locally without this for me.
robot/client/client.go
Outdated
//
// It is expected that golang SDK users will handle lack of resource
// availability due to the machine being in an initializing state themselves.
if testing.Testing() {
This is the "check" we discussed offline. See the comment above for an explanation; let me know if we're not on the same page about this logic not being required outside of a testing environment.
minimalProcessedConfig.Modules = nil
minimalProcessedConfig.Processes = nil

myRobot, err := robotimpl.New(ctx, minimalProcessedConfig, s.logger, robotOptions...)
Is this the best way of achieving what we want or the most expedient?
I'm fine with this as-is. And I'm kind of fine never coming back to think about this. But the whole "robot owns the web server" feels backwards.
There would be fewer states to consider if we could start a web service and register robots with it. And there'd be a small API that describes:
- What state the robot is in (startup or running) and
- which APIs are available, e.g:
- just "GetMachineStatus" and maybe "ResourceNames"
- but none of "SetPower"/other resource specific APIs
But in this PR we have 90-100 lines between this comment/robot.New and when the web service is started. That's a lot of lines to accidentally break our contract and add some blocking code.
Is this the best way of achieving what we want or the most expedient?
Perhaps not, and I understand your argument there. I've introduced a slightly different/simpler mechanic for controlling the "initializing" value, so that might address some of your concerns here. I didn't go so far as starting a web service and registering robots with it (if I'm understanding what you're saying.)
web/server/entrypoint.go
Outdated
// Use `fullProcessedConfig` as the initial `oldCfg` for the config watcher
// goroutine, as we want incoming config changes to be compared to the full
// config.
oldCfg := fullProcessedConfig
utils.ManagedGo(func() {
It's probably about time this lambda gets its own function/name. I think a lot of my above concern goes away if these 60 lines of control flow keyword soup are hidden by default.
Great idea; working on it + will re-request review when done.
Done; maybe.
@@ -479,7 +502,8 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
}()
defer cancel()

options, err := s.createWebOptions(processedConfig)
// Create initial web options with `minimalProcessedConfig`.
options, err := s.createWebOptions(minimalProcessedConfig)
I can't comment off-diff. The goroutine spun off above will check for diff.NetworkEqual and if not, run myRobot.StartWeb (newline 490).
Just below this we call web.RunWeb. I'm not sure what the significance is between having different methods, StartWeb and RunWeb, but assuming that's not interesting: is it possible for those two things to race? Are we guaranteed to end up with the right set of weboptions?
To clarify, this is a question about existing behavior. I don't think this patch changed anything here.
The goroutine spun off above will check for diff.NetworkEqual and if not, run myRobot.StartWeb (newline 490).
Correct; I believe there's a call to StopWeb before that happens, too. And then a Reconfigure after that StopWeb. All about handling network changes in the config.
Just below this we call web.RunWeb. I'm not sure what the significance is between having different methods, StartWeb and RunWeb, but assuming that's not interesting
It's interesting having those two methods. StartWeb starts up the web service on the robot. RunWeb does that, but also waits on <-ctx.Done(), so it's a blocking call and represents the "main" program that "runs" when you call go run web/cmd/server/main.go.
Is it possible for those two things to race? Are we guaranteed to end up with the right set of weboptions?
It depends what you mean by race. There is a lock on starting the web service, so I'm not sure we'd see a race manifest as an actual DATA RACE, but I think you are "right" in wondering about the "right set of weboptions." I'm not entirely sure, but I think RunWeb would run into an error if it tried to start the web service with an old set of options after the config watcher goroutine had started it already with a new set of options. So, my guess is we'd see an error from RunWeb and an inability to start the server in the event of the race you're describing.
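The relationship between the two methods, and the failure mode guessed at above, can be sketched like this. It is a simplified model under stated assumptions: `webService`, `errAlreadyRunning`, and the string-valued options are hypothetical, and the "already running" error mirrors the guess in this thread rather than confirmed RDK behavior.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// errAlreadyRunning is a hypothetical error for a second start attempt.
var errAlreadyRunning = errors.New("web service already running")

// webService is a hypothetical model of a web service guarded by a lock.
type webService struct {
	mu      sync.Mutex
	running bool
	options string
}

// StartWeb starts the service and returns immediately.
func (w *webService) StartWeb(options string) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.running {
		return errAlreadyRunning
	}
	w.running = true
	w.options = options
	return nil
}

// StopWeb stops the service.
func (w *webService) StopWeb() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.running = false
}

// RunWeb starts the service and then blocks until ctx is done.
func RunWeb(ctx context.Context, w *webService, options string) error {
	if err := w.StartWeb(options); err != nil {
		return err
	}
	<-ctx.Done()
	w.StopWeb()
	return nil
}

// lateRunWeb models the watcher goroutine starting the service with new
// options before RunWeb gets to run with the old ones.
func lateRunWeb() (string, error) {
	w := &webService{}
	if err := w.StartWeb("new-options"); err != nil {
		return "", err
	}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // ensures RunWeb cannot block in this demo
	err := RunWeb(ctx, w, "old-options")
	return w.options, err
}

func main() {
	opts, err := lateRunWeb()
	fmt.Println("options in effect:", opts)
	fmt.Println("RunWeb error:", err)
}
```

In this model the lock prevents a data race, but the loser of the ordering race still fails, which matches the "error from RunWeb" outcome speculated above.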
robot/impl/local_robot.go
Outdated
@@ -498,6 +502,8 @@ func newWithResources(
}

successful = true
// Robot is "initializing" until first reconfigure after initial creation completes.
r.initializing.Store(true)
Technically there's a Reconfigure earlier on newcode 491 that sets this value to false. Not to mention the initializing value is initialized (ugh) to false. Two things:
- I'm taking that it's important we exit this function with initializing set to true. But I would expect to see the setting up at the top near the constructor. Can we document that the placement here is intentional to avoid prior calls mucking with the state?
- Is this function guaranteed to not start webserver and expose the GetMachineStatus API? If it can, it seems we might be setting initializing too late and may allow clients to observe an illegal transition of ready -> initializing.
Also -- line 428 (old) 432 (new) refers to the mod manager web server. Do we need to consider/provide guidelines for how module SDKs (which are -- in theory -- different from "application SDKs") use and perhaps expose GetMachineStatus?
I've modified the mechanics here slightly. You'll see there's a new robot.Option to start a robot in initializing mode. You can then use SetInitializing(false) to mark the robot as running. This means that only the code here in web/server/entrypoint.go is "special" with respect to initialization. All other calls to robotimpl.New will create robots that always return robot.StateRunning from MachineStatus.
I'm not sure that will address all your concerns here, and I'll think a bit harder about your module question before re-requesting review.
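A minimal sketch of the mechanic described in this comment, assuming hypothetical names throughout: `localRobot`, `WithInitializing`, and the local `State` type are illustrations only, and the actual option name in the PR is not given here. Robots created without the option report running from the start; only a robot created with it reports initializing until `SetInitializing(false)` is called.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// State is a hypothetical machine state enum.
type State int

const (
	StateInitializing State = iota
	StateRunning
)

func (s State) String() string {
	if s == StateInitializing {
		return "initializing"
	}
	return "running"
}

// localRobot is a hypothetical robot; the zero value is not initializing.
type localRobot struct {
	initializing atomic.Bool
}

// Option configures a localRobot at construction time.
type Option func(*localRobot)

// WithInitializing is a hypothetical option that starts the robot in
// initializing mode before New returns, so no client can observe "running"
// followed by "initializing".
func WithInitializing() Option {
	return func(r *localRobot) { r.initializing.Store(true) }
}

// New builds a robot and applies all options before returning it.
func New(opts ...Option) *localRobot {
	r := &localRobot{}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// SetInitializing marks the robot as initializing or running.
func (r *localRobot) SetInitializing(v bool) { r.initializing.Store(v) }

// MachineStatus reports the current state.
func (r *localRobot) MachineStatus() State {
	if r.initializing.Load() {
		return StateInitializing
	}
	return StateRunning
}

func main() {
	fmt.Println(New().MachineStatus())

	r := New(WithInitializing())
	fmt.Println(r.MachineStatus())
	r.SetInitializing(false)
	fmt.Println(r.MachineStatus())
}
```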
Do we need to consider/provide guidelines for how module SDKs (which are -- in theory -- different from "application SDKs") use and perhaps expose GetMachineStatus?
Slightly confused about your question here. The mod manager web server is started before any modules are added to the module manager, so it should always be the case that "module SDKs" (I'm reading that as Golang, Python, and C++ module libraries) should have the ability to connect back to the RDK through the mod manager web server before any module process has started. Were you suggesting that the module libraries should be using MachineStatus to check the status of the parent RDK, or that they should expose their own MachineStatus endpoint for some reason?
Slightly confused about your question here. The mod manager web server is started before any modules are added to the module manager, so it should always be the case that "module SDKs" (I'm reading that as Golang, Python, and C++ module libraries) should have the ability to connect back to the RDK through the mod manager web server before any module process has started.
Given that, it sounds like if a module, as soon as it was possible, tried calling MachineStatus they'd get an initialized == false.
But it also sounds like in the current, pre-patch code, a module could actually call ResourceNames before a regular "network client" could? Because we wouldn't have started accepting connections yet? And the result of calling ResourceNames would be undefined as the robot hasn't necessarily done/completed its initial reconfigure yet? Just considering the "happy path" where the initial robot config is good.
And if that's true, modules already have to be "resilient" to talking with an "uninitialized" robot. And we would not expect to need module changes. Such as the testing changes to "wait by default" for a robot to be initialized.
Given that, it sounds like if* a module, as soon as it was possible, tried calling MachineStatus they'd get an initialized == false.
That sounds correct to me.
But it also sounds like in the current, pre-patch code, a module could actually call ResourceNames before a regular "network client" could? Because we wouldn't have started accepting connections yet?
That sounds correct to me. In particular, the "module web server" of the RDK will be open for connections while the regular web server will not be. So, a module could connect via the former before a regular "network client" could connect via the latter.
And the result of calling ResourceNames would be undefined as the robot hasn't necessarily done/completed its initial reconfigure yet?
That does not sound correct to me. If a module is able to call ResourceNames through the module server, it will see the current status of the resource graph in terms of available names. If some resources have already completed configuration, the module will see them through ResourceNames.
And if that's true, modules already have to be "resilient" to talking with an "uninitialized" robot. And we would not expect to need module changes.
Modules do expect a certain guarantee around modular dependencies: if a modular resource A depends on another resource B, it is expected that, barring any inability to create B, B will be available and usable through ResourceByName within the constructor for A. I actually broke that guarantee and caused TestComplexModule to fail here.
We need the web service, which is "weakly" (it's actually, annoyingly, a fourth type of hardcoded-weak dependency that is not registered via WeakDependencies in resource registration) dependent on all resources to Reconfigure before the modular base builds such that the web service is aware of the two motors that the modular base depends on.
I "fixed" the test by updating weak dependents more often and more simply in completeConfig (see my incoming comment.)
// been closed above. This ensures processes are shutdown before any files
// are deleted they are using.
//
// If initializing, machine will be starting with no modules, but may
I'm not sure I understand this. Does anything actually go wrong if we don't guard this "cleanup" logic? Or are we just suggesting that making these calls would be "wasteful" no-ops?
The existing comment/first paragraph refers to "cleanup unused packages", so if we never started up with any, where are they coming from? Existing files on the file system from a previous start?
@cheukt had mentioned it would be good to guard the lines below based on initialization.
Or are we just suggesting that making these calls would be "wasteful" no-ops?
That's my understanding, yep.
Existing files on the file system from a previous start?
Also my understanding, yep.
If we don't guard this call, I suspect that we would clean up modules from the full config that aren't in the initial minimal config, so we'd end up deleting modules that would be used. If offline, the robot would no longer be able to re-download the module and start up correctly.
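The concern above can be illustrated with a small model. This is a sketch under assumptions: `cleanup` and the string-named module directories are hypothetical and only mimic "delete whatever the current config does not reference".

```go
package main

import "fmt"

// cleanup returns the module directories that survive a cleanup pass.
// While initializing, the (minimal) config references no modules, so the
// pass is skipped entirely and everything on disk is kept.
func cleanup(onDisk, configured []string, initializing bool) []string {
	if initializing {
		return append([]string(nil), onDisk...)
	}
	keep := make(map[string]bool, len(configured))
	for _, name := range configured {
		keep[name] = true
	}
	var kept []string
	for _, dir := range onDisk {
		if keep[dir] {
			kept = append(kept, dir)
		}
	}
	return kept
}

func main() {
	onDisk := []string{"module-a", "module-b"}

	// Unguarded: the minimal config has no modules, so both directories go,
	// even though the full config is about to need them.
	fmt.Println(len(cleanup(onDisk, nil, false)))

	// Guarded: nothing is removed while the machine is initializing.
	fmt.Println(len(cleanup(onDisk, nil, true)))
}
```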
// depends on at least one resource with weak dependencies (weak dependents)
// - The logical clock is higher than the `lastWeakDependentsRound` value
// At the start of every reconfiguration level, check if updateWeakDependents should be run
// by checking if the logical clock is higher than the `lastWeakDependentsRound` value.
Spoke a bit offline with @cheukt : I got rid of the "at least one resource that needs to reconfigure in this level depends on at least one resource with weak dependencies (weak dependents)" logic. We now only consider the value of the current logical clock as compared to the last time we ran updateWeakDependents. All tests seem to pass without changes, and the web service properly updates before building the modular base as described in my other comment (i.e. TestComplexModule/Test_Base passes and is able to find the motors the modular base depends on here.)
I'm not entirely surprised we don't need the logic I removed, as removing it only causes updateWeakDependents to be called more frequently. From my understanding, this will mean the web service, framesystem, and anything with weak dependencies will likely get their Reconfigure methods called more often. Is that bad? Maybe? I think we need to consider weak dependencies and their necessity more holistically and perhaps not in this PR.
Happy to talk offline about this change. I think the comments in all the tests @cheukt added in robot_reconfigure_weak_dependencies_test.go will need to be changed. Again, would like to handle that as part of a separate PR.
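The simplified check can be sketched as below. This is a model under assumptions: the `manager` type, its integer clock, and the idea that the round marker is set to the current clock after each update are all hypothetical; only the comparison between the logical clock and `lastWeakDependentsRound` comes from the code comment above.

```go
package main

import "fmt"

// manager is a hypothetical holder of the logical clock and the clock value
// at which weak dependents were last updated.
type manager struct {
	clock                   int64
	lastWeakDependentsRound int64
	updates                 int
}

// maybeUpdateWeakDependents runs the update only if the logical clock has
// advanced since the last round, and reports whether it ran.
func (m *manager) maybeUpdateWeakDependents() bool {
	if m.clock <= m.lastWeakDependentsRound {
		return false
	}
	m.updates++ // stands in for reconfiguring web service, framesystem, etc.
	m.lastWeakDependentsRound = m.clock
	return true
}

func main() {
	m := &manager{}
	fmt.Println(m.maybeUpdateWeakDependents()) // clock has not advanced

	m.clock++ // some resource was (re)configured
	fmt.Println(m.maybeUpdateWeakDependents())
	fmt.Println(m.maybeUpdateWeakDependents()) // nothing new since last round
}
```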
looks generally good, but some questions around testing
//
// It is expected that golang SDK users will handle lack of resource
// availability due to the machine being in an initializing state themselves.
if testing.Testing() {
can we add this block inside robottestutils.NewRobotClient instead?
Yeah, good question and something I thought about doing. Many tests make connections to robots they've set up with plain old client.New. So, I'd have to change all of those calls to be robottestutils.NewRobotClient instead, and we would have to remember to use robottestutils.NewRobotClient instead of client.New for any future tests. I opted to put the check in client.New since I thought we'd likely forget to use robottestutils.NewRobotClient for future tests, but I don't have a super strong opinion.
can we make changing all client.New calls to robottestutils.NewRobotClient a ticket? I think it would be good to segregate the testing code from actual code if possible.
For sure: https://viam.atlassian.net/browse/RSDK-9609.
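For reference, the kind of wait being discussed could look roughly like the helper below, wherever it ends up living. This is a sketch, not the PR's code: `waitForRunning`, the `State` type, and the polling interval are assumptions, and the `testing.Testing()` gate from the diff is deliberately left out so the helper stays independent of how it is invoked.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// State is a hypothetical machine state enum.
type State int

const (
	StateInitializing State = iota
	StateRunning
)

// waitForRunning polls status until it reports StateRunning or ctx is done.
func waitForRunning(ctx context.Context, status func() State, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if status() == StateRunning {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demo uses a fake status function that reports running on the third poll.
func demo() (int, error) {
	polls := 0
	status := func() State {
		polls++
		if polls < 3 {
			return StateInitializing
		}
		return StateRunning
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := waitForRunning(ctx, status, time.Millisecond)
	return polls, err
}

func main() {
	polls, err := demo()
	fmt.Println(polls, err)
}
```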
@@ -192,3 +193,54 @@ func isExpectedShutdownError(err error, testLogger logging.Logger) bool {
testLogger.Errorw("Unexpected shutdown error", "err", err)
return false
}

// Tests that machine state properly reports initializing or running.
can we add a test that tests the transition from initializing to running? I'm guessing we may have to register a resource that takes a long time to start up, but it would be good to have a test that fully tests the feature that you're adding.
but can discuss if it's difficult to do
Another good question; it should be possible, yeah. I can at least delay the movement from initializing to running with a long-running constructor, but I think the test will be a bit racy.
ya, I think we should add a test
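One possible way to keep such a test from being racy is to block the slow constructor on a channel instead of a sleep, so the test decides exactly when initialization finishes. The sketch below shows the idea with a hypothetical `machine` type; it is not the test that was added in this PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// machine is a hypothetical stand-in for a robot with an initializing flag.
type machine struct {
	initializing atomic.Bool
}

func (m *machine) state() string {
	if m.initializing.Load() {
		return "initializing"
	}
	return "running"
}

// reconfigure models a reconfigure whose resource constructor blocks until
// release is closed, then marks the machine as running.
func (m *machine) reconfigure(release <-chan struct{}) {
	<-release // the "slow constructor"
	m.initializing.Store(false)
}

// observeTransition reads the state while the constructor is still blocked
// and again after the reconfigure has completed.
func observeTransition() (before, after string) {
	m := &machine{}
	m.initializing.Store(true)

	release := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		m.reconfigure(release)
	}()

	before = m.state() // the constructor cannot have finished yet
	close(release)
	<-done
	after = m.state()
	return before, after
}

func main() {
	before, after := observeTransition()
	fmt.Println(before, "->", after)
}
```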
// If initializing, machine will be starting with no modules, but may
// immediately reconfigure to start modules that have already been
// downloaded. Do not cleanup packages/module dirs in that case.
if !r.initializing.Load() {
can we add a test for this behavior?
Yep.
RSDK-9440
Changes:
- State field to robot.MachineStatus both server and client side
- StateInitializing in robot.MachineStatus before reconfigure with full config occurs
- StateRunning in robot.MachineStatus after reconfigure with full config occurs
- SetInitializing method on robot.LocalRobot for the above two points to work
Testing:
- MachineStatus tests to make assertions on State
- client.New when in testing