-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A possible implementation for managing non-python agents #320
base: main
Are you sure you want to change the base?
Conversation
updates: - [github.com/pre-commit/mirrors-autopep8: v2.0.1 → v2.0.2](pre-commit/mirrors-autopep8@v2.0.1...v2.0.2)
…config [pre-commit.ci] pre-commit autoupdate
* Separate blocks for each registered operation * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests: Remove check for exception in changed agent ops test * Switch dict representation --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Brian Koopman <[email protected]>
…adata-dep Add `importlib_metadata` dependency for ocs-agent-cli
Instead of showing up in top as "ocs-agent-cli", it shows up as "ocs-agent:{instance_id}".
ocs-agent-cli renames itself in process list
* Update os+python version on readthedocs * Add rtd config to ignore paths for workflows
* Add extra dependencies for development * Qualify image names with repository * Fix test failures from calling `os.path.join` on mock objects * Fix mistakes in dev dependency list * Revert "Qualify image names with repository" This change was unsafe, as it would run tests over previously released code, instead of possibly modified, local code. This reverts commit f357ad3. --------- Co-authored-by: C. Weaver <[email protected]>
* ocs-local-support: warn if crossbar/scf disagree on realm/address_root * registry subscribes to {address_root}.* not observatory.* Also changed OCSAgent to fail if any startup subscriptions fail. * aggregator subscribes to {address_root}.* not observatory.* * influxdb_publisher subscribes to {address_root}.* not observatory.* * InfluxDBAgent: exit cleanly on ctrl-c ... even when looping on failed connections to influxdb. * Tweak a few references to "observatory" in docstrings * docs: clarify "observatory" as default address_root * Remove all references to "registry_address" This SCF param actually wasn't used anywhere, except optionally in the ocs-agent-cli (which now simply defaults to {address_root}.registry and can still be overridden). The command-line argument remains, but is doc'ed as deprecated and has no effect on anything. * Fix registry tests (needs args.address_root) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ocs-local-support: better error message if crossbar not managed ... in response to requests for "generate_crossbar_config" and "start crossbar". Also update erroneous docs for the former. * ocs-crossbar image can be configured more, with env Now the crossbar config can be bind-mounted in, or else some envvars are used to generate one from a template before crossbar is started. * ocs_agent: more informative error message on realm mismatch Also the SCF docs warn about updating the crossbar config. * Update documentation related to crossbar configuration --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Brian Koopman <[email protected]>
Cast tuple param before type check
…sses (simonsobs#333) * Log exceptions caught within start() * Create failing test for start task with None params * Handle ParamHandler getting None params * Simplify None params handling * Fix log formatting syntax
…t-docker-compose Replace pytest-docker-compose with pytest-docker plugin
updates: - [github.com/pycqa/flake8: 6.0.0 → 6.1.0](PyCQA/flake8@6.0.0...6.1.0)
…config [pre-commit.ci] pre-commit autoupdate
HostManager can crash and restart really quickly, so this rate-limiting is necessary on some systems to prevent systemd from giving up on the service.
systemd service file: add RestartSec=10s
updates: - [github.com/pre-commit/mirrors-autopep8: v2.0.2 → v2.0.4](pre-commit/mirrors-autopep8@v2.0.2...v2.0.4)
…config [pre-commit.ci] pre-commit autoupdate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution, and sorry it took so long to review it.
My comments below are not intended as the final word -- happy to discuss with you and other OCS devs.
I am on board with adding this capability, but I would like to suggest a modification to the implementation. I would like for the agent-class to still be a mandatory setting in every agent block of the SCF. The alternate operation path would then be entirely triggered by whether agent-exe is set to non-None. (As a by-product, this does simplify a lot of arg-checking logic I think.)
The AgentClass plays an important role in ocs-web, for example to help instantiate the correct control panel for an Agent. (ocs-web uses the class reported by the Agent instance itself, through get_api, rather than whatever HostManager has read from the SCF ... but still, it is useful to think of agent instances as belonging to a particular named class.)
As a parallel with the docker case, where AgentClass is reported in tables with a [d] suffix, I think executable case should have an AgentClass reported, with the [exe] suffix. The path to the binary, which is currently jammed into the [agent] column (in the ocsbow status output for example), could instead be added as a separate column. (For the docker-based agents that column would contain the docker-compose file + service name I guess; beyond the scope of this work but just thinking ahead.)
See a few other comments inline. I am happy to do some of the work on this, as I am in the process of addressing a bunch of other HM issues and it would be great for this to all evolve in harmony.
ocs/agents/host_manager/drivers.py
Outdated
if self.log_file is not None: | ||
self.log_file.flush() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my brief tests, this flush happens too quickly -- before the INT is sent and processed. You should probably do a flush in processExited in addition to / instead of here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed; moving the flush to processExited
seems to work much better.
The Agent executable. This | ||
may be matched against the agent_exe name provided by | ||
the Agent instance, as a way of finding the right | ||
InstanceConfig. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this feature (matching against agent_exe) useful? I have been thinking that the class-matching feature (that it mimics) never really turned out to be that useful, and was considering deprecating it. I would not miss it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I had any particular use in mind; I was just trying to mimic the class-matching feature in case it was somehow important. On that basis, I don't think I would object to removing it, now or later.
ocs/agents/host_manager/agent.py
Outdated
cmd = [instance['agent_exe'], '--instance-id', self.address_root + '.' + iid, | ||
'--wamp-url', self.wamp_url, '--wamp-realm', self.wamp_realm] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand wanting to join address_root and iid, but can you call it --agent-address
or --address
? I don't like redefining --instance-id
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I called it --instance-id
solely out of an attempt to better fit in with the surrounding code; my agents have always taken --address
and I later added this as an alias. So, if you prefer --address
, so do I, and I'm happy to switch it. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally took the time to test this out today, it worked great to startup an external executable. I don't have much to add to @mhasself's review. The one thing I will comment on is the logging.
I'd like to suggest logging stdout/stderr by default, with the option to disable it. We currently run into situations where logging isn't on (because log_dir
isn't set) then an error occurs and we'd like to have logs to debug, but don't. Logging by default avoids that, and allows one to disable the feature if logging is handled some other way, perhaps in the non-python exe.
if "agent_arguments" in instance: | ||
cmd.extend(instance["agent_arguments"]) | ||
if instance['write_logs']: | ||
log_file_path = self.log_dir + '/' + self.address_root + '.' + iid + ".log" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If log_dir
isn't defined in the SCF this'll throw:
2023-09-13T16:50:49-0400 Unhandled Error
Traceback (most recent call last):
File "/home/koopman/git/ocs/ocs/agents/host_manager/agent.py", line 672, in main
runner.run(agent, auto_reconnect=True)
File "/home/koopman/.venv/user/lib/python3.11/site-packages/autobahn/twisted/wamp.py", line 439, in run
reactor.run()
File "/home/koopman/.venv/user/lib/python3.11/site-packages/twisted/internet/base.py", line 1318, in run
self.mainLoop()
File "/home/koopman/.venv/user/lib/python3.11/site-packages/twisted/internet/base.py", line 1328, in mainLoop
reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
File "/home/koopman/.venv/user/lib/python3.11/site-packages/twisted/internet/base.py", line 967, in runUntilCurrent
f(*a, **kw)
File "/home/koopman/git/ocs/ocs/agents/host_manager/agent.py", line 353, in _launch_instance
log_file_path = self.log_dir + '/' + self.address_root + '.' + iid + ".log"
builtins.TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
If this isn't set, consistent behavior with OCS would be to just not log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for spotting that; it was definitely a bug.
for more information, see https://pre-commit.ci
…nto non-python-agents
Thanks for the review comments! I do have a few questions on how exactly to implement some of the bigger picture ones:
I didn't do this in large part because it wasn't obvious to me what value one should use for the agent class of something which is not in fact a python class. It could be treated as just a label, but the instance ID already has that role? I suppose for display purposes the basename of the executable path could be used, as that's fairly analogous to a python class name, but it would still seem odd to me to make the user specify this, when doing so is both redundant with the executable path (which is necessary), and introduces a point of possible confusion if the user doesn't keep the two items properly synchronized when writing or editing the configuration.
I've never used the web interface, so I missed this. However, this use-case still leaves unclear what the agent class should be for an executable. It's not clear to me what control panels there are, or where they come from; is there something sensible for ocs-web to select for an arbitrary agent? I guess I would have expected it, at least in the generic case, to use the reflection provided by WAMP to discover what commands, etc. each agent supports.
This is certainly doable, but it would take up more space, and depending on what value was used for the agent class it might be no more informative? I do see the utility in being able to display more information about the docker case; if that is the direction in which to go, though, what label would make sense for the column which might contain an executable path, might contain docker info, or might be empty (for python agents, unless there's also something relevant to display for them)?
I'm happy to make logging the default; in fact I'm not sure why I didn't to begin with. That being said, I'd like to better understand the situation around |
I don't really have much to add, only that I like the idea of a default log directory, but agree it might be out of scope here. I'd need to consider how we would handle log rotation. Some Agents' logs can be kind of verbose. In the Docker context, this'll make container's disk usage grow (and Docker is also storing the logs, so we're doubling this log output at that point.) |
This is a tentative set of changes to support managing agents (or indeed any program) which are not written in python (and without using Docker).
Description
This change allows agents to be specified with an executable,
agent-exe
, as an alternative toagent-class
. If this is done, the executable is run directly, rather than running the python interpreter on a script or module.Arguments are passed somewhat differently to agent executables: The
agent_arguments
are passed directly, and the WAMP (router) URL and realm are passed, instead of the OCS site file and host. Additionally, the instance ID passed to the child is fully qualified, with the OCS address root. The logic for this is to avoid every agent having to reimplement the OCS config parsing logic, when the HostManager has already done the parsing, and has all necessary information available to it to pass on. The disadvantage is that handling of python and generic agents is not symmetric.A
capture_logs
boolean option is added for agents. When this is enabled, theAgentProcessHelper
writes data captured from stdout/stderr to a file alongside existing logs. This is not fancy, but at least ensures that log data from programs which write normally to stdout/stderr can be checked.These changes are orthogonal to, and should not interfere with, the existing Docker agent management mechanism/encoding, but it is somewhat dissatisfying that python, docker, and generic code paths are not more parallel. The hope is that this avoids breaking any of the Docker handling, which I cannot easily test, but it is not very elegant.
Motivation and Context
I would like to manage agents which are not written in python due to their requirements for multi-threading and low-latency operation on high data volumes, e.g. the CMB-S4 DAQ collector prototype. It is already possible for such agents to communicate over WAMP to publish and subscribe to data feeds and have operations controlled with an
OCSClient
without changes to OCS, but it is also desirable to be able to manage their lifetimes (viaocsbow
, and theHostManager
, etc.) as well.How Has This Been Tested?
All tests in the test suite pass, and I am successfully able to manage (bring up, keep running, shut down, and view logs) a C++ agent alongside standard python agents on an Ubuntu host.
Types of changes
Checklist:
Documentation of these features is still missing; I will be happy to write it if/when we converge on the implementation details. I am also not entirely sure that there are not gaps in my implementation; I have tested by running my own use-case (with both python and non-python agents), and the test suite, but I don't have a good feel for the completeness of the coverage that provides.