Add task events to the scheduler #2043
Conversation
Pickup in a new PR
* main: (25 commits)
  - Create object history API (#2074)
  - Bump actions/github-script from 6 to 7 (#2076)
  - Installation manual for Windows (2) (#2096)
  - Update howdoesitwork.rst (#2091)
  - Add benchmarking script to the scheduler (#2071)
  - Add fix-poetry-merge-conflict makefile command (#2088)
  - Bump sphinx-rtd-theme from 1.2.2 to 2.0.0 (#2080)
  - Lower quality level so the CI check doesn't fail (#2086)
  - Update xtdb version in octopoes CI docker compose and docker-compose.release-example.yml (#2085)
  - Name test nodes by testname instead of uuid (#2087)
  - Upgrade to Pydantic v2 (#1912)
  - Docs: add dependency installation commands for RHEL based systems (#2059)
  - Fix/2072 (#2082)
  - Feature/service to systems reports rocky (#2073)
  - Update scheduler python packages (#2062)
  - Add uvicorn back as non-dev dependency (#2053)
  - Bump `cryptography` (#2070)
  - Filter tree objects with depth=1 for Findings (#1982)
  - Bump aiohttp from 3.8.6 to 3.9.0 in /boefjes (#2061)
  - Translations update from Hosted Weblate (#2057)
  - ...
Looks good in general and nice tests 👍 Some general remarks to reconsider:
- Although it's a bit challenging since you need to define these upfront (as much as possible), you might want to use enum fields in the events table (e.g. for type and event). This will improve database-side performance.
- Instead of storing the complete task in the event, consider referencing it by its ID. This not only saves storage, but also (slightly) improves query response times. Keep in mind though, that this could potentially lead to more queries if the task data is required later. However, in such cases, it's worth looking at materialized views for such specific queries.
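The enum-column suggestion could be sketched as follows with SQLAlchemy. This is only a sketch: the enum members, class names, and table name are assumptions for illustration, not the scheduler's actual schema.

```python
import enum

from sqlalchemy import Column, Enum, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class EventContext(enum.Enum):  # hypothetical values
    TASK = "task"


class EventType(enum.Enum):  # hypothetical values
    CREATE = "create"
    UPDATE = "update"


class Event(Base):
    __tablename__ = "events"  # hypothetical table name

    id = Column(Integer, primary_key=True)
    # Enum columns constrain the allowed values and are cheaper to
    # compare and index than free-form strings.
    context = Column(Enum(EventContext), nullable=False)
    event = Column(Enum(EventType), nullable=False)
```

On PostgreSQL, `Enum` maps to a native enum type; the trade-off the reviewer mentions is that adding new members later requires a schema migration.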
Good idea, I think I'll pick this up in another PR. At the moment I'm still determining what the values of these columns could and should be.
Yeah, I wanted specifically to keep the task data in there, mainly because it gives us a way to track state changes of an object, which would allow more complex queries. At the moment this is indeed limited to tasks, but it will be extended with events regarding tasks, task consumption for instance.
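Keeping the full task payload in each event makes state-change tracking a matter of replaying the event log in order. A minimal, dependency-free illustration, using hypothetical event dicts rather than the scheduler's actual models:

```python
# Hypothetical event log: each event carries the full task snapshot,
# mirroring the data=task.model_dump() approach in this PR.
events = [
    {"event": "create", "data": {"id": "t1", "status": "pending"}},
    {"event": "update", "data": {"id": "t1", "status": "dispatched"}},
    {"event": "update", "data": {"id": "t1", "status": "completed"}},
]


def status_history(events, task_id):
    """Replay events for one task and return its status transitions."""
    return [
        e["data"]["status"]
        for e in events
        if e["data"]["id"] == task_id
    ]


print(status_history(events, "t1"))  # ['pending', 'dispatched', 'completed']
```

In SQL this would be a query over the events table ordered by insertion time; the point is that the snapshots make historical states reconstructible without extra bookkeeping.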
Checklist for QA:
- What works: Onboarding flow works, can schedule additional boefjes, tasks are scheduled as expected and additional findings are created.
- What doesn't work: n/a
- Bug or feature?: n/a
As far as I can see, the events table solution will scale a lot worse compared to simply storing the data in the task table. The task will only be executed once and duration/queued/runtime only need to be stored once per task, so there is no need for a 1:n relation table. Storing them as simple fields in the task table means:
- We can give the fields proper timedelta/interval types
- We can easily index it. While you can also add indexes to jsonb, that will always have more unnecessary indirection and will thus be slower.
- Smaller database. This is important for performance, because in my experience performance doesn't decrease linearly with database size, but it goes downhill a lot faster when the database gets bigger.
- Half the number of queries, because the events table needs an extra insert every time we do an insert and update.
With regard to data not being generated yet for runtimes that do not exist yet, I'd say it would be better to follow the YAGNI principle (https://en.wikipedia.org/wiki/You_aren't_gonna_need_it). And even if we already want to implement it, it would be better to just add a jsonb field in the task table for it.
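The proposal above, durations as typed columns directly on the task table, could be sketched like this with SQLAlchemy. The column names follow the discussion, but the table definition itself is an assumption for illustration:

```python
from datetime import timedelta

from sqlalchemy import Column, Integer, Interval
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Task(Base):
    __tablename__ = "tasks"  # hypothetical table name

    id = Column(Integer, primary_key=True)
    # Interval maps to PostgreSQL's INTERVAL type, so the durations get
    # a proper time type and can be indexed directly, with no jsonb
    # indirection and no join against a separate events table.
    duration = Column(Interval)
    queued = Column(Interval)
    runtime = Column(Interval)
```

Each task row is written once and updated in place, which is where the "half the number of queries" point comes from: no companion insert into an events table per write.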
type="events.db",
context="task",
event="update",
data=task.model_dump(),
This will result in a very big event table, because the task model is pretty big already and this will be duplicated unnecessarily multiple times in the events table. I don't think this is a good idea with regards to performance and resource usage.
queued: Optional[float] = Field(None, alias="queued", readonly=True)
runtime: Optional[float] = Field(None, alias="runtime", readonly=True)
Type should be `timedelta` instead of `float`.
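With Pydantic v2 (which main already uses, per the merged commit list), the suggested change could look like the sketch below. This is an assumption-laden illustration, not the PR's actual model:

```python
from datetime import timedelta
from typing import Optional

from pydantic import BaseModel, Field


class Task(BaseModel):  # hypothetical, reduced model
    # timedelta instead of float: Pydantic coerces plain numbers
    # (seconds) and ISO 8601 duration strings into a real timedelta.
    queued: Optional[timedelta] = Field(None)
    runtime: Optional[timedelta] = Field(None)


t = Task(runtime=12.5)
print(t.runtime)  # 0:00:12.500000
```

A `timedelta` field keeps the unit unambiguous at the model boundary, matching an `Interval` column on the database side.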
id = Column(Integer, primary_key=True)
task_id = Column(GUID)
This should be a foreign key to the task table.
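A sketch of the foreign-key version of that column. The table names are assumed, and since the project's custom `GUID` column type isn't shown here, a plain `String` stands in for it so the sketch is self-contained:

```python
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy import String as GUID  # stand-in for the project's GUID type
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Task(Base):
    __tablename__ = "tasks"  # hypothetical table name

    id = Column(GUID, primary_key=True)


class Event(Base):
    __tablename__ = "events"  # hypothetical table name

    id = Column(Integer, primary_key=True)
    # ForeignKey enforces referential integrity (an event cannot point
    # at a nonexistent task) and lets the planner join via the tasks
    # primary-key index.
    task_id = Column(GUID, ForeignKey("tasks.id"))
```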
Yes, you bring up valid points. My consideration with the suggested changes was that keeping an event log opens up more flexibility to do more complex queries regarding the tasks (#1578). However, I agree we can opt for a simpler approach and cross that bridge when we get there.
Closing, superseded by #2214
Changes
This PR adds task events functionality to the scheduler.
This allows us to:
Considerations
The queries for duration, queued, and runtime should only be done when requested; they should not be done for the task list endpoint (has been implemented).
TODO
Issue link
Closes #1956