Add task events to the scheduler #2043

jpbruinsslot · 2023-11-20T08:27:35Z

Changes

This PR adds task events functionality to the scheduler.

Adds an event to the events table on inserts and updates on the tasks table, e.g. when a task is created or a task status is updated. This is done by leveraging PostgreSQL triggers.
Exposes the ability to add events to the events table through a REST API endpoint.

This allows us to:

Add calculated attributes to a task, like how long a task was queued, how long the task ran for in the task runner, and how long the task took in total.
Create additional events that are related to a task, e.g. how much CPU, memory, and network I/O the task runner used.
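
As an illustration of the calculated attributes above, here is a minimal sketch of how queued, runtime, and duration could be derived from a task's status events. The `TaskEvent` shape and the status values `queued`, `running`, and `completed` are assumptions made for this example, not the models used in this PR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TaskEvent:
    # Hypothetical, simplified event row holding only the fields needed here.
    task_id: str
    status: str  # assumed values: "queued", "running", "completed"
    timestamp: datetime


def first_timestamp(events: list[TaskEvent], status: str) -> Optional[datetime]:
    """Return the earliest timestamp at which the task reached `status`."""
    stamps = [e.timestamp for e in events if e.status == status]
    return min(stamps) if stamps else None


def task_timings(events: list[TaskEvent]) -> dict[str, Optional[timedelta]]:
    """Derive queued, runtime and duration from a single task's status events."""
    queued_at = first_timestamp(events, "queued")
    running_at = first_timestamp(events, "running")
    done_at = first_timestamp(events, "completed")

    def delta(start: Optional[datetime], end: Optional[datetime]) -> Optional[timedelta]:
        # A timing is only known once both of its boundary events exist.
        if start is None or end is None:
            return None
        return end - start

    return {
        "queued": delta(queued_at, running_at),
        "runtime": delta(running_at, done_at),
        "duration": delta(queued_at, done_at),
    }
```

A task that has not started yet simply yields `None` for `runtime` and `duration`, so a detail view can show partial timings.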

Considerations

If we want, we can still opt to use columns for the runtime, queued, and duration fields instead of adding them to an events table. However, this limits extension with additional information like CPU, memory, and network I/O load, unless we add additional database columns.
The use of PostgreSQL triggers could be considered a bit of obfuscation, and a bit of magic that is done in the 'background'. We probably want to use Python instead, on task updates from the API.
Additionally, we probably want to move away from those triggers if the task runner creates events like CPU and memory load.
Scalability: under high load we can end up with a large events table, so partitioning, or compacting by date, should perhaps be considered.
~~The queries for duration, queued, and runtime should only be done when requested. It should not be done for the task list endpoint~~ has been implemented
The fields type, context, and event can be superfluous, and could be combined into one field, perhaps with dot notation.
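
The last consideration could look roughly like the sketch below. The helper names `join_event_name` and `split_event_name` are hypothetical; the example values `events.db`, `task`, and `update` are taken from the diff in this PR.

```python
from typing import NamedTuple


class EventName(NamedTuple):
    type: str
    context: str
    event: str


def join_event_name(type_: str, context: str, event: str) -> str:
    """Combine the three fields into one dot-separated name."""
    return ".".join((type_, context, event))


def split_event_name(name: str) -> EventName:
    """Split a combined name back into type, context and event."""
    # Split from the right so that a type which itself contains dots,
    # such as "events.db", stays intact.
    type_, context, event = name.rsplit(".", 2)
    return EventName(type_, context, event)
```

With these values, `join_event_name("events.db", "task", "update")` produces `events.db.task.update`. A single field like this would still allow prefix filtering, at the cost of losing separate columns to index.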

TODO

Create EventStore
Create postgresql trigger on task updates to create event in the events table
Fix serialization between sqlalchemy and pydantic models
Implement events endpoint
Write tests
Make sure that list task view doesn't call the individual queries, that should be done on the detail view

Issue link

Closes #1956

Pickup in a new PR

* main: (25 commits) Create object history API (#2074) Bump actions/github-script from 6 to 7 (#2076) Installation manual for Windows (2) (#2096) Update howdoesitwork.rst (#2091) Add benchmarking script to the scheduler (#2071) Add fix-poetry-merge-conflict makefile command (#2088) Bump sphinx-rtd-theme from 1.2.2 to 2.0.0 (#2080) Lower quality level so the CI check doesn't fail (#2086) Update xtdb version in octopoes CI docker compose and docker-compose.release-example.yml (#2085) Name test nodes by testname instead of uuid (#2087) Upgrade to Pydantic v2 (#1912) Docs: add dependency installation commands for RHEL based systems (#2059) Fix/2072 (#2082) Feature/service to systems reports rocky (#2073) Update scheduler python packages (#2062) Add uvicorn back as non-dev dependency (#2053) Bump `cryptography` (#2070) Filter tree objects with depth=1 for Findings (#1982) Bump aiohttp from 3.8.6 to 3.9.0 in /boefjes (#2061) Translations update from Hosted Weblate (#2057) ...

ammar92

Looks good in general and nice tests 👍 Some general remarks to reconsider:

Although it's a bit challenging since you need to define these upfront (as much as possible), you might want to use enum fields in the events table (e.g. for type and event). This will improve performance in db time
Instead of storing the complete task in the event, consider referencing it by its ID. This not only saves storage, but also (slightly) improves query response times. Keep in mind though, that this could potentially lead to more queries if the task data is required later. However, in such cases, it's worth looking at materialized views for such specific queries.
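
A rough sketch of both remarks, assuming a small, fixed set of values: the event refers to the task by its ID only, and type and event are restricted to enum members. The member names, and the `insert` value, are assumptions for illustration; only `events.db` and `update` appear in this PR's diff.

```python
import enum
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


class EventType(str, enum.Enum):
    DB = "events.db"


class EventAction(str, enum.Enum):
    INSERT = "insert"  # assumed value, not taken from the PR
    UPDATE = "update"


@dataclass(frozen=True)
class Event:
    # Only the task ID is stored, not a full dump of the task.
    task_id: uuid.UUID
    type: EventType
    event: EventAction
    timestamp: datetime


def make_event(task_id: uuid.UUID, type_: str, event: str) -> Event:
    """Build an event, rejecting values that are not defined in the enums."""
    return Event(
        task_id=task_id,
        type=EventType(type_),      # raises ValueError for unknown types
        event=EventAction(event),   # raises ValueError for unknown actions
        timestamp=datetime.now(timezone.utc),
    )
```

The trade-off mentioned above still applies: anything that needs the task's fields has to look the task up by `task_id`.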

mula/scheduler/server/server.py

jpbruinsslot · 2023-12-11T11:12:54Z

> Looks good in general and nice tests 👍 Some general remarks to reconsider:
>
> Although it's a bit challenging since you need to define these upfront (as much as possible), you might want to use enum fields in the events table (e.g. for type and event). This will improve performance in db time

Good idea, I think I'll pick this up in another PR. At the moment I'm still determining what the values of these columns could and should be.

> Instead of storing the complete task in the event, consider referencing it by its ID. This not only saves storage, but also (slightly) improves query response times. Keep in mind though, that this could potentially lead to more queries if the task data is required later. However, in such cases, it's worth looking at materialized views for such specific queries.

Yeah, I specifically wanted to keep the task data in there, mainly because we then have a way to track state changes of an object, which would allow us to do more complex queries. At the moment this is indeed limited to tasks, but it will be extended with events regarding tasks, task consumption for instance.

stephanie0x00 · 2023-12-11T16:29:06Z

Checklist for QA:

I have checked out this branch, and successfully ran a fresh make reset.
I confirmed that there are no unintended functional regressions in this branch:
- I have managed to pass the onboarding flow
- Objects and Findings are created properly
- Tasks are created and completed properly
I confirmed that the PR's advertised feature or hotfix works as intended.

What works:

Onboarding flow works, can schedule additional boefjes, tasks are scheduled as expected and additional findings are created.

What doesn't work:

n/a

Bug or feature?:

n/a

dekkers

As far as I can see the events table solution will scale a lot worse compared to simply storing the data in the task table. The task will only be executed once and duration/queued/runtime only need to be stored once for a task, so there is no need for a 1:n relation table. Storing it as simple fields in the task table means:

We can give the fields proper timedelta/interval types
We can easily index it. While you can also add indexes to jsonb, that will always have more unnecessary indirection and will thus be slower.
Smaller database. This is important for performance, because in my experience performance doesn't decrease linearly with database size, but it goes downhill a lot faster when the database gets bigger.
Half the number of queries, because the events table needs an extra insert every time we do an insert and update.

With regard to data that is not generated yet for runtimes that do not exist yet, I'd say it would be better to follow the YAGNI principle (https://en.wikipedia.org/wiki/You_aren't_gonna_need_it). And even if we already want to implement it, it would be better to just add a jsonb field in the task table for it.
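
To make the alternative concrete, here is a minimal sketch of a task record that stores the timings once, as `timedelta` fields. The `Task` class and its field and method names are hypothetical and not the scheduler's actual model. In SQLAlchemy, fields like these would typically map to `Interval` columns.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    # Hypothetical, simplified task record.
    id: str
    created_at: datetime
    started_at: Optional[datetime] = None
    queued: Optional[timedelta] = None
    runtime: Optional[timedelta] = None
    duration: Optional[timedelta] = None

    def mark_started(self, now: datetime) -> None:
        """Record when the task runner picked the task up."""
        self.started_at = now
        self.queued = now - self.created_at

    def mark_finished(self, now: datetime) -> None:
        """Record completion and store the timings once."""
        if self.started_at is None:
            raise ValueError("task was never started")
        self.runtime = now - self.started_at
        self.duration = now - self.created_at
```

Each value is written exactly once as part of an update that already happens, so no extra insert into a second table is needed.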

dekkers · 2023-12-11T22:13:47Z

mula/scheduler/schedulers/scheduler.py

+            type="events.db",
+            context="task",
+            event="update",
+            data=task.model_dump(),


This will result in a very big event table, because the task model is pretty big already and this will be duplicated unnecessarily multiple times in the events table. I don't think this is a good idea with regards to performance and resource usage.

dekkers · 2023-12-11T22:21:58Z

mula/scheduler/models/tasks.py

+
+    queued: Optional[float] = Field(None, alias="queued", readonly=True)
+
+    runtime: Optional[float] = Field(None, alias="runtime", readonly=True)


Type should be timedelta instead of float.

dekkers · 2023-12-11T22:44:18Z

mula/scheduler/models/events.py

+
+    id = Column(Integer, primary_key=True)
+
+    task_id = Column(GUID)


This should be a foreign key to the task table.
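
What the foreign key buys can be shown with a small, self-contained sketch. It uses the standard library's `sqlite3` as a stand-in for PostgreSQL, and the table and column names are simplified assumptions. In the SQLAlchemy model this would presumably be something along the lines of `Column(GUID, ForeignKey("tasks.id"))`, where the referenced table name is also an assumption.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a tasks table and an events table."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY)")
    conn.execute(
        """
        CREATE TABLE events (
            id INTEGER PRIMARY KEY,
            task_id TEXT NOT NULL REFERENCES tasks (id)
        )
        """
    )
    return conn


def add_event(conn: sqlite3.Connection, task_id: str) -> bool:
    """Insert an event; return False when the referenced task does not exist."""
    try:
        conn.execute("INSERT INTO events (task_id) VALUES (?)", (task_id,))
    except sqlite3.IntegrityError:
        return False
    return True
```

With the constraint in place, an event for an unknown task is rejected by the database itself instead of silently creating an orphaned row.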

jpbruinsslot · 2023-12-12T10:04:20Z

> As far as I can see the events table solution will scale a lot worse compared to simply storing the data in the task table. The task will only be executed once and duration/queued/runtime only need to be stored once for a task, so there is no need for a 1:n relation table. Storing it as simple fields in the task table means:
>
> We can give the fields proper timedelta/interval types
>
> We can easily index it. While you can also add indexes to jsonb, that will always have more unnecessary indirection and will thus be slower.
>
> Smaller database. This is important for performance, because in my experience performance doesn't decrease linearly with database size, but it goes downhill a lot faster when the database gets bigger.
>
> Half the number of queries, because the events table needs an extra insert every time we do an insert and update.
>
> With regard to data that is not generated yet for runtimes that do not exist yet, I'd say it would be better to follow the YAGNI principle (https://en.wikipedia.org/wiki/You_aren't_gonna_need_it). And even if we already want to implement it, it would be better to just add a jsonb field in the task table for it.

Yes, you bring up valid points. My consideration with the suggested changes was that keeping an event log opens up more flexibility to do more complex queries regarding the tasks (#1578). However, I agree we can opt for a simpler approach and cross that bridge when we get there.

jpbruinsslot · 2024-01-02T13:13:34Z

Closing, superseded by #2214

jpbruinsslot added 3 commits November 14, 2023 18:04

Add task events table

9b73610

Extend database queries and models

24f68f3

Add event store

a2041eb

jpbruinsslot added the mula (Issues related to the scheduler) label Nov 20, 2023

jpbruinsslot self-assigned this Nov 20, 2023

jpbruinsslot added 8 commits November 20, 2023 09:30

Remove test files

7514691

Fix serialization between sqlalchemy and pydantic models

91042df

Add events endpoint

014a80c

Implement event endpoints and add tests

4aadee8

Formatting

bd2a146

Ignore A002

86167c6

Merge branch 'main' into feature/mula/task-events

3214048

Merge branch 'main' into feature/mula/task-events

80663cb

jpbruinsslot marked this pull request as ready for review November 23, 2023 09:43

jpbruinsslot requested a review from a team as a code owner November 23, 2023 09:43

jpbruinsslot requested review from Donnype and ammar92 and removed request for ammar92 November 23, 2023 09:46

jpbruinsslot marked this pull request as draft November 23, 2023 10:45

Remove additional events

2f36177

Pickup in a new PR

This was referenced Nov 23, 2023

Add additional events support for tasks #2052

Closed

Task events for scheduler #1956

Open

jpbruinsslot added 5 commits November 27, 2023 13:46

Differentiate between list of task and individual tasks

36f8246

Remove postgres trigger

b0044cc

Remove trigger and add tests

d1572d2

Update tests

cf5fd6b

jpbruinsslot marked this pull request as ready for review December 6, 2023 17:09

jpbruinsslot added 2 commits December 7, 2023 10:39

Merge branch 'main' into feature/mula/task-events

ef2c691

Merge branch 'main' into feature/mula/task-events

8ad6a5f

jpbruinsslot added 2 commits December 7, 2023 12:25

Update and fix tests

eb8b97f

Formatting

97404b5

stephanie0x00 self-requested a review December 8, 2023 16:30

Merge branch 'main' into feature/mula/task-events

b96f51b

ammar92 previously approved these changes Dec 11, 2023

Combine exceptions

d489b48

jpbruinsslot dismissed ammar92’s stale review via d489b48 December 11, 2023 11:18

dekkers reviewed Dec 11, 2023

underdarknl mentioned this pull request Dec 15, 2023

As a user I want to see the impact/load a boefje has on a system #386

Open

jpbruinsslot marked this pull request as draft December 28, 2023 08:58

jpbruinsslot mentioned this pull request Jan 2, 2024

Alternative: Add task events to the scheduler #2214

Closed

jpbruinsslot closed this Jan 2, 2024

jpbruinsslot deleted the feature/mula/task-events branch February 18, 2025 09:04
