Add task events to the scheduler #2043

Closed
wants to merge 23 commits into from


@jpbruinsslot jpbruinsslot commented Nov 20, 2023

Changes

This PR adds task event functionality to the scheduler.

  1. Adds an event to the events table on inserts and updates on the tasks table, e.g. when a task is created or a task status is updated. This is done by leveraging PostgreSQL triggers.
  2. Exposes the ability to add events to the events table through a REST API endpoint.
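
The trigger mechanism from point 1 can be sketched roughly as follows. This is only an illustration, not the PR's implementation: it uses SQLite from the Python standard library as a stand-in for PostgreSQL so that it runs without a database server, and the table, column, and trigger names are assumptions.

```python
import sqlite3

# NOTE: SQLite stand-in for the PostgreSQL triggers described in the PR.
# All table, column, and trigger names here are hypothetical.
SCHEMA = """
CREATE TABLE tasks (
    id TEXT PRIMARY KEY,
    status TEXT NOT NULL
);
CREATE TABLE events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id TEXT NOT NULL,
    event TEXT NOT NULL,
    status TEXT NOT NULL
);
CREATE TRIGGER tasks_insert_event AFTER INSERT ON tasks
BEGIN
    INSERT INTO events (task_id, event, status)
    VALUES (NEW.id, 'insert', NEW.status);
END;
CREATE TRIGGER tasks_update_event AFTER UPDATE ON tasks
BEGIN
    INSERT INTO events (task_id, event, status)
    VALUES (NEW.id, 'update', NEW.status);
END;
"""


def record_task_lifecycle() -> list:
    """Insert and update one task, then return the events the triggers wrote."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO tasks (id, status) VALUES (?, ?)", ("task-1", "queued"))
    conn.execute("UPDATE tasks SET status = ? WHERE id = ?", ("running", "task-1"))
    conn.execute("UPDATE tasks SET status = ? WHERE id = ?", ("completed", "task-1"))
    rows = conn.execute(
        "SELECT task_id, event, status FROM events ORDER BY id"
    ).fetchall()
    conn.close()
    return rows
```

The application code only touches the tasks table; the events rows appear as a side effect, which is the "background magic" trade-off raised under Considerations.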

This allows us to:

  1. Add calculated attributes to a task like how long a task was queued, how long the task ran for in the task runner, how long the task took in total.
  2. Create additional events that are related to a task, e.g. how much cpu, memory, and network i/o the task runner used.
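
As a rough sketch of how the calculated attributes from point 1 could be derived from an event log: the status names ("queued", "running", "completed") and the `TaskEvent` shape below are assumptions made for illustration, not the scheduler's actual model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TaskEvent:
    # Hypothetical event shape: one status change at a point in time.
    status: str
    timestamp: datetime


def first_timestamp(events: List[TaskEvent], status: str) -> Optional[datetime]:
    """Return the earliest timestamp at which the task reached `status`."""
    matches = [e.timestamp for e in events if e.status == status]
    return min(matches) if matches else None


def task_timings(events: List[TaskEvent]) -> Dict[str, Optional[timedelta]]:
    """Compute queued, runtime, and duration from status-change events."""
    queued_at = first_timestamp(events, "queued")
    started_at = first_timestamp(events, "running")
    finished_at = first_timestamp(events, "completed")

    def delta(start: Optional[datetime], end: Optional[datetime]) -> Optional[timedelta]:
        # A timing is unknown until both of its endpoints have been recorded.
        if start is None or end is None:
            return None
        return end - start

    return {
        "queued": delta(queued_at, started_at),
        "runtime": delta(started_at, finished_at),
        "duration": delta(queued_at, finished_at),
    }
```

A task that has not finished yet simply yields `None` for the timings that cannot be computed.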

Considerations

  • If we want, we can still use columns for the runtime, queued, and duration fields instead of adding them to an events table. However, this makes it harder to add further information such as cpu, memory, and network i/o load without adding more database columns.
  • The use of postgres triggers could be considered a bit of obfuscation, a bit of magic that happens in the 'background'. We probably want to use Python instead, on task updates from the api.
  • Additionally, we probably want to move away from those triggers if the task runner creates events like cpu, memory load etc.
  • Scalability: under high load we can end up with a large events table, so partitioning, or compacting on date, should perhaps be considered.
  • The queries for duration, queued, and runtime should only be run when requested. They should not be run for the task list endpoint (this has been implemented).
  • The fields type, context, and event can be superfluous and could be combined into one field, perhaps with dot notation.

TODO

  • Create EventStore
  • Create postgresql trigger on task updates to create event in the events table
  • Fix serialization between sqlalchemy and pydantic models
  • Implement events endpoint
  • Write tests
  • Make sure that list task view doesn't call the individual queries, that should be done on the detail view

Issue link

Closes 1956

@jpbruinsslot jpbruinsslot added the mula Issues related to the scheduler label Nov 20, 2023
@jpbruinsslot jpbruinsslot self-assigned this Nov 20, 2023
@jpbruinsslot jpbruinsslot marked this pull request as ready for review November 23, 2023 09:43
@jpbruinsslot jpbruinsslot requested a review from a team as a code owner November 23, 2023 09:43
@jpbruinsslot jpbruinsslot requested review from Donnype and ammar92 and removed request for ammar92 November 23, 2023 09:46
@jpbruinsslot jpbruinsslot marked this pull request as draft November 23, 2023 10:45
Pickup in a new PR
* main: (25 commits)
  Create object history API (#2074)
  Bump actions/github-script from 6 to 7 (#2076)
  Installation manual for Windows (2) (#2096)
  Update howdoesitwork.rst (#2091)
  Add benchmarking script to the scheduler (#2071)
  Add fix-poetry-merge-conflict makefile command (#2088)
  Bump sphinx-rtd-theme from 1.2.2 to 2.0.0 (#2080)
  Lower quality level so the CI check doesn't fail (#2086)
  Update xtdb version in octopoes CI docker compose and docker-compose.release-example.yml (#2085)
  Name test nodes by testname instead of uuid (#2087)
  Upgrade to Pydantic v2 (#1912)
  Docs: add dependency installation commands for RHEL based systems (#2059)
  Fix/2072 (#2082)
  Feature/service to systems reports rocky (#2073)
  Update scheduler python packages (#2062)
  Add uvicorn back as non-dev dependency (#2053)
  Bump `cryptography` (#2070)
  Filter tree objects with depth=1 for Findings  (#1982)
  Bump aiohttp from 3.8.6 to 3.9.0 in /boefjes (#2061)
  Translations update from Hosted Weblate (#2057)
  ...
@jpbruinsslot jpbruinsslot marked this pull request as ready for review December 6, 2023 17:09
@stephanie0x00 stephanie0x00 self-requested a review December 8, 2023 16:30
ammar92
ammar92 previously approved these changes Dec 11, 2023
Contributor

@ammar92 ammar92 left a comment


Looks good in general and nice tests 👍 Some general remarks to reconsider:

  • Although it's a bit challenging since you need to define these upfront (as much as possible), you might want to use enum fields in the events table (e.g. for type and event). This will improve performance in db time
  • Instead of storing the complete task in the event, consider referencing it by its ID. This not only saves storage, but also (slightly) improves query response times. Keep in mind though, that this could potentially lead to more queries if the task data is required later. However, in such cases, it's worth looking at materialized views for such specific queries.

mula/scheduler/server/server.py Outdated Show resolved Hide resolved
@jpbruinsslot
Contributor Author

Looks good in general and nice tests 👍 Some general remarks to reconsider:

  • Although it's a bit challenging since you need to define these upfront (as much as possible), you might want to use enum fields in the events table (e.g. for type and event). This will improve performance in db time

Good idea, I think I'll pick this up in another PR. At the moment I'm still determining what the values of these columns could and should be.

  • Instead of storing the complete task in the event, consider referencing it by its ID. This not only saves storage, but also (slightly) improves query response times. Keep in mind though, that this could potentially lead to more queries if the task data is required later. However, in such cases, it's worth looking at materialized views for such specific queries.

Yeah, I specifically wanted to keep the task data in there, mainly because it gives us a way to track state changes of an object, which would allow us to do more complex queries. At the moment this is indeed limited to tasks, but it will be extended with events regarding tasks, task consumption for instance.

@stephanie0x00
Contributor

Checklist for QA:

  • I have checked out this branch, and successfully ran a fresh make reset.
  • I confirmed that there are no unintended functional regressions in this branch:
    • I have managed to pass the onboarding flow
    • Objects and Findings are created properly
    • Tasks are created and completed properly
  • I confirmed that the PR's advertised feature or hotfix works as intended.

What works:

Onboarding flow works, can schedule additional boefjes, tasks are scheduled as expected and additional findings are created.

What doesn't work:

n/a

Bug or feature?:

n/a

Contributor

@dekkers dekkers left a comment


As far as I can see, the events table solution will scale a lot worse than simply storing the data in the task table. The task will only be executed once and duration/queued/runtime only need to be stored once for a task, so there is no need for a 1:n relation table. Storing them as simple fields in the task table means:

  • We can give the fields proper timedelta/interval types
  • We can easily index it. While you can also add indexes to jsonb, that will always have more unnecessary indirection and will thus be slower.
  • Smaller database. This is important for performance, because in my experience performance doesn't decrease linearly with database size, but it goes downhill a lot faster when the database gets bigger.
  • Half the number of queries, because the events table needs an extra insert every time we do an insert and update.

With regard to data that is not generated yet, for runtimes that do not exist yet, I'd say it would be better to follow the YAGNI principle (https://en.wikipedia.org/wiki/You_aren't_gonna_need_it). And even if we already want to implement it, it would be better to just add a jsonb field in the task table for it.

type="events.db",
context="task",
event="update",
data=task.model_dump(),


This will result in a very big event table, because the task model is pretty big already and this will be duplicated unnecessarily multiple times in the events table. I don't think this is a good idea with regards to performance and resource usage.


queued: Optional[float] = Field(None, alias="queued", readonly=True)

runtime: Optional[float] = Field(None, alias="runtime", readonly=True)


Type should be timedelta instead of float.
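
A minimal sketch of what the suggested type change buys, using a standard-library dataclass instead of the PR's pydantic model so that it runs without dependencies. The `TaskTimings` class and the `to_seconds` helper are hypothetical names, not part of the scheduler.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class TaskTimings:
    # timedelta carries its own unit, unlike a bare float of seconds.
    queued: Optional[timedelta] = None
    runtime: Optional[timedelta] = None


def to_seconds(value: Optional[timedelta]) -> Optional[float]:
    """Convert to float seconds only at the edge, e.g. when serializing."""
    return None if value is None else value.total_seconds()
```

With `timedelta`, comparisons and arithmetic stay unit-safe inside the application, and a float is produced only where an external format requires one.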


id = Column(Integer, primary_key=True)

task_id = Column(GUID)


This should be a foreign key to the task table.
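
To illustrate what a foreign key adds here, the sketch below again uses SQLite from the standard library as a stand-in; the table layout is an assumption, not the PR's schema. In the SQLAlchemy model the equivalent would presumably be a `ForeignKey` on the `task_id` column, pointing at whatever the task table is actually called.

```python
import sqlite3


def make_connection() -> sqlite3.Connection:
    """Create an in-memory database where events.task_id references tasks.id."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE events ("
        " id INTEGER PRIMARY KEY,"
        " task_id TEXT NOT NULL REFERENCES tasks (id),"
        " event TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO tasks (id) VALUES ('task-1')")
    return conn


def insert_event(conn: sqlite3.Connection, task_id: str) -> bool:
    """Try to insert an event; return False if the task does not exist."""
    try:
        conn.execute(
            "INSERT INTO events (task_id, event) VALUES (?, ?)", (task_id, "update")
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Without the constraint, an event pointing at a non-existent task would be accepted silently; with it, the database rejects the row.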

@jpbruinsslot
Contributor Author

As far as I can see, the events table solution will scale a lot worse than simply storing the data in the task table. The task will only be executed once and duration/queued/runtime only need to be stored once for a task, so there is no need for a 1:n relation table. Storing them as simple fields in the task table means:

  • We can give the fields proper timedelta/interval types
  • We can easily index it. While you can also add indexes to jsonb, that will always have more unnecessary indirection and will thus be slower.
  • Smaller database. This is important for performance, because in my experience performance doesn't decrease linearly with database size, but it goes downhill a lot faster when the database gets bigger.
  • Half the number of queries, because the events table needs an extra insert every time we do an insert and update.

With regard to data that is not generated yet, for runtimes that do not exist yet, I'd say it would be better to follow the YAGNI principle (https://en.wikipedia.org/wiki/You_aren't_gonna_need_it). And even if we already want to implement it, it would be better to just add a jsonb field in the task table for it.

Yes, you bring up valid points. My consideration with the suggested changes was that keeping an event log gives us more flexibility to do more complex queries regarding the tasks (#1578). However, I agree we can opt for a simpler approach and cross that bridge when we get there.

@jpbruinsslot
Contributor Author

Closing, superseded by #2214

@jpbruinsslot jpbruinsslot deleted the feature/mula/task-events branch February 18, 2025 09:04
Labels
mula Issues related to the scheduler
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Task events for scheduler
5 participants