
Add task duration intervals to the scheduler #2451

Draft · wants to merge 28 commits into main
Conversation

@jpbruinsslot (Contributor) commented Feb 7, 2024

Changes

  • Add functionality to track duration intervals for tasks.
  • Added the columns pending_at, queued_at, dispatched_at, and running_at; each records the timestamp at which a task entered that state.
  • Added an update_status method to the Task class that calculates the duration of a task transition when the status updates (a minimal sketch follows below).
  • Added a meta JSONB column to the task, which runners can use to relay and persist information about a task, e.g. CPU load, memory load, run times.
  • Added a status-update PATCH request to the scheduler API so runners can update a task's status to running.
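
A minimal sketch of the update_status idea, assuming the field and status names from the list above; this is an illustration, not the PR's actual implementation:

```python
from datetime import datetime, timezone


class Task:
    # Maps a status to the column recording when the task entered it.
    STATUS_TIMESTAMPS = {
        "pending": "pending_at",
        "queued": "queued_at",
        "dispatched": "dispatched_at",
        "running": "running_at",
    }

    def update_status(self, new_status: str) -> None:
        now = datetime.now(timezone.utc)
        field = self.STATUS_TIMESTAMPS.get(new_status)
        if field is not None:
            setattr(self, field, now)  # stamp the state just entered
        self.status = new_status
        self.modified_at = now
```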

Closes #1956

* main:
  Add less verbose logging and introduce structured logging in the scheduler (#2143)
  Remove default setting of id to None (#2166)
  Web and system report integration testing (#2203)
  Download report JSON and ReportData OOI (#2205)
  Fixes sidemenu animation and behaviour (#2192)
  Feat/clearance level trail (#2202)
  Add scheduler stats to boefje/normaliser task pages in rocky. (#2204)
  Fix/ooi detail improvements (#2201)
  Translations update from Hosted Weblate (#2148)
  Add a request model for push endpoint in scheduler (#2169)
  fix rpki delete_on_close, can be just delete (#2170)
  Increases gap between items within dl (#2193)
* main:
  Updated deploy-pages action (#2251)
  Bump msgpackr from 1.6.1 to 1.10.1 in /rocky (#2219)
  Add note about Debian packages to 1.14 release notes (#2234)
  Upgrade GitHub actions (#2235)
  Remove RPKI data files (#2228)
  Add 1.14 release notes (#2227)
  remove feature flag for reports. (#2226)
  Add multi organization report (#2209)
  Preselected OOIs for Reporting (#2208)
  Text changes OOI into Object (#2224)
  Adds the option to add chapter letters (#2225)
  Fix typos in aggregate report (#2168)
  Alert icons (#2221)
  Fix health endpoint scheduler (#2218)
  Table captions (#2217)
  Initial version of aggregate report appendix (#2212)
  add html entities escaping to normalizer results (#2211)
  Pdf and print styling (#2206)
  Adding recommendations to the aggregate report (#2210)
@jpbruinsslot added the mula (Issues related to the scheduler) label on Feb 7, 2024
@jpbruinsslot self-assigned this on Feb 7, 2024
boefjes/boefjes/local.py (two outdated review threads, resolved)
dispatched = Column(Interval)
running = Column(Interval)

meta = Column(JSONB)
@jpbruinsslot (author) commented Feb 7, 2024:
Added the meta column so runners can relay and persist information about a task, e.g. CPU load, memory load, run times. If this info is already available through Bytes we could reference that instead. However, retrieving it from Bytes might be more difficult and time-consuming when we want to build rankings on that information, since it wouldn't be readily available in the scheduler.
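
For illustration, a runner-relayed meta payload could look like this; the JSONB column imposes no schema, so these keys are purely hypothetical:

```python
# Hypothetical payload for the meta column; none of these keys are
# defined by the PR itself.
meta = {
    "cpu_percent": 12.5,          # CPU load while the task ran
    "mem_peak_bytes": 104857600,  # peak resident memory
    "runtime_ms": 5230,           # wall-clock run time
}
```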

@jpbruinsslot (author):

We could opt to remove this field for now, since it depends on changes to, or implementation in, the runners.

Contributor:

I'd rather collect this info in the scheduler for fast access, as it's something the user wants to filter on and see in the task list. Having those details only in Bytes would mean we end up reading those JobMetas for the whole list, or worse, for all jobs, when rendering a filtered task list.

Contributor:

The question of where and how we want to store these meta values in the scheduler's database is still very valid, though.

modified_at=datetime.now(timezone.utc),
)
self.ctx.datastores.task_store.create_task(task)

@jpbruinsslot (author):

Refactored to make the sequence pending > queued > dispatched work correctly within the scheduler.

@jpbruinsslot marked this pull request as ready for review on February 7, 2024 13:00
@jpbruinsslot requested a review from a team as a code owner on February 7, 2024 13:00
weblate and others added 5 commits March 5, 2024 16:46
Co-authored-by: Jeroen Dekkers <[email protected]>
Co-authored-by: Joost Krapels <[email protected]> thanks ! @JKrapels
Co-authored-by: LibreTranslate <[email protected]>
Co-authored-by: Jan Klopper <[email protected]>
Co-authored-by: Jeroen Dekkers <[email protected]>
Co-authored-by: Ammar <[email protected]>
Co-authored-by: Madelon D <[email protected]>
Co-authored-by: Jeroen Dekkers <[email protected]>
Co-authored-by: stephanie0x00 <[email protected]>
* main:
  fix deprecated warning due to old env in .env-defaults (#2597)
  add extra checks for findings to dns report (#2506)
  Fix missing finding_type table (#2596)
  Translations update from Hosted Weblate (#2594)
  TLS Report unit tests (#2593)
  Filter out undeserializable objects from xtdb query in `construct_neighbour_query_multi` (#2592)
  Feature/efficient reporting for all reports (#2586)
@minvws deleted a comment from CLAassistant on Mar 5, 2024
@CLAassistant commented Mar 6, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
5 out of 6 committers have signed the CLA.

✅ madelondohmen
✅ originalsouth
✅ Donnype
✅ jpbruinsslot
✅ underdarknl
❌ weblate
You have signed the CLA already but the status is still pending? Let us recheck it.

@jpbruinsslot (author):

A rebase borked the PR a bit; I'll probably put the changes into a separate PR unless there is a way to fix the history.

@@ -22,10 +22,13 @@ def test_one_process(manager: SchedulerWorkerManager, item_handler: MockHandler)

patched_tasks = manager.scheduler_client.get_all_patched_tasks()

assert len(patched_tasks) == 3
assert patched_tasks[0] == ("70da7d4f-f41f-4940-901b-d98a92e9014b", "completed")
assert len(patched_tasks) == 6
@jpbruinsslot (author):

@Donnype, would you mind taking a look to check whether I'm making the correct assumptions in these tests?

Contributor:

@jpbruinsslot These patches seem logical, right? All of them go from running to either completed or failed 👍

boefjes/Makefile (review thread, resolved)
boefjes/boefjes/app.py (review thread, resolved)
boefjes/boefjes/local.py (outdated review thread, resolved)
@dekkers (Contributor) left a comment:

I am missing a functional specification of what we are trying to achieve. If we want to store an accurate execution time for boefjes and normalizers (which, as far as I understand it, is the primary reason for this change), we would first need to define what we mean by that and what we want to store. We will also be using the new Docker/Kubernetes runners soon, and I think it would be good to take those into account.

These are just some ideas; I think we need to brainstorm and discuss this more:

  • With the Docker/Kubernetes runner, if the container executes one task, the execution time might be the time the container has run.
  • If it is a long-running container that executes multiple tasks, we can't use the container execution time. One way would be to have the container report the execution time back when submitting the boefje results. We would have to trust the container to do this correctly, but I don't think that would be a big problem in practice.
  • Another option would be to calculate the time between the input being fetched from the boefjes API and the output being submitted. This might be less accurate if there is latency between the boefjes API and the container, or if the boefjes API is overloaded. If we want to add batching to the boefjes API, this would also not work.
  • For the current boefje/normalizer runner, we might take the execution time of the run function.

I think doing status-update queries from the runner to the scheduler and then storing timestamps won't give the most accurate execution time, because it includes the latency of those requests. For very simple boefjes and most normalizers, the overhead might even exceed the execution time itself. The latency might also be higher if, in the future, runners live inside different internal networks. If the scheduler and/or database is overloaded, the information would also be less accurate.

As currently implemented, we also include the time it takes to fetch OOIs or raw files and to save raw files or OOIs, which I'm not sure we should count as task duration.

I think the best solution would be for the runner to be responsible for calculating the execution time and providing it to the scheduler to store.
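
A minimal sketch of runner-side timing under that suggestion; the PATCH endpoint, payload shape, and base URL are assumptions, not the PR's actual API. time.monotonic() is used because it is immune to wall-clock adjustments:

```python
import time

import httpx

SCHEDULER_API = "http://scheduler:8000"  # assumed base URL


def run_and_report(task_id: str, boefje) -> None:
    start = time.monotonic()
    boefje.run()  # the work being measured
    duration_ms = int((time.monotonic() - start) * 1000)

    # Hypothetical status-update call; the real endpoint may differ.
    httpx.patch(
        f"{SCHEDULER_API}/tasks/{task_id}",
        json={"status": "completed", "duration_ms": duration_ms},
    )
```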

Database-wise, I think this can just be a 32-bit integer storing the execution time in milliseconds. That gives a resolution of 1 ms and a maximum of about 24.8 days, which should be enough. That way we would only need 4 bytes instead of the 16 bytes of the PostgreSQL interval type, which has far finer resolution and a far larger range than we need.
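
A quick sanity check on that range claim, in plain Python:

```python
# A signed 32-bit integer holds at most 2**31 - 1 milliseconds.
max_ms = 2**31 - 1
print(max_ms / (1000 * 60 * 60 * 24))  # ~24.86 days
```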

With regard to the other timestamps being added: do we really need them all? It won't be the biggest bottleneck, but if we want to scale to millions of tasks, I think it is important not to make the rows larger than necessary. The existing table design would also need improvements to achieve that, but this shouldn't stop us from doing new things in the most efficient way.

pending_at and created_at will always be the same, so I don't think a pending_at is useful. I'm also not sure how useful it is to store queued_at and dispatched_at for every task. Thinking from a user perspective, the interesting data are:

  • When the task was created (the created_at timestamp)
  • When the task was executed (the running_at timestamp; maybe executed_at would be a better name?)
  • How long the task execution took (duration in ms)

While the other timestamps might be useful for debugging what is going on inside the scheduler, I'm not sure that is reason enough to store them for every task on production systems. For debugging we also already have debug logging, with which we can see when a task is queued, fetched by the runner, and so on.

@@ -470,6 +474,16 @@ def patch_task(self, task_id: str, item: dict) -> Any:
            detail="failed to update task",
        ) from exc

    # Retrieve updated task
    try:
        updated_task = self.ctx.datastores.task_store.get_task_by_id(task_id)
Contributor:

I don't understand what you are trying to achieve by fetching the just-saved task from the database.

@dekkers (Contributor) commented Apr 8, 2024

As discussed (sorry for writing this up so late):

The duration the user wants to see is the execution runtime of the boefje/normalizer job. The only way to get this is to have the runner provide the information.

The duration needs to be sortable over a large number of tasks, because a user might want to sort tasks by duration to see which ones took a long time to run. So we should store the data in a way that is easy for PostgreSQL to sort. The duration can be stored most efficiently as milliseconds in a 32-bit integer. This gives a maximum of 2,147,483,648 milliseconds, or about 24.8 days, which should be more than enough.

There was also discussion about whether to store all the data in the task table or a separate table.

@dekkers (Contributor) commented Apr 8, 2024

There was also discussion about whether to store all the data in the task table or a separate table.

I think the duration and other task data we might collect in the future, such as CPU and memory usage, are best added to the task table itself. The reason is that we never update this data separately from the task table: when the runner returns this data, it does so at the same time as signalling that the task is done, so we already run an update query on the task table to change the status. It is more efficient to save the other data in that same update query than to do two update queries.

It also isn't useful to query this data without the task information, so in practice you would always join on the task table; for querying, too, there doesn't seem to be an advantage in storing the data in a separate table. The task also runs only once, so there is no 1:n relationship that would require a separate table.
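
A sketch of that single-update idea in SQLAlchemy terms; TaskDB, duration_ms, and the column names are illustrative assumptions, not the scheduler's actual model:

```python
from sqlalchemy import update
from sqlalchemy.orm import Session


def finish_task(session: Session, task_id: str, duration_ms: int) -> None:
    # One statement sets the terminal status and the runner-reported
    # duration together, avoiding a second round-trip UPDATE.
    stmt = (
        update(TaskDB)  # TaskDB: the (assumed) ORM model of the task table
        .where(TaskDB.id == task_id)
        .values(status="completed", duration_ms=duration_ms)
    )
    session.execute(stmt)
    session.commit()
```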

@jpbruinsslot (author) commented Apr 9, 2024

Considerations:

  • Do we want, or expect, to also track and query the durations of statuses other than RUNNING? If not, we can add a single duration column to the task table. If so, we need to consider whether to keep those status durations in the task table or to normalize them so they don't clutter the task table.
  • CPU/memory usage metrics seem specific to a boefje runner and might not apply to other task runners. Additionally, we're not considering metrics like disk usage and network usage, which might likewise not be relevant to every runner. Given the agnostic nature of the scheduler, we need to evaluate whether to normalize those metrics fields as well. I suggest we discuss this in a separate ticket, since it also raises UX questions: how is this going to be used, queried, and visualized, and are other tools that track these metrics better suited than integrating this into the scheduler?

pending_at: datetime | None = Field(default_factory=lambda: datetime.now(timezone.utc))
queued_at: datetime | None = None
dispatched_at: datetime | None = None
running_at: datetime | None = None
Contributor:

started_at?


created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
modified_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
Contributor:

Is this needed if the only state changes are for the status that you use the ..._at fields for?

meta: dict | None = None

# Status transition timestamps
pending_at: datetime | None = Field(default_factory=lambda: datetime.now(timezone.utc))
Contributor:

This is just the same as created_at? Could we reduce this to:

created_at
queued_at
started_at
ended_at

The latter two being set by the runner?

Contributor:

This way you know:

  • how long tasks take to complete from queueing (ended_at - queued_at)
  • when the task actually started (started_at)
  • the runtime (ended_at - started_at)
  • the complete lifecycle (ended_at - created_at), for throughput analysis
  • the queueing time (started_at - queued_at (?))

Also, I'd imagine that for long-running tasks you'd be interested in when they started, not when they were queued. That is, I'd say, a valid use case for the started_at/ended_at approach versus storing only runtimes, unless you want to infer it through modified_at. The end times in general would also be a relevant column to analyse.
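
For concreteness, a small sketch of the durations derivable from the four proposed timestamps; the function and its keys are illustrative only:

```python
from datetime import datetime


def derived_metrics(created_at: datetime, queued_at: datetime,
                    started_at: datetime, ended_at: datetime) -> dict:
    return {
        "queue_time": started_at - queued_at,      # waiting on the queue
        "runtime": ended_at - started_at,          # actual execution
        "time_to_complete": ended_at - queued_at,  # queueing until done
        "lifecycle": ended_at - created_at,        # full throughput window
    }
```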

Contributor:

What I just thought of: I guess the default sorting is going to be modified_at in the setup where we only save runtimes?

@jpbruinsslot marked this pull request as draft on May 14, 2024 07:27
Labels
mula Issues related to the scheduler

Successfully merging this pull request may close these issues: Task events for scheduler

8 participants