Add task duration intervals to the scheduler #2451
base: main
Conversation
* main:
  Add less verbose logging and introduce structured logging in the scheduler (#2143)
  Remove default setting of id to None (#2166)
  Web and system report integration testing (#2203)
  Download report JSON and ReportData OOI (#2205)
  Fixes sidemenu animation and behaviour (#2192)
  Feat/clearance level trail (#2202)
  Add scheduler stats to boefje/normaliser task pages in rocky. (#2204)
  Fix/ooi detail improvements (#2201)
  Translations update from Hosted Weblate (#2148)
  Add a request model for push endpoint in scheduler (#2169)
  fix rpki delete_on_close, can be just delete (#2170)
  Increases gap between items within dl (#2193)
* main:
  Updated deploy-pages action (#2251)
  Bump msgpackr from 1.6.1 to 1.10.1 in /rocky (#2219)
  Add note about Debian packages to 1.14 release notes (#2234)
  Upgrade GitHub actions (#2235)
  Remove RPKI data files (#2228)
  Add 1.14 release notes (#2227)
  remove feature flag for reports. (#2226)
  Add multi organization report (#2209)
  Preselected OOIs for Reporting (#2208)
  Text changes OOI into Object (#2224)
  Adds the option to add chapter letters (#2225)
  Fix typos in aggregate report (#2168)
  Alert icons (#2221)
  Fix health endpoint scheduler (#2218)
  Table captions (#2217)
  Initial version of aggregate report appendix (#2212)
  add html entities escaping to normalizer results (#2211)
  Pdf and print styling (#2206)
  Adding recommendations to the aggregate report (#2210)
dispatched = Column(Interval)
running = Column(Interval)

meta = Column(JSONB)
Added a `meta` column for runners to relay and persist information about a task, e.g. CPU load, memory load, run times, etc. If this info is already available through Bytes we can opt to reference that. However, it might be more difficult and time-consuming to retrieve that information when we want to make rankings based on it, since it isn't readily available in the scheduler.
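For illustration, the kind of payload a runner could relay through this column might look like the sketch below; none of these keys are an agreed-upon schema, they only show the sort of per-task information mentioned above.

```python
# Hypothetical example of what a runner could store in a task's `meta` column.
# None of these keys are part of an agreed-upon schema; they only illustrate
# the kind of per-task information (cpu, memory, run times) discussed here.
task_meta = {
    "cpu_seconds": 1.42,             # CPU time consumed by the job
    "max_memory_bytes": 52_428_800,  # peak memory usage of the job
    "runtime_seconds": 3.7,          # wall-clock execution time reported by the runner
}

# A runner could then include this payload when it patches the task, e.g.
# PATCH /tasks/{task_id} with {"status": "completed", "meta": task_meta}.
```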
We can opt to remove this field for now, since it depends on changes to or implementation in the runners.
I'd rather collect this info in the scheduler for fast access, as it's something the user wants to filter on and see in the task list. Having those details only in Bytes would mean we end up reading those job metas for the whole list, or worse, for all jobs when rendering a filtered task list.
The question of where and how we want to store these meta values in the scheduler's database is still very valid though.
modified_at=datetime.now(timezone.utc),
)
self.ctx.datastores.task_store.create_task(task)
Refactored to make the sequence pending > queued > dispatched work correctly within the scheduler
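A rough sketch of how such a status transition helper could look; the attribute names are taken from the columns in this PR, but the actual `update_status` implementation may differ.

```python
from datetime import datetime, timezone


def update_status(task, new_status: str) -> None:
    """Sketch of a status transition helper: stamp the state being entered and
    record how long the previous state lasted. Attribute names mirror the
    columns discussed in this PR (pending_at/queued_at/dispatched_at/running_at
    and the dispatched/running interval columns); the real method may differ."""
    now = datetime.now(timezone.utc)

    # Timestamp for the state the task is entering, e.g. running -> running_at.
    if hasattr(task, f"{new_status}_at"):
        setattr(task, f"{new_status}_at", now)

    # Duration of the state being left, e.g. leaving "dispatched" fills the
    # `dispatched` interval column with (now - dispatched_at).
    previous_entered_at = getattr(task, f"{task.status}_at", None)
    if previous_entered_at is not None and hasattr(task, task.status):
        setattr(task, task.status, now - previous_entered_at)

    task.status = new_status
    task.modified_at = now
```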
* main:
  fix deprecated warning due to old env in .env-defaults (#2597)
  add extra checks for findings to dns report (#2506)
  Fix missing finding_type table (#2596)
  Translations update from Hosted Weblate (#2594)
  TLS Report unit tests (#2593)
  Filter out undeserializable objects from xtdb query in `construct_neighbour_query_multi` (#2592)
  Feature/efficient reporting for all reports (#2586)
A rebase borked the PR a bit; I'm probably going to put the changes into a separate PR, unless there is a way to fix the history.
@@ -22,10 +22,13 @@ def test_one_process(manager: SchedulerWorkerManager, item_handler: MockHandler)

patched_tasks = manager.scheduler_client.get_all_patched_tasks()

assert len(patched_tasks) == 3
assert patched_tasks[0] == ("70da7d4f-f41f-4940-901b-d98a92e9014b", "completed")
assert len(patched_tasks) == 6
@Donnype, would you mind taking a look to check whether I'm making the correct assumptions within these tests?
@jpbruinsslot These patches seem logical, right? All of them go from running to either completed or failed 👍
I am missing a functional specification of what we are trying to achieve. If we want to store an accurate execution time of boefjes and normalizers (which is the primary reason for this change as far as I understood it), we would first need to define what we mean by that and what we want to store. We will also be using the new Docker/Kubernetes runners soon, and I think it would be good to take those into account.
These are just some ideas; I think we need to brainstorm/discuss this more:
- With the Docker/Kubernetes runner, if the container executes one task, the execution time might be the time the container has run.
- If it is a long-running container that executes multiple tasks, we can't use the container execution time. One way would be to have the container report the execution time back when submitting the boefje results. We would have to trust the container to do this correctly, but I don't think that would be a big problem in practice.
- Another option would be to calculate the time between the input being fetched from the boefjes API and the output being submitted. This might be less accurate if there is latency between the boefjes API and the container, or if the boefjes API is overloaded. If we want to add batching in the boefjes API, this would also not work.
- For the current boefje/normalizer runner, we might take the execution time of the `run` function (see the sketch below).
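A minimal sketch of that last option, timing only the `run` call inside the runner; the `job.run(job_meta)` call here is illustrative and stands in for the actual boefje/normalizer entry point.

```python
import time


def execute_with_timing(job, job_meta):
    """Measure only the job's own execution, excluding the I/O the runner does
    before and after it. Sketch only: `job.run(job_meta)` stands in for the
    actual boefje/normalizer entry point, which may look different."""
    start = time.monotonic()
    output = job.run(job_meta)
    duration_ms = int((time.monotonic() - start) * 1000)
    return output, duration_ms
```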
I think doing status update queries from the runner to the scheduler and then storing the timestamps won't give the most accurate execution time, because it will include the latency of those requests. For very simple boefjes and most normalizers the overhead might even be more than the execution time. The latency might also be higher if we are going to have runners inside different internal networks in the future. If the scheduler and/or database is overloaded, it would also give more inaccurate information.
The way it is currently implemented, we also include the time it takes to fetch OOIs or raw files and to save raw files or OOIs, which I'm not sure we should include in the task duration.
I think that the best solution would be for the runner to be responsible for calculating the execution time and providing that to the scheduler to store it.
Database-wise, I think this can just be a 32-bit integer that stores the execution time in milliseconds. That provides a resolution of 1 ms and a maximum of about 24 days, which should be enough. That way we would only need 4 bytes instead of the 16 bytes of the PostgreSQL interval type, which has a much finer resolution and a much larger range than we need.
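As a sketch of that idea; the column name and placement are illustrative and not part of this PR.

```python
from sqlalchemy import Column, Integer

# Illustrative only: a 4-byte integer holding the runner-reported execution
# time in milliseconds, instead of a 16-byte Interval column. A signed 32-bit
# value tops out at 2147483647 ms, roughly 24.8 days per task.
duration_ms = Column(Integer, nullable=True)
```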
With regards to the other timestamps being added, do we really need them all? It's not going to be the biggest bottleneck, but if we want to scale to millions of tasks, I think it is important not to make the rows larger than necessary. The existing table design would also need improvements to achieve this, but that shouldn't stop us from doing new things in the most efficient way.
`pending_at` and `created_at` will always be the same, so I don't think it is useful to have a `pending_at`. I'm also not sure how useful it is to store `queued_at` and `dispatched_at` for every task. If I think from a user perspective, the interesting data are:
- When was the task created (created_at timestamp)
- When was the task executed (running_at timestamp, maybe executed_at would be a better name?)
- How long did task execution take (duration in ms).
While the other timestamps might be useful for debugging what is going on inside the scheduler, I'm not sure that is reason enough to store them for every task in production systems. For debugging we also already have debug logging, with which we can see when a task is queued, fetched by the runner, etc.
@@ -470,6 +474,16 @@ def patch_task(self, task_id: str, item: dict) -> Any:
detail="failed to update task",
) from exc

# Retrieve updated task
try:
    updated_task = self.ctx.datastores.task_store.get_task_by_id(task_id)
I don't understand what you are trying to achieve with fetching the just saved task from the database.
As discussed (sorry for writing this up so late): the duration that the user wants to see is the execution runtime of the boefje/normalizer job. The only way to do this is to have the runner provide the information. The duration needs to be sortable with a large number of tasks, because a user might want to sort the tasks by duration to see which tasks took a long time to run, so we should store the data in such a way that it is easy for PostgreSQL to sort. The duration can be stored most efficiently as milliseconds using a 32-bit integer. This would have a maximum of 2147483647 milliseconds, or about 24.8 days, which should be more than enough. There was also discussion about whether to store all the data in the task table or in a separate table.
I think the duration and other task data we might collect in the future, such as CPU usage and memory usage, can best be added to the task table itself. The reason I think this is best is that we never update this data separately from the task table. When the runner returns this data, it does so at the same time as signalling that the task is done, so we already do an update query on the task table to change the status. It is more efficient to save the other data in the same update query instead of doing two update queries. It also isn't useful to query the data without the task information, so in practice you would always join on the task table; also for querying, there doesn't seem to be an advantage in storing the data in a separate table. The task also runs only once, so there is no 1:n relationship that would require a separate table.
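A sketch of what that single update could look like, assuming a hypothetical `duration_ms` column on the task table; `task_db_model` stands in for the scheduler's task table model and the other names are illustrative.

```python
from datetime import datetime, timezone

from sqlalchemy import update
from sqlalchemy.orm import Session


def finish_task(session: Session, task_db_model, task_id: str, status: str, duration_ms: int) -> None:
    """Sketch: record the final status and the runner-reported duration in the
    same UPDATE, instead of two separate queries. `task_db_model` stands in for
    the scheduler's task table model; `duration_ms` is a hypothetical column."""
    stmt = (
        update(task_db_model)
        .where(task_db_model.id == task_id)
        .values(
            status=status,
            duration_ms=duration_ms,
            modified_at=datetime.now(timezone.utc),
        )
    )
    session.execute(stmt)
    session.commit()
```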
Considerations:
pending_at: datetime | None = Field(default_factory=lambda: datetime.now(timezone.utc))
queued_at: datetime | None = None
dispatched_at: datetime | None = None
running_at: datetime | None = None
`started_at`?
created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
modified_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
Is this needed if the only state changes are to the status, which you already track with the `..._at` fields?
meta: dict | None = None

# Status transition timestamps
pending_at: datetime | None = Field(default_factory=lambda: datetime.now(timezone.utc))
This is just the same as `created_at`? Could we reduce this to:
- `created_at`
- `queued_at`
- `started_at`
- `ended_at`

The latter two being set by the runner?
This way you know:
- how long it takes for tasks to complete from queueing them (ended_at - queued_at)
- when the task actually started (started_at)
- the runtime (ended_at - started_at)
- the complete lifecycle (ended_at - created_at), for throughput analysis
- the queueing time (started_at - queued_at (?))
Also, I'd imagine that for long-running tasks you'd be interested in when they started, not when they were queued. That's, I'd say, a valid use case for the started_at/ended_at approach vs. the runtimes? Unless you want to infer that through modified_at. Or the end times in general would be a relevant column to analyse.
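Putting the reduced field set and these derived durations together, a sketch of what that could look like; the class, field, and property names are illustrative.

```python
from datetime import datetime, timedelta, timezone

from pydantic import BaseModel, Field


class TaskTimestamps(BaseModel):
    """Sketch of the reduced set of timestamps discussed above; started_at and
    ended_at would be reported by the runner."""

    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    queued_at: datetime | None = None
    started_at: datetime | None = None
    ended_at: datetime | None = None

    @property
    def runtime(self) -> timedelta | None:
        # How long the job itself ran.
        if self.started_at and self.ended_at:
            return self.ended_at - self.started_at
        return None

    @property
    def queue_time(self) -> timedelta | None:
        # How long the task waited on the queue before it started.
        if self.queued_at and self.started_at:
            return self.started_at - self.queued_at
        return None

    @property
    def lifecycle(self) -> timedelta | None:
        # Creation to completion, for throughput analysis.
        if self.ended_at:
            return self.ended_at - self.created_at
        return None
```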
What I just thought of: I guess the default sorting is going to be the `modified_at` in the setup where we only save runtimes?
Changes
- `pending_at`, `queued_at`, `dispatched_at`, `running_at` are added; these signify the timestamp that a task was in that particular state.
- `Task` class: added `update_status` to calculate the duration of a task transition when the status updates.
- Added `meta` jsonb column to task, for runners to relay and persist information about a task, e.g. cpu, mem load, run times etc.
- `running`

Closes #1956