
Add task duration intervals to the scheduler #2451

Draft · wants to merge 28 commits into main
Conversation

@jpbruinsslot (Contributor) commented Feb 7, 2024

Changes

  • Add functionality to track duration intervals for tasks.
  • Added the columns pending_at, queued_at, dispatched_at, and running_at; each records the timestamp at which a task entered that state.
  • Added an update_status method to the Task class that calculates the duration of a task transition when the status updates (a minimal sketch follows below).
  • Added a meta JSONB column to the task, which runners can use to relay and persist information about a task, e.g. CPU load, memory load, run times.
  • Added a status-update PATCH request to the scheduler API so runners can update a task's status to running.
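
A minimal sketch of the update_status idea, assuming the field and status names from the list above; this is an illustration, not the PR's actual implementation:

```python
from datetime import datetime, timezone


class Task:
    # Maps a status to the column recording when the task entered it.
    STATUS_TIMESTAMPS = {
        "pending": "pending_at",
        "queued": "queued_at",
        "dispatched": "dispatched_at",
        "running": "running_at",
    }

    def update_status(self, new_status: str) -> None:
        now = datetime.now(timezone.utc)
        field = self.STATUS_TIMESTAMPS.get(new_status)
        if field is not None:
            setattr(self, field, now)  # stamp the state just entered
        self.status = new_status
        self.modified_at = now
```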

Closes #1956

* main:
  Add less verbose logging and introduce structured logging in the scheduler (#2143)
  Remove default setting of id to None (#2166)
  Web and system report integration testing (#2203)
  Download report JSON and ReportData OOI (#2205)
  Fixes sidemenu animation and behaviour (#2192)
  Feat/clearance level trail (#2202)
  Add scheduler stats to boefje/normaliser task pages in rocky. (#2204)
  Fix/ooi detail improvements (#2201)
  Translations update from Hosted Weblate (#2148)
  Add a request model for push endpoint in scheduler (#2169)
  fix rpki delete_on_close, can be just delete (#2170)
  Increases gap between items within dl (#2193)
* main:
  Updated deploy-pages action (#2251)
  Bump msgpackr from 1.6.1 to 1.10.1 in /rocky (#2219)
  Add note about Debian packages to 1.14 release notes (#2234)
  Upgrade GitHub actions (#2235)
  Remove RPKI data files (#2228)
  Add 1.14 release notes (#2227)
  remove feature flag for reports. (#2226)
  Add multi organization report (#2209)
  Preselected OOIs for Reporting (#2208)
  Text changes OOI into Object (#2224)
  Adds the option to add chapter letters (#2225)
  Fix typos in aggregate report (#2168)
  Alert icons (#2221)
  Fix health endpoint scheduler (#2218)
  Table captions (#2217)
  Initial version of aggregate report appendix (#2212)
  add html entities escaping to normalizer results (#2211)
  Pdf and print styling (#2206)
  Adding recommendations to the aggregate report (#2210)
@jpbruinsslot added the mula (Issues related to the scheduler) label on Feb 7, 2024
@jpbruinsslot self-assigned this on Feb 7, 2024
boefjes/boefjes/local.py (two outdated review threads, resolved)
dispatched = Column(Interval)
running = Column(Interval)

meta = Column(JSONB)
@jpbruinsslot (author) commented Feb 7, 2024:
Added the meta column so runners can relay and persist information about a task, e.g. CPU load, memory load, run times. If this info is already available through Bytes we could reference that instead. However, retrieving it from Bytes might be more difficult and time-consuming when we want to build rankings on that information, since it wouldn't be readily available in the scheduler.
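
For illustration, a runner-relayed meta payload could look like this; the JSONB column imposes no schema, so these keys are purely hypothetical:

```python
# Hypothetical payload for the meta column; none of these keys are
# defined by the PR itself.
meta = {
    "cpu_percent": 12.5,          # CPU load while the task ran
    "mem_peak_bytes": 104857600,  # peak resident memory
    "runtime_ms": 5230,           # wall-clock run time
}
```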

@jpbruinsslot (author):

We could opt to remove this field for now, since it depends on changes to, or implementation in, the runners.

Contributor:

I'd rather collect this info in the scheduler for fast access, as it's something the user wants to filter on and see in the task list. Having those details only in Bytes would mean we end up reading those JobMetas for the whole list, or worse, for all jobs, when rendering a filtered task list.

Contributor:

The question of where and how we want to store these meta values in the scheduler's database is still very valid, though.

modified_at=datetime.now(timezone.utc),
)
self.ctx.datastores.task_store.create_task(task)

@jpbruinsslot (author):

Refactored to make the sequence pending > queued > dispatched work correctly within the scheduler.

@jpbruinsslot marked this pull request as ready for review on February 7, 2024 13:00
@jpbruinsslot requested a review from a team as a code owner on February 7, 2024 13:00
weblate and others added 5 commits March 5, 2024 16:46
Co-authored-by: Jeroen Dekkers <[email protected]>
Co-authored-by: Joost Krapels <[email protected]> thanks ! @JKrapels
Co-authored-by: LibreTranslate <[email protected]>
Co-authored-by: Jan Klopper <[email protected]>
Co-authored-by: Jeroen Dekkers <[email protected]>
Co-authored-by: Ammar <[email protected]>
Co-authored-by: Madelon D <[email protected]>
Co-authored-by: Jeroen Dekkers <[email protected]>
Co-authored-by: stephanie0x00 <[email protected]>
* main:
  fix deprecated warning due to old env in .env-defaults (#2597)
  add extra checks for findings to dns report (#2506)
  Fix missing finding_type table (#2596)
  Translations update from Hosted Weblate (#2594)
  TLS Report unit tests (#2593)
  Filter out undeserializable objects from xtdb query in `construct_neighbour_query_multi` (#2592)
  Feature/efficient reporting for all reports (#2586)
@minvws deleted a comment from CLAassistant on Mar 5, 2024
@CLAassistant commented Mar 6, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
5 out of 6 committers have signed the CLA.

✅ madelondohmen
✅ originalsouth
✅ Donnype
✅ jpbruinsslot
✅ underdarknl
❌ weblate
You have signed the CLA already but the status is still pending? Let us recheck it.

@jpbruinsslot (author):

A rebase borked the PR a bit; I'll probably put the changes into a separate PR unless there is a way to fix the history.

@@ -22,10 +22,13 @@ def test_one_process(manager: SchedulerWorkerManager, item_handler: MockHandler)

patched_tasks = manager.scheduler_client.get_all_patched_tasks()

assert len(patched_tasks) == 3
assert patched_tasks[0] == ("70da7d4f-f41f-4940-901b-d98a92e9014b", "completed")
assert len(patched_tasks) == 6
@jpbruinsslot (author):

@Donnype, would you mind taking a look to check whether I'm making the correct assumptions in these tests?

Contributor:

@jpbruinsslot These patches seem logical, right? All of them go from running to either completed or failed 👍

boefjes/Makefile (review thread, resolved)
boefjes/boefjes/app.py (review thread, resolved)
boefjes/boefjes/local.py (outdated review thread, resolved)
@dekkers (Contributor) left a comment:

I am missing a functional specification of what we are trying to achieve. If we want to store an accurate execution time for boefjes and normalizers (which, as far as I understand it, is the primary reason for this change), we would first need to define what we mean by that and what we want to store. We will also be using the new Docker/Kubernetes runners soon, and I think it would be good to take those into account.

These are just some ideas; I think we need to brainstorm and discuss this more:

  • With the Docker/Kubernetes runner, if the container executes one task, the execution time might be the time the container has run.
  • If it is a long-running container that executes multiple tasks, we can't use the container execution time. One way would be to have the container report the execution time back when submitting the boefje results. We would have to trust the container to do this correctly, but I don't think that would be a big problem in practice.
  • Another option would be to calculate the time between the input being fetched from the boefjes API and the output being submitted. This might be less accurate if there is latency between the boefjes API and the container, or if the boefjes API is overloaded. If we want to add batching to the boefjes API, this would also not work.
  • For the current boefje/normalizer runner, we might take the execution time of the run function.

I think doing status-update queries from the runner to the scheduler and then storing timestamps won't give the most accurate execution time, because it includes the latency of those requests. For very simple boefjes and most normalizers, the overhead might even exceed the execution time itself. The latency might also be higher if, in the future, runners live inside different internal networks. If the scheduler and/or database is overloaded, the information would also be less accurate.

As currently implemented, we also include the time it takes to fetch OOIs or raw files and to save raw files or OOIs, which I'm not sure we should count as task duration.

I think the best solution would be for the runner to be responsible for calculating the execution time and providing it to the scheduler to store.
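
A minimal sketch of runner-side timing under that suggestion; the PATCH endpoint, payload shape, and base URL are assumptions, not the PR's actual API. time.monotonic() is used because it is immune to wall-clock adjustments:

```python
import time

import httpx

SCHEDULER_API = "http://scheduler:8000"  # assumed base URL


def run_and_report(task_id: str, boefje) -> None:
    start = time.monotonic()
    boefje.run()  # the work being measured
    duration_ms = int((time.monotonic() - start) * 1000)

    # Hypothetical status-update call; the real endpoint may differ.
    httpx.patch(
        f"{SCHEDULER_API}/tasks/{task_id}",
        json={"status": "completed", "duration_ms": duration_ms},
    )
```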

Database-wise, I think this can just be a 32-bit integer storing the execution time in milliseconds. That gives a resolution of 1 ms and a maximum of about 24.8 days, which should be enough. That way we would only need 4 bytes instead of the 16 bytes of the PostgreSQL interval type, which has far finer resolution and a far larger range than we need.
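
A quick sanity check on that range claim, in plain Python:

```python
# A signed 32-bit integer holds at most 2**31 - 1 milliseconds.
max_ms = 2**31 - 1
print(max_ms / (1000 * 60 * 60 * 24))  # ~24.86 days
```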

With regard to the other timestamps being added: do we really need them all? It won't be the biggest bottleneck, but if we want to scale to millions of tasks, I think it is important not to make the rows larger than necessary. The existing table design would also need improvements to achieve that, but this shouldn't stop us from doing new things in the most efficient way.

pending_at and created_at will always be the same, so I don't think a pending_at is useful. I'm also not sure how useful it is to store queued_at and dispatched_at for every task. Thinking from a user perspective, the interesting data are:

  • When the task was created (the created_at timestamp)
  • When the task was executed (the running_at timestamp; maybe executed_at would be a better name?)
  • How long the task execution took (duration in ms)

While the other timestamps might be useful for debugging what is going on inside the scheduler, I'm not sure that is reason enough to store them for every task on production systems. For debugging we also already have debug logging, with which we can see when a task is queued, fetched by the runner, and so on.

@@ -470,6 +474,16 @@ def patch_task(self, task_id: str, item: dict) -> Any:
            detail="failed to update task",
        ) from exc

    # Retrieve updated task
    try:
        updated_task = self.ctx.datastores.task_store.get_task_by_id(task_id)
Contributor:

I don't understand what you are trying to achieve by fetching the just-saved task from the database.

@dekkers (Contributor) commented Apr 8, 2024

As discussed (sorry for writing this up so late):

The duration the user wants to see is the execution runtime of the boefje/normalizer job. The only way to get this is to have the runner provide the information.

The duration needs to be sortable over a large number of tasks, because a user might want to sort tasks by duration to see which ones took a long time to run. So we should store the data in a way that is easy for PostgreSQL to sort. The duration can be stored most efficiently as milliseconds in a 32-bit integer. This gives a maximum of 2,147,483,648 milliseconds, or about 24.8 days, which should be more than enough.

There was also discussion about whether to store all the data in the task table or a separate table.

@dekkers (Contributor) commented Apr 8, 2024

There was also discussion about whether to store all the data in the task table or a separate table.

I think the duration and other task data we might collect in the future, such as CPU and memory usage, are best added to the task table itself. The reason is that we never update this data separately from the task table: when the runner returns this data, it does so at the same time as signalling that the task is done, so we already run an update query on the task table to change the status. It is more efficient to save the other data in that same update query than to do two update queries.

It also isn't useful to query this data without the task information, so in practice you would always join on the task table; for querying, too, there doesn't seem to be an advantage in storing the data in a separate table. The task also runs only once, so there is no 1:n relationship that would require a separate table.
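
A sketch of that single-update idea in SQLAlchemy terms; TaskDB, duration_ms, and the column names are illustrative assumptions, not the scheduler's actual model:

```python
from sqlalchemy import update
from sqlalchemy.orm import Session


def finish_task(session: Session, task_id: str, duration_ms: int) -> None:
    # One statement sets the terminal status and the runner-reported
    # duration together, avoiding a second round-trip UPDATE.
    stmt = (
        update(TaskDB)  # TaskDB: the (assumed) ORM model of the task table
        .where(TaskDB.id == task_id)
        .values(status="completed", duration_ms=duration_ms)
    )
    session.execute(stmt)
    session.commit()
```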

@jpbruinsslot (author) commented Apr 9, 2024

Considerations:

  • Do we want, or expect, to also track and query the durations of statuses other than RUNNING? If not, we can add a single duration column to the task table. If so, we need to consider whether to keep those status durations in the task table or to normalize them so they don't clutter the task table.
  • CPU/memory usage metrics seem specific to a boefje runner and might not apply to other task runners. Additionally, we're not considering metrics like disk usage and network usage, which might likewise not be relevant to every runner. Given the agnostic nature of the scheduler, we need to evaluate whether to normalize those metrics fields as well. I suggest we discuss this in a separate ticket, since it also raises UX questions: how is this going to be used, queried, and visualized, and are other tools that track these metrics better suited than integrating this into the scheduler?

pending_at: datetime | None = Field(default_factory=lambda: datetime.now(timezone.utc))
queued_at: datetime | None = None
dispatched_at: datetime | None = None
running_at: datetime | None = None
Contributor:

started_at?


created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
modified_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
Contributor:

Is this needed if the only state changes are for the status that you use the ..._at fields for?

meta: dict | None = None

# Status transition timestamps
pending_at: datetime | None = Field(default_factory=lambda: datetime.now(timezone.utc))
Contributor:

This is just the same as created_at? Could we reduce this to:

created_at
queued_at
started_at
ended_at

The latter two being set by the runner?

Contributor:

This way you know:

  • how long tasks take to complete from queueing (ended_at - queued_at)
  • when the task actually started (started_at)
  • the runtime (ended_at - started_at)
  • the complete lifecycle (ended_at - created_at), for throughput analysis
  • the queueing time (started_at - queued_at (?))

Also, I'd imagine that for long-running tasks you'd be interested in when they started, not when they were queued. That is, I'd say, a valid use case for the started_at/ended_at approach versus storing only runtimes, unless you want to infer it through modified_at. The end times in general would also be a relevant column to analyse.
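
For concreteness, a small sketch of the durations derivable from the four proposed timestamps; the function and its keys are illustrative only:

```python
from datetime import datetime


def derived_metrics(created_at: datetime, queued_at: datetime,
                    started_at: datetime, ended_at: datetime) -> dict:
    return {
        "queue_time": started_at - queued_at,      # waiting on the queue
        "runtime": ended_at - started_at,          # actual execution
        "time_to_complete": ended_at - queued_at,  # queueing until done
        "lifecycle": ended_at - created_at,        # full throughput window
    }
```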

Contributor:

What I just thought of: I guess the default sorting is going to be modified_at in the setup where we only save runtimes?

@jpbruinsslot marked this pull request as draft on May 14, 2024 07:27
Labels
mula Issues related to the scheduler

Successfully merging this pull request may close these issues: Task events for scheduler

8 participants