Add deduplication RFD #4051

Status: Open, 1 commit to merge into base: main

Conversation

dekkers
Contributor

@dekkers dekkers commented Jan 28, 2025

Changes

Add deduplication RFD

Issue link

You have to create an issue to link to this PR. If this really is not possible, write a very detailed description here and add this PR to the project board directly.

Please add the link to the issue after "Closes".

Closes ...

Demo

Please add some proof in the form of screenshots or screen recordings to show (off) new functionality, if there are interesting new features for end-users.

QA notes

Please add some information for QA on how to test the newly created code.


Code Checklist

  • All the commits in this PR are properly PGP-signed and verified.
  • This PR only contains functionality relevant to the issue.
  • I have written unit tests for the changes or fixes I made.
  • I have checked the documentation and made changes where necessary.
  • I have performed a self-review of my code and refactored it to the best of my abilities.
  • Tickets have been created for newly discovered issues.
  • For any non-trivial functionality, I have added integration and/or end-to-end tests.
  • I have informed others of any required changes to .env files and updated the .env-dist accordingly.
  • I have included comments in the code to elaborate on what is not self-evident from the code itself, including references to issues and discussions online, or implicit behavior of an interface.

Checklist for code reviewers:

Copy-paste the checklist from the docs/source/templates folder into your comment.


Checklist for QA:

Copy-paste the checklist from the docs/source/templates folder into your comment.

@dekkers dekkers requested a review from a team as a code owner January 28, 2025 09:56
in the same KAT install earlier while the user might not have access to this
organization. This might be a problem for certain usage of OpenKAT, so
deduplication should be a setting that can be turned off.

Contributor

We should mention that we can only reuse raw files from previous scans if they are not older than the interval requested by the requesting organization.
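
A minimal sketch of that freshness check, assuming each raw file carries a creation timestamp and each organization has a requested scan interval. The function name `can_reuse_raw_file` and its parameters are hypothetical and not part of OpenKAT:

```python
from datetime import datetime, timedelta, timezone


def can_reuse_raw_file(raw_created_at: datetime, requested_interval: timedelta, now: datetime) -> bool:
    """Return True when an earlier raw file is still fresh enough to reuse.

    Hypothetical helper: the raw file may only be reused if it is not older
    than the interval requested by the requesting organization.
    """
    return now - raw_created_at <= requested_interval


# Example: the requesting organization wants a scan at least every 24 hours.
now = datetime(2025, 1, 28, 12, 0, tzinfo=timezone.utc)
fresh = can_reuse_raw_file(now - timedelta(hours=2), timedelta(hours=24), now)
stale = can_reuse_raw_file(now - timedelta(hours=30), timedelta(hours=24), now)
```

Here `fresh` is `True` and `stale` is `False`: only the two-hour-old raw file falls within the 24-hour interval.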


#### Usage of environment field

This is not used directly from the BoefjeMeta, but this
Contributor

Missing end of sentence


#### Usage of started_at, ended_at, boefje and runnable_hash fields

This isn't used by any boefje and can be safely removed.
Contributor

The runnable hash should not be made available to the boefje itself. However, these fields are used, or can be used, by the rest of OpenKAT.

Contributor

I think "remove" means "remove from the boefje arguments" here.

#### Environment settings from previous task in other organization

This data is stored in bytes and needs to be fetched by the scheduler to
determine whether the environment settings are the same.
Contributor

Or could we compare the runnable hash, which should contain all these fields/settings?
The problem, however, is that while the scheduler can look up the previous runnable hash, it cannot compute the current one, since it cannot see the current boefje worker's environment settings.

Contributor

The runnable hash only contains the hash of the directory files

deduplication is turned on and environment variables are passed to the boefje
runner we need to return an error because the user needs to know that the passed
variables aren't used.

Contributor

I think another option would be to populate a list of previous valid runnable hashes and check against those in the worker.
For example:

  • Here's a new job: boefje + input + setting.
  • We have previous raw files on file with the following hashes: [runnable_hash1, runnable_hash2]
  • The worker then computes its own runnable hash from the given job + local env vars, and only proceeds to execute when the list of previous hashes does not contain the runnable_hash for its current job.
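
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the existing OpenKAT implementation: `compute_job_hash` and `should_execute` are hypothetical names, and the hash here covers the job plus the worker's local environment variables, as proposed in this comment.

```python
import hashlib
import json


def compute_job_hash(boefje_id: str, input_ooi: str, settings: dict, local_env: dict) -> str:
    """Hash everything that could influence the boefje output (hypothetical scheme)."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"boefje": boefje_id, "input_ooi": input_ooi, "settings": settings, "env": local_env},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_execute(current_hash: str, previous_hashes: list[str]) -> bool:
    """The worker runs the job only when no earlier raw file matches its hash."""
    return current_hash not in previous_hashes


# Illustrative values only; the identifiers and settings are made up.
current = compute_job_hash(
    "dns-records", "Hostname|internet|example.com", {"RECORD_TYPES": "A,AAAA"}, {"REMOTE_NS": "1.1.1.1"}
)
other_env = compute_job_hash(
    "dns-records", "Hostname|internet|example.com", {"RECORD_TYPES": "A,AAAA"}, {"REMOTE_NS": "8.8.8.8"}
)

skip = not should_execute(current, [current, other_env])  # a matching raw file exists
run = should_execute(current, [other_env])  # only a different environment was seen
```

Because the worker computes the hash itself, the scheduler never needs to see the worker's environment settings; it only passes along the list of hashes it already has on file.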

Comment on lines +14 to +17
When doing deduplication between organizations it would be possible for a user
to see that a boefje task for a certain OOI was already done for an organization
in the same KAT install earlier while the user might not have access to this
organization. This might be a problem for certain usage of OpenKAT, so
Contributor

Is this true when we "copy" over the Bytes file with a new valid time? Or are we not considering this as an option?

organization by reusing the raw file of the earlier boefje task. It is not clear
if we can easily prevent having to save the raw file again for the other
organization. Not saving again would be a nice to have for now and an
optimization we can do later.
Contributor

👍

#### Usage of organization field

The external DB boefje currently needs the organization to fetch the data from
the external database. Obviously this means the boefje can't be deduplicated.
Contributor

This paragraph got me thinking about some edge cases.

To formulate the deduplication requirements more generically: we can deduplicate boefje jobs across organizations if and only if

  1. Their inputs (input OOI, arguments, and the environment set through their settings) are identical
  2. The output of the boefje only depends on its input

Note that the external DB boefje defines DB_ORGANIZATION_IDENTIFIER through its settings, meaning that we would not have to special-case it since the inputs are not identical: as you mention in the Functional requirements, we already cannot deduplicate these jobs.

Also note that I think we have assumed that the second one is "true enough" given that two jobs run within a small enough interval. However, this will not hold once we have boefje instances running in separate networks. We will have to take this into consideration once we have a way to define in which environment/network jobs can run.
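
As a rough illustration of the two conditions above, here is a sketch in which the `JobInputs` type, `make_inputs`, `can_deduplicate`, and all boefje names and values are hypothetical. Only `DB_ORGANIZATION_IDENTIFIER` comes from the discussion itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobInputs:
    """Everything that counts as input for deduplication (hypothetical model)."""

    boefje_id: str
    input_ooi: str
    arguments: tuple  # sorted (name, value) pairs
    environment: tuple  # sorted (name, value) pairs taken from the settings


def make_inputs(boefje_id: str, input_ooi: str, arguments: dict, settings: dict) -> JobInputs:
    # Sorting makes the comparison independent of dict ordering.
    return JobInputs(boefje_id, input_ooi, tuple(sorted(arguments.items())), tuple(sorted(settings.items())))


def can_deduplicate(a: JobInputs, b: JobInputs, output_depends_only_on_input: bool) -> bool:
    """Condition 1: identical inputs. Condition 2: output depends only on input."""
    return output_depends_only_on_input and a == b


# Two organizations running the same boefje with the same settings.
dns_a = make_inputs("dns-records", "Hostname|internet|example.com", {}, {"REMOTE_NS": "1.1.1.1"})
dns_b = make_inputs("dns-records", "Hostname|internet|example.com", {}, {"REMOTE_NS": "1.1.1.1"})

# The external DB boefje differs per organization through its settings.
db_a = make_inputs("external-db", "Network|internet", {}, {"DB_ORGANIZATION_IDENTIFIER": "org-a"})
db_b = make_inputs("external-db", "Network|internet", {}, {"DB_ORGANIZATION_IDENTIFIER": "org-b"})

dns_ok = can_deduplicate(dns_a, dns_b, output_depends_only_on_input=True)
db_ok = can_deduplicate(db_a, db_b, output_depends_only_on_input=True)
```

In this sketch `dns_ok` is `True` and `db_ok` is `False` without any special-casing, because the organization identifier is part of the compared environment. The `output_depends_only_on_input` flag stands in for the second condition, which would stop holding for boefje instances running in separate networks.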


Comment on lines +193 to +194
- The scheduler queries the stored tasks to see if a different organization has
already run on this input OOI.
Contributor

What is the time window for fetching those tasks? Read strictly, this algorithm means no job would ever rerun.

3 participants