Add deduplication RFD #4051

dekkers · 2025-01-28T09:56:00Z

Changes

Add deduplication RFD

Issue link

You have to create an issue to link to this PR. If this really is not possible, write a very detailed description here and add this PR to the project board directly.

Please add the link to the issue after "Closes".

Closes ...

Demo

Please add some proof in the form of screenshots or screen recordings to show (off) new functionality, if there are interesting new features for end-users.

QA notes

Please add some information for QA on how to test the newly created code.

Code Checklist

All the commits in this PR are properly PGP-signed and verified.
This PR only contains functionality relevant to the issue.
I have written unit tests for the changes or fixes I made.
I have checked the documentation and made changes where necessary.
I have performed a self-review of my code and refactored it to the best of my abilities.

Tickets have been created for newly discovered issues.
For any non-trivial functionality, I have added integration and/or end-to-end tests.
I have informed others of any required .env file changes and updated the .env-dist accordingly.
I have included comments in the code to elaborate on what is not self-evident from the code itself, including references to issues and discussions online, or implicit behavior of an interface.

Checklist for code reviewers:

Copy-paste the checklist from the docs/source/templates folder into your comment.

Checklist for QA:

Copy-paste the checklist from the docs/source/templates folder into your comment.

sonarqubecloud · 2025-01-28T10:02:20Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

underdarknl · 2025-01-28T10:02:54Z

rfd/0003-deduplication.md

+in the same KAT install earlier while the user might not have access to this
+organization. This might be a problem for certain usage of OpenKAT, so
+deduplication should be a setting that can be turned off.
+


We should mention that we can only reuse raw files from previous scans if they are not older than the requested interval in the requesting organization.

underdarknl · 2025-01-28T10:05:53Z

rfd/0003-deduplication.md

+
+#### Usage of environment field
+
+This is not used directly from the BoefjeMeta, but this


Missing end of sentence

underdarknl · 2025-01-28T10:06:46Z

rfd/0003-deduplication.md

+
+#### Usage of started_at, endated_at, boefje and runnable_hash fields
+
+This isn't used by any boefje and can be safely removed.


The runnable hash should not be made available to the boefje itself. However, these fields are used, or can be used, by the rest of OpenKAT.


underdarknl · 2025-01-28T10:09:02Z

rfd/0003-deduplication.md

+#### Environment settings from previous task in other organization
+
+This data is stored in bytes and needs to be fetched by the scheduler to
+determine whether the environment settings are the same.


Or we could compare the runnable hash, which should contain all these fields and settings.
The problem, however, is that while the scheduler can look up the previous runnable hash, it cannot compute the current runnable hash, since it cannot see the current boefje worker's environment settings.


underdarknl · 2025-01-28T10:11:55Z

rfd/0003-deduplication.md

+deduplication is turned on and environment variables are passed to the boefje
+runner we need to return an error because the user needs to know that the passed
+variables aren't used.
+


I think another option would be to populate a list of previous valid runnable hashes and check against those in the worker.
E.g.:

Here's a new job: boefje + input + settings.

We have the following previous raw files on file: [runnable_hash1, runnable_hash2].

The worker then computes its own runnable hash from the given job plus its local env vars, and only proceeds to execute when the list of previous hashes does not contain the runnable_hash for its current job.
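The worker-side check proposed above could look roughly like the sketch below. Everything in it is hypothetical: `compute_job_hash`, `should_execute`, the boefje id, and the choice to hash a JSON dump of the job plus env vars are illustrations only, and this is not how OpenKAT currently computes its `runnable_hash`.

```python
import hashlib
import json


def compute_job_hash(boefje_id: str, input_ooi: str, settings: dict[str, str], env: dict[str, str]) -> str:
    """Hypothetical hash over everything that can influence the boefje output."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"boefje": boefje_id, "input": input_ooi, "settings": settings, "env": env},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_execute(job_hash: str, previous_hashes: list[str]) -> bool:
    """Run the job only if no previous raw file was produced with the same hash."""
    return job_hash not in set(previous_hashes)


# The scheduler would send along the hashes of previous raw files;
# the worker compares them with the hash it computes locally.
local_env = {"TIMEOUT": "30"}
previous = [compute_job_hash("dns-records", "Hostname|internet|example.com", {}, local_env)]

same_env = should_execute(
    compute_job_hash("dns-records", "Hostname|internet|example.com", {}, local_env), previous
)
other_env = should_execute(
    compute_job_hash("dns-records", "Hostname|internet|example.com", {}, {"TIMEOUT": "60"}), previous
)
```

With identical local env vars the worker skips execution (`same_env` is `False`); with a different value it runs the job (`other_env` is `True`).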

Donnype · 2025-01-29T10:07:02Z

rfd/0003-deduplication.md

+When doing deduplication between organizations it would be possible for a user
+to see that a boefje task for a certain OOI was already done for an organization
+in the same KAT install earlier while the user might not have access to this
+organization. This might be a problem for certain usage of OpenKAT, so


Is this true when we "copy" over the Bytes file with a new valid time? Or are we not considering this as an option?

Donnype · 2025-01-29T10:07:32Z

rfd/0003-deduplication.md

+organization by reusing the raw file of the earlier boefje task. It is not clear
+if we can easily prevent having to save the raw file again for the other
+organization. Not saving again would be a nice to have for now and an
+optimization we can do later.


Donnype · 2025-01-29T10:34:31Z

rfd/0003-deduplication.md

+#### Usage of organization field
+
+The external DB boefje currently needs the organization to fetch the data from
+the external database. Obviously this means the boefje can't be deduplicated.


This paragraph got me thinking about some edge cases.

To formulate the deduplication requirements more generically: we can deduplicate boefje jobs across organizations if and only if

1. Their inputs (input OOI, arguments, and the environment set through their settings) are identical

2. The output of the boefje only depends on its input

Note that the external DB boefje defines DB_ORGANIZATION_IDENTIFIER through its settings, meaning that we would not have to special-case it since the inputs are not identical: as you mention in the Functional requirements, we already cannot deduplicate these jobs.

Also note that I think we have assumed that the second one is "true enough" given that two jobs run within a small enough interval. However, this will not hold once we have boefje instances running in separate networks. We will have to take this into consideration once we have a way to define in which environment/network jobs can run.
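The first condition can be sketched as a comparison key that deliberately leaves out the organization. The `Job` class, `dedup_key`, `can_deduplicate`, and the boefje ids are hypothetical names for illustration, not OpenKAT types. The second condition (the output depends only on the input) cannot be derived from the job data and is assumed to hold here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Job:
    """Hypothetical, simplified view of a boefje job."""

    organization: str
    boefje_id: str
    input_ooi: str
    settings: dict[str, str] = field(default_factory=dict)


def dedup_key(job: Job) -> tuple:
    # The organization is intentionally excluded: two jobs from different
    # organizations are candidates for deduplication when everything else matches.
    return (job.boefje_id, job.input_ooi, tuple(sorted(job.settings.items())))


def can_deduplicate(a: Job, b: Job) -> bool:
    """Condition 1 only: the inputs, including settings, are identical."""
    return dedup_key(a) == dedup_key(b)


# Plain DNS jobs for the same hostname in two organizations share a key.
dns_a = Job("org-a", "dns-records", "Hostname|internet|example.com")
dns_b = Job("org-b", "dns-records", "Hostname|internet|example.com")

# The external DB boefje carries the organization in its settings,
# so the keys differ without any special-casing.
db_a = Job("org-a", "external-db", "Network|internet", {"DB_ORGANIZATION_IDENTIFIER": "org-a"})
db_b = Job("org-b", "external-db", "Network|internet", {"DB_ORGANIZATION_IDENTIFIER": "org-b"})
```

Here `can_deduplicate(dns_a, dns_b)` is `True`, while `can_deduplicate(db_a, db_b)` is `False` because the settings differ.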

Donnype · 2025-01-29T10:36:16Z

rfd/0003-deduplication.md

+
+#### Usage of started_at, endated_at, boefje and runnable_hash fields
+
+This isn't used by any boefje and can be safely removed.


I think "remove" means "remove from the boefje arguments" here.

Donnype · 2025-01-29T10:45:21Z

rfd/0003-deduplication.md

+#### Environment settings from previous task in other organization
+
+This data is stored in bytes and needs to be fetched by the scheduler to
+determine whether the environment settings are the same.


The runnable hash only contains the hash of the directory files

Donnype · 2025-01-29T11:33:57Z

rfd/0003-deduplication.md

+- The scheduler queries the stored tasks to see if a different organization has
+  already run on this input OOI.


What is the time window for fetching those tasks? Reading this algorithm strictly, no job would ever rerun.
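One way to make the window explicit is to bound the lookup by a cutoff, for example the requesting organization's scan interval. The sketch below is an assumption about how that could look; `StoredTask`, `find_reusable_task`, and the in-memory list stand in for whatever the scheduler actually stores and queries.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class StoredTask:
    """Hypothetical record of a finished boefje task."""

    organization: str
    boefje_id: str
    input_ooi: str
    finished_at: datetime


def find_reusable_task(
    tasks: list[StoredTask],
    organization: str,
    boefje_id: str,
    input_ooi: str,
    window: timedelta,
    now: datetime,
) -> Optional[StoredTask]:
    """Return the most recent matching task from another organization within the window."""
    cutoff = now - window
    candidates = [
        task
        for task in tasks
        if task.organization != organization
        and task.boefje_id == boefje_id
        and task.input_ooi == input_ooi
        and task.finished_at >= cutoff
    ]
    # Without the cutoff, any earlier task would match and the job would never rerun.
    return max(candidates, key=lambda task: task.finished_at, default=None)


now = datetime(2025, 1, 29, 12, 0, tzinfo=timezone.utc)
tasks = [
    StoredTask("org-b", "dns-records", "Hostname|internet|example.com", now - timedelta(hours=2)),
    StoredTask("org-c", "dns-records", "Hostname|internet|example.com", now - timedelta(hours=48)),
]

within_day = find_reusable_task(
    tasks, "org-a", "dns-records", "Hostname|internet|example.com", timedelta(hours=24), now
)
within_hour = find_reusable_task(
    tasks, "org-a", "dns-records", "Hostname|internet|example.com", timedelta(hours=1), now
)
```

With a 24-hour window the task from `org-b` is reused; with a one-hour window nothing qualifies and the job runs again.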

Add deduplication RFD

0a1f4c9

dekkers requested a review from a team as a code owner January 28, 2025 09:56

underdarknl reviewed Jan 28, 2025

Donnype reviewed Jan 29, 2025
