Add deduplication RFD #4051
in the same KAT install earlier while the user might not have access to this
organization. This might be a problem for certain usage of OpenKAT, so
deduplication should be a setting that can be turned off.
We should mention that we can only use previous scan raw files if they are not older than the requested interval in the requesting organization.
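A minimal sketch of that freshness rule, for illustration only. The names `can_reuse_raw`, `previous_ended_at` and `requested_interval` are hypothetical and not part of OpenKAT:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def can_reuse_raw(
    previous_ended_at: datetime,
    requested_interval: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if an earlier raw file is recent enough to be reused.

    `requested_interval` is the interval configured in the *requesting*
    organization, not in the organization that produced the raw file.
    """
    now = now or datetime.now(timezone.utc)
    return now - previous_ended_at <= requested_interval
```

With a 24 hour interval, a raw file from one hour ago would be reused, while one from 25 hours ago would trigger a fresh run.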
#### Usage of environment field
This is not used directly from the BoefjeMeta, but this
Missing end of sentence
#### Usage of started_at, ended_at, boefje and runnable_hash fields
This isn't used by any boefje and can be safely removed.
The runnable hash should not be made available to the boefje itself. However, these fields are used, or can be used, by the rest of OpenKAT.
I think "remove" means "remove from the boefje arguments" here.
#### Environment settings from previous task in other organization
This data is stored in bytes and needs to be fetched by the scheduler to
determine whether the environment settings are the same.
Or, we could compare the runnable hash? Which should contain all these fields / settings.
The problem, however, is that while the scheduler can look up the previous runnable hash, it cannot compute the current runnable hash, since it cannot see the current boefje worker's environment settings.
The runnable hash only contains the hash of the directory files.
deduplication is turned on and environment variables are passed to the boefje
runner we need to return an error because the user needs to know that the passed
variables aren't used.
I think another option would be to populate a list of previous valid runnable hashes, and check against those in the worker.
For example:
- Here's a new job: boefje + input + setting.
- We have the following previous raw files on file [runnable_hash1, runnable_hash2]
- The worker then computes its own runnable hash from the given job + local env vars, and only proceeds to execute when the list of available jobs does not contain the runnable_hash for its current job.
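A rough sketch of the worker-side check described above. The hashing scheme here is an assumption made for illustration: elsewhere in this thread it is noted that the existing runnable hash only covers the directory files, so `compute_job_hash` and `should_execute` are hypothetical helpers, not existing OpenKAT functions.

```python
import hashlib
import json
from typing import Iterable, Mapping


def compute_job_hash(boefje_id: str, input_ooi: str, env: Mapping[str, str]) -> str:
    """Hash the job together with the worker's local environment variables.

    Hypothetical scheme: a SHA-256 over a canonical JSON encoding, so the
    result does not depend on the ordering of the environment keys.
    """
    payload = json.dumps(
        {"boefje": boefje_id, "input_ooi": input_ooi, "env": dict(env)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_execute(current_hash: str, previous_hashes: Iterable[str]) -> bool:
    """Only run the job when no previous raw file matches the current hash."""
    return current_hash not in set(previous_hashes)
```

The scheduler would attach the list of previous hashes to the job, and the worker would skip execution when its own hash is already in that list.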
When doing deduplication between organizations it would be possible for a user
to see that a boefje task for a certain OOI was already done for an organization
in the same KAT install earlier while the user might not have access to this
organization. This might be a problem for certain usage of OpenKAT, so
Is this true when we "copy" over the Bytes file with a new valid time? Or are we not considering this as an option?
organization by reusing the raw file of the earlier boefje task. It is not clear
if we can easily prevent having to save the raw file again for the other
organization. Not saving again would be a nice to have for now and an
optimization we can do later.
👍
#### Usage of organization field
The external DB boefje currently needs the organization to fetch the data from
the external database. Obviously this means the boefje can't be deduplicated.
This paragraph got me thinking about some edge cases.
To formulate the deduplication requirements more generically: we can deduplicate boefje jobs across organizations if and only if
- Their inputs (input OOI, arguments, and the environment set through their settings) are identical
- The output of the boefje only depends on its input
Note that the external DB boefje defines DB_ORGANIZATION_IDENTIFIER through its settings, meaning that we would not have to special-case it since the inputs are not identical: as you mention in the Functional requirements, we already cannot deduplicate these jobs.
Also note that I think we have assumed that the second one is "true enough" given that two jobs run within a small enough interval. However, this will not hold once we have boefje instances running in separate networks. We will have to take this into consideration once we have a way to define in which environment/network jobs can run.
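The two conditions could be expressed roughly as follows. This is only a sketch: `JobInputs` and `can_deduplicate` are hypothetical names, and whether a boefje's output depends only on its input is assumed to be declared somewhere rather than detected.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class JobInputs:
    """Everything a boefje run receives (hypothetical model)."""

    input_ooi: str
    arguments: Dict[str, str] = field(default_factory=dict)
    settings: Dict[str, str] = field(default_factory=dict)


def can_deduplicate(a: JobInputs, b: JobInputs, output_depends_only_on_input: bool) -> bool:
    """Both jobs are assumed to be for the same boefje.

    Deduplication is allowed only when the inputs are identical and the
    boefje's output is a function of those inputs alone.
    """
    return output_depends_only_on_input and a == b


# The external DB boefje gets a different DB_ORGANIZATION_IDENTIFIER per
# organization, so its inputs differ and no special case is needed.
org_a = JobInputs("Network|internet", settings={"DB_ORGANIZATION_IDENTIFIER": "org-a"})
org_b = JobInputs("Network|internet", settings={"DB_ORGANIZATION_IDENTIFIER": "org-b"})
dedup_allowed = can_deduplicate(org_a, org_b, output_depends_only_on_input=True)
```

Here `dedup_allowed` is `False` because the settings differ between the two organizations.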
- The scheduler queries the stored tasks to see if a different organization has
  already run on this input OOI.
What is the time window for fetching those tasks? Reading this algorithm strictly, no job would ever rerun.
Changes
Add deduplication RFD
Issue link
You have to create an issue to link to this PR. If this really is not possible, write a very detailed description here and add this PR to the project board directly.
Please add the link to the issue after "Closes".
Closes ...