This repository has been archived by the owner on Oct 19, 2023. It is now read-only.

Delete data and anonymize the remaining records #69

Open
Tjitse-E opened this issue Apr 26, 2021 · 8 comments

@Tjitse-E
Contributor

The idea is to delete the older customer data (for example, customers created more than 30 days ago), so that the DB dump is much smaller and Masquerade runs faster. The remaining data should be anonymized so we can use it anywhere.

Example config:

  customer_grid_flat:
    provider:
      delete: true
      where: "`created_at` < now() - interval 30 day"
    columns:
      name:
        formatter: name
      email:
        formatter: email
        unique: true
        nullColumnBeforeRun: true
      dob:
        formatter: dateTimeThisCentury
        optional: true
      billing_full:
         ....

Currently, Masquerade just executes the delete and then moves on to the next table, leaving the remaining records in the table un-anonymized. That is logical given the current design, but it would be nice to have the possibility to delete AND anonymize.

What would be the best place to implement this feature?

@peterjaap
Contributor

Maybe @johnorourke might have an idea about this, since he built the delete part?

@johnorourke
Contributor

@peterjaap The original design for that was "you can either delete or anonymize, not both", but this is a good idea. We have several possible requirements:

  • delete a selection of records
  • anonymize a selection of records
  • both (perhaps with different 'where' statements)
  • none

So for maximum flexibility maybe we need to just allow different 'where' statements for anonymisation and deletion. However, delete: true previously switched off anonymisation!

Perhaps this approach:

  • delete_where to specify the records to be deleted
  • anonymize_where to specify the records to be anonymized
  • where would fill in both of those - which keeps backwards compatibility
  • The system would run the delete first (if delete:true), then the anonymize - exactly as it does now.
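As a rough sketch, the proposal above could look like this in a config file. The keys `delete_where` and `anonymize_where` are the names suggested in this comment and are not confirmed as part of Masquerade; the 30-day conditions are only illustrative.

```yaml
customer_grid_flat:
  provider:
    delete: true
    # Proposed key (assumption): rows matching this condition are deleted first
    delete_where: "`created_at` < now() - interval 30 day"
    # Proposed key (assumption): rows matching this condition are anonymized afterwards
    anonymize_where: "`created_at` >= now() - interval 30 day"
  columns:
    name:
      formatter: name
    email:
      formatter: email
      unique: true
```

Under the same proposal, a plain `where` would populate both keys, so existing configs would keep their current behaviour.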

@IvanChepurnyi I can see your work on the DataProvider system, so it would be good to get your input on this. Should we avoid backward compatibility and go for a generic "actions" config, instead of using implicit actions? It's a balance between easy config with "sensible defaults", the learning curve for new users, and reducing unexpected behaviour.

@Tjitse-E
Contributor Author

@johnorourke I'm currently using delete_where in our builds (master...Tjitse-E:feature/partial-delete). The only problem is that it is not backwards compatible, but this could be solved (if needed) by keeping where.

Adding both delete_where and anonymize_where seems like a good idea.

@IvanChepurnyi
Contributor

IvanChepurnyi commented Apr 30, 2021

@johnorourke I like your approach. If where is used for both delete and anonymize, it won't break existing behavior: the anonymization step will simply affect 0 rows, because the matching rows were already deleted.

There is probably an opportunity to hide this logic behind the TableConfigution class as checks for provider/where become quite complex. I will work on this issue next week.

@SAN1TAR1UM

Watching this, as I'm also interested in this feature. Until then, is it possible to run masquerade twice with two different configs?

I'm thinking I can run the anonymization, then export a fully anonymized backup.
Then come back and run the delete on the same DB, then export a thin backup.

The only problem is that I need two different config file setups for this, correct? I guess I could run two different phars, each with its own config, but that doesn't seem very elegant.

@johnorourke
Contributor

@SAN1TAR1UM the --config parameter (which gives it a directory of config yml files) can be used multiple times, so you can use the same phar but just add an extra set of configs for one of the runs.
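A sketch of what the two sets of configs might contain, reusing the table and formatters from the first post. The directory and file names (`config-anonymize/`, `config-delete/`, `customer.yaml`) are hypothetical. The anonymizing run would be pointed at the first directory only, and the thin-backup run would get the second directory as an extra `--config`.

```yaml
# config-anonymize/customer.yaml (hypothetical path): anonymize only
customer_grid_flat:
  columns:
    name:
      formatter: name
    email:
      formatter: email
      unique: true
---
# config-delete/customer.yaml (hypothetical path): added only for the thin-backup run
customer_grid_flat:
  provider:
    delete: true
    where: "`created_at` < now() - interval 30 day"
```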

@peterjaap added this to the sdf milestone Nov 11, 2021

@mehdichaouch

mehdichaouch commented Mar 14, 2023

I read the whole thread and have a question: why not just have this in the YAML file:

  customer_grid_flat:
    provider:
      delete: true
      where: "`created_at` < now() - interval 30 day"

and this below:

  customer_grid_flat:
    columns:
      name:
        formatter: name
      email:
        formatter: email
        unique: true
        nullColumnBeforeRun: true
      dob:
        formatter: dateTimeThisCentury
        optional: true
      billing_full:
      ...

First block will clean, second one will anonymize?

@johnorourke
Contributor

First block will clean, second one will anonymize?

@mehdichaouch I think multiple configs for the same table are not combined: the latest one wins, because array_merge is used here: https://github.com/elgentos/masquerade/blob/master/src/Elgentos/Masquerade/Helper/Config.php#L80
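To illustrate, assume the configs really are combined with a shallow, per-table merge, as the `array_merge` call suggests (not verified against the loader here). The two blocks from the question would then collapse to just the later one:

```yaml
# Effective config under that assumption: the later definition replaces the earlier one
customer_grid_flat:
  columns:
    name:
      formatter: name
    email:
      formatter: email
      unique: true
# The earlier provider block (delete: true and its where clause) is no longer present,
# so nothing would be deleted for this table.
```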
