rebuild `reaper` functionality in `thrall` AND remove old `reaper` lambda #4145
def doBatchSoftReap(count: Int): Action[AnyContent] = batchDeleteWrapper(count)(doBatchSoftReap)

def doBatchSoftReap(count: Int, deletedBy: String, isReapable: ReapableEligibility): Future[JsValue] = persistedBatchDeleteOperation("soft"){
Optional for this PR, but I'd suggest building up a logMarker with the request ID (example: https://github.com/guardian/grid/blob/main/cropper/app/controllers/CropperController.scala#L55), passing it around and feeding it into the log statements; it may make it easier to track what happened in a given request. (In normal operation there won't be much ambiguity because calls are 15 mins apart, but it would let you quickly filter out any other thrall logs, and it would be useful if we make a bunch of calls in quick succession to manually clear a backlog.)
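To illustrate the suggestion, here is a minimal, self-contained sketch of threading a request-scoped marker through log statements. `RequestLogMarker`, `ReaperLogging`, `markerFor`, `logLine` and `softReapBatch` are all hypothetical names invented for this example; they are not the Grid's actual logging API, which the linked `CropperController` shows in real use.

```scala
// Hypothetical stand-in for a structured log marker; NOT the Grid's real logging classes.
final case class RequestLogMarker(fields: Map[String, String]) {
  // Returns a new marker with one extra key/value pair attached.
  def withField(key: String, value: String): RequestLogMarker =
    copy(fields = fields + (key -> value))

  // Renders the fields as "key=value" pairs, sorted by key for stable output.
  def render: String =
    fields.toSeq.sortBy(_._1).map { case (k, v) => s"$k=$v" }.mkString(" ")
}

object ReaperLogging {
  // Build the marker once per request, seeded with the request ID.
  def markerFor(requestId: String): RequestLogMarker =
    RequestLogMarker(Map("requestId" -> requestId))

  // Formats a log line; a real implementation would pass the marker to the logger instead.
  def logLine(message: String)(implicit marker: RequestLogMarker): String =
    s"[${marker.render}] $message"

  // Hypothetical batch operation: every log line carries the same request marker.
  def softReapBatch(ids: Seq[String])(implicit marker: RequestLogMarker): Seq[String] = {
    val batchMarker = marker.withField("operation", "soft-reap")
    ids.map(id => logLine(s"soft deleting $id")(batchMarker))
  }
}
```

Because the marker is passed implicitly, each helper that logs only needs an extra implicit parameter list, and filtering logs by `requestId` then isolates a single reaper call.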
Seen on auth, usage, image-loader, metadata-editor, thrall, leases, cropper, collections, media-api, kahuna (merged by @twrichards 13 minutes and 38 seconds ago) Please check your changes!
https://trello.com/c/DrGAH8Y0/893-turn-on-the-reaper
Once upon a time, there was a process (in the form of a lambda) called 'the reaper' which deleted images on a regular schedule according to a list of criteria, but it was turned off out of caution after a significant chunk of images were permanently lost some years ago. This PR rebuilds 'the reaper', this time all within `thrall`.

Pre-requisite PRs:
- `common-lib` (along with supporting classes) #4143
- `SoftDeletedMetadataTable` to `common-lib` #4144

What's changed
- `ThrallConfig` gains a config property (`s3.reaper.bucket` in `thrall.conf`) to specify the name of the bucket where the permanent records of what was soft & hard deleted via the reaper will be stored (see https://github.com/guardian/editorial-tools-platform/pull/706 for Guardian). Defining this property is required for the reaper to operate.
- Two endpoints in `thrall`, both taking a `count` query param (the batch size, max 1000):
  - `doBatchSoftReap`, which 'soft deletes' (with deletedBy being `reaper`) the oldest batch of `is:reapable` images which are not already soft deleted
  - `doBatchHardReap`, which 'hard deletes' the oldest batch of `is:reapable` images which have been in the 'soft deleted' state for at least two weeks
- `ReaperController`, which contains the above endpoints, also has a 'schedule' (every 15 mins) which runs only if the `s3.reaper.bucket` config property is defined:
  - it calls `doBatchSoftReap` and `doBatchHardReap` with the count set to the number of images ingested per 15 mins; this ensures we delete at the same rate we ingest for a given environment (at the Guardian, our `TEST` environment ingests roughly 1% of what `PROD` ingests)
  - it can be paused via a `PAUSED` file at the root of the new bucket (`s3.reaper.bucket`). This is checked on each execution of the schedule, which exits early (with a log message) if paused.
- `ReaperController` provides a few more endpoints:
  - `POST` endpoint for pausing (creating that `PAUSED` file as described above)
  - `POST` endpoint for resuming from the paused state (deleting that `PAUSED` file as described above)
  - `GET` endpoint for reading a record file from the bucket

NOTE: we have the endpoints exposed (in addition to being called by the schedule) so that they can be called manually, for example to clear a backlog if the reaper hasn't been running for whatever reason (either historically or because it was paused using the functionality above).
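
The decision logic described above (no bucket configured means no run, a `PAUSED` file means an early exit, otherwise reap at the ingest rate, and hard deletion only after two weeks in the soft-deleted state) can be sketched as pure functions. This is an illustrative sketch only, not the code in this PR: `ReaperSketch`, `TickOutcome`, `decideTick`, `clampCount` and `isHardReapable` are hypothetical names, and capping the scheduled count at the endpoint maximum of 1000 is an assumption, since the description does not say how larger counts are handled.

```scala
import java.time.{Duration, Instant}

object ReaperSketch {
  val MaxBatchSize: Int = 1000                              // "max 1000" from the description
  val SoftDeleteGracePeriod: Duration = Duration.ofDays(14) // "at least two weeks"

  // Possible results of one scheduled execution (hypothetical modelling).
  sealed trait TickOutcome
  case object NotConfigured extends TickOutcome // s3.reaper.bucket is not defined
  case object Paused extends TickOutcome        // PAUSED file found in the bucket
  final case class Reap(count: Int) extends TickOutcome

  // ASSUMPTION: counts above the maximum are capped rather than rejected.
  def clampCount(requested: Int): Int =
    math.max(0, math.min(requested, MaxBatchSize))

  // Decides what a single 15-minute tick should do.
  // `pausedFileExists` abstracts the bucket lookup so the logic stays testable.
  def decideTick(
    reaperBucket: Option[String],
    pausedFileExists: String => Boolean,
    ingestedInLast15Mins: Int
  ): TickOutcome =
    reaperBucket match {
      case None                                     => NotConfigured
      case Some(bucket) if pausedFileExists(bucket) => Paused
      case Some(_)                                  => Reap(clampCount(ingestedInLast15Mins))
    }

  // An image qualifies for hard deletion once it has been soft deleted for at least two weeks.
  def isHardReapable(softDeletedAt: Instant, now: Instant): Boolean =
    !softDeletedAt.plus(SoftDeleteGracePeriod).isAfter(now)
}
```

Keeping the bucket check behind a function parameter means the pause and configuration rules can be exercised without touching S3; the real controller would supply a lookup against the configured bucket.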