
file size limit for upload/ingest (optional and configurable) #4387

Merged
merged 5 commits into from
Dec 23, 2024

Conversation

twrichards
Contributor

@twrichards twrichards commented Dec 17, 2024

Recently the image-loader boxes got completely choked up and the queue grew in size. This was down to multiple whopping great images (e.g. 1GB) sitting in a sweet spot (or rather not-so-sweet spot): too big to process successfully, but not so big as to cause an OoM that would make the box fall over and be recycled (still not great, but at least eventually self-healing). As a result the boxes stayed alive (passing the healthcheck) while never completing the processing of the images.

What does this change?

This PR introduces an optional limit via a new config option upload.limit.mb (in common.conf, as it's used by both kahuna and image-loader).
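For illustration, the new option might look like this in common.conf (the value shown is just an example; omitting the setting disables the limit entirely):

```
# Optional: reject any upload/ingest larger than this many megabytes.
# Leave unset to impose no limit at all.
upload.limit.mb=500
```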

This limit is imposed in two places...

  • In image-loader, as it picks an item off the ingest queue, so as to guard against images from any source (agency, photographer over FTP or UI upload). Doing the check there also means we retain a copy of the offending image in the fail bucket for closer inspection.

  • For UI uploads (kahuna): to avoid burning the planet by wastefully sending all those bytes to S3 only for the server to tell us the file is too big, we alert the user promptly in the client, like so...

[screenshot: the file-too-large warning shown in the kahuna client]

How should a reviewer test this change?

Try dropping an image over 500MB (e.g. download https://media.gutools.co.uk/images/25287e288b9254437e65209b541aa8e8895d571e) into the ingest queue bucket (in a dir with your name on it) and you should see this in the logs (as it's not a UI upload)...
[screenshot: log line showing the oversized file being rejected]
...and the file should end up in the fail bucket.

Try uploading the same file to the UI and you should see a much prompter failure.
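The prompt UI failure comes from checking file sizes in the client before any bytes are sent. A hypothetical sketch of such a pre-upload guard (names and shapes here are illustrative, not the actual kahuna code):

```typescript
// Reject oversized files before any bytes reach S3, so the user gets
// immediate feedback. PendingFile is a stand-in for the browser File object.
interface PendingFile { name: string; sizeBytes: number }

function oversizedFileNames(files: PendingFile[], limitMb?: number): string[] {
  if (limitMb === undefined) return []; // no limit configured: allow everything
  const limitBytes = limitMb * 1024 * 1024;
  return files.filter(f => f.sizeBytes > limitBytes).map(f => f.name);
}
```

Any names returned would be surfaced to the user straight away, rather than after a long (and wasted) upload round-trip.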

How can success be measured?

Who should look at this?

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

@twrichards twrichards force-pushed the file-size-limit branch 3 times, most recently from 4fdf17f to 66e2cc5 Compare December 17, 2024 22:49
@twrichards twrichards marked this pull request as ready for review December 17, 2024 23:07
@twrichards twrichards requested review from a team as code owners December 17, 2024 23:07
@paperboyo
Contributor

No image should legitimately be over 500MB

Hmm, this is not strictly true: since we support TIFFs we can encounter, if rarely, genuinely larger ones, because of all the stuff we don’t care about (layers and such). But it is true that most images like this are the result of user error in creating them, not genuine. And Grid isn’t used as an archive for genuine authored multi-layered ones (we may hit this case if we want to store editable master files from a Photoshop integration, for example).

Secondly, it’s the quality of our imaging pipeline that often makes things worse/bombs: lack of parallelism, but also bad habits around TIFFs etc. (those should be helped by, and also fixed in, #4068).

Thirdly, it’s often dimensions, not filesize, and one does not always follow the other.

All that said, this is a sensible and practical stopgap, and prevents someone repeatedly uploading a (mostly absurd) image from bringing the whole thing down. The downside is that users don’t know how to reduce file size effectively and losslessly.

Maybe make it configurable (and have no effect when not configured)?

@RichardLe
Collaborator

Hi folks, agree that there should be a limit on file size; it would be good for it to be configurable. We also need the ability to handle >500MB files: we're likely to get some large scans from archives (either by error or by design) which could conceivably go over this size. Short term, happy with picking failures up from the bucket; medium term it would be good to get more targeted handling, potentially notifications to a large-image topic (or something like that).

@twrichards twrichards changed the title limit file size on ingest to 500MB file size limit for upload/ingest (optional and configurable) Dec 18, 2024
@twrichards
Contributor Author

twrichards commented Dec 18, 2024

@paperboyo @RichardLe - as per your comments it's now optional and configurable via upload.limit.mb in common.conf (see 096e761)

Contributor

@bryophyta bryophyta left a comment


tested both UI and S3 uploads in TEST and all working as described 👍👍

@prout-bot

Seen on auth, usage, image-loader, metadata-editor, thrall, leases, cropper, collections, media-api, kahuna (merged by @twrichards 8 minutes and 32 seconds ago) Please check your changes!

5 participants