Our thoughts on OCFL over S3 #522

Closed
marcolarosa opened this issue Nov 4, 2020 · 7 comments

Comments

@marcolarosa

Noting #372

This ticket is about our thinking around how to model an OCFL repo on S3. We have not implemented this yet, which is why we're looking for feedback here.

Our current demonstrator with 70TB in OCFL

We've built a demonstrator with about 70TB of data in OCFL (which you can see at http://115.146.80.165/) but for various reasons we need to bring forward our work on developing this backend option. Our id paths are as follows:

  • by convention all of our objects have an id as /{domain}/{collectionId}/{itemId}
  • this internal identifier is hashed with SHA512 and that becomes the OCFL id
    • we've chosen to do it this way so that we don't have to worry about special characters and filesystem paths as the hashing deals with that
    • by prefixing with domain we can host multiple orgs on a single filesystem without id clashes, provided they ensure their own ids are unique
  • the SHA512 is pairtree'd to get to the object path.
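A minimal Node.js sketch of the id scheme described above, not taken from ocfl-js. The function names and the sample domain, collection and item values are hypothetical, and the two-hex-characters-per-level split is an assumption based on the `00/c7/b2/62` example later in this thread.

```javascript
const { createHash } = require("node:crypto");

// Internal identifier by convention: /{domain}/{collectionId}/{itemId}
function internalId(domain, collectionId, itemId) {
  return `/${domain}/${collectionId}/${itemId}`;
}

// The OCFL id is the hex-encoded SHA512 of the internal identifier.
function ocflId(id) {
  return createHash("sha512").update(id, "utf8").digest("hex");
}

// Assumption: the pairtree uses two hex characters per directory level.
function pairtreePath(hexId) {
  return hexId.match(/.{1,2}/g).join("/");
}

const id = ocflId(internalId("example.org", "collection-1", "item-42"));
console.log(id.length); // 128
console.log(pairtreePath("00c7b262")); // 00/c7/b2/62
```

Because the hash output only contains the characters 0-9 and a-f, the resulting path is safe on any filesystem regardless of what the original identifier contained.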

Models for OCFL on S3

We think there are two ways to move from a filesystem to S3:

  1. the OCFL repo lives inside a single bucket
  2. each OCFL object is its own bucket

The repo lives inside a single bucket

At first glance this seems like the easiest option, but on further thought we don't think this solution will scale. Given that objects inside an S3 bucket are not actually hierarchical, we would expect the performance of operations on this bucket to decrease as the amount of content inside it grows. Obviously this is complicated by all of the extra path elements coming from the pairtree'd SHA512 ids.

Each OCFL object is its own bucket

In this model the SHA512 id would be the bucket name. We think this is the better option as we expect that the infra underpinning the S3 system would be optimised for mapping a bucket ID to a storage path in the cloud. Within that bucket one would find an OCFL object in the expected form. Although the performance of the bucket would degrade as the number of versions / items inside it increases, this would be trivial compared to the degradation of adding a whole object (and associated paths and stuff) into a bucket as in the first model.

Our current nodejs lib - practical considerations

We have a nodejs package that we use to interact with OCFL: https://github.com/CoEDL/ocfl-js. One of the key ideas is that updates happen outside of the OCFL hierarchy. Specifically, the library creates a deposit path and a backup path when updating an object as a way of locking changes to the OCFL object whilst an update is in progress and ensuring an atomic move once the update has completed.
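A rough filesystem sketch of the deposit/backup idea, under stated assumptions: this is not the ocfl-js implementation, and the `deposit`, `backup` and `ocfl` directory names, the `updateObject` function and its `applyChanges` callback are all hypothetical. It assumes the three directories sit on the same filesystem, so that each rename is a single atomic step.

```javascript
const fs = require("node:fs");
const path = require("node:path");

function updateObject(repoRoot, objectPath, applyChanges) {
  const target = path.join(repoRoot, "ocfl", objectPath);
  const deposit = path.join(repoRoot, "deposit", objectPath);
  const backup = path.join(repoRoot, "backup", objectPath);

  fs.mkdirSync(path.dirname(deposit), { recursive: true });
  // Non-recursive mkdir throws EEXIST if the deposit path is already there,
  // so the deposit directory doubles as a lock on the object.
  fs.mkdirSync(deposit);
  try {
    // Work on a copy outside the OCFL hierarchy.
    if (fs.existsSync(target)) fs.cpSync(target, deposit, { recursive: true });
    applyChanges(deposit);

    fs.mkdirSync(path.dirname(backup), { recursive: true });
    fs.mkdirSync(path.dirname(target), { recursive: true });
    // Swap: current object -> backup, finished deposit -> object.
    if (fs.existsSync(target)) fs.renameSync(target, backup);
    fs.renameSync(deposit, target);
    fs.rmSync(backup, { recursive: true, force: true });
  } catch (err) {
    // Roll back: restore the backup if the swap was interrupted, drop the deposit.
    if (!fs.existsSync(target) && fs.existsSync(backup)) fs.renameSync(backup, target);
    fs.rmSync(deposit, { recursive: true, force: true });
    throw err;
  }
}
```

Copying the whole object into the deposit path is what makes this approach expensive for multi-TB objects, which is the concern raised in the next paragraph.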

If the whole OCFL repo lived inside an S3 bucket, this process of creating and working with a deposit and backup path would be cumbersome. However, having a bucket per OCFL object means that the deposit and backup paths would have to live outside of the S3 system. This has pros and cons. The pro is that all object operations happen on a server outside of S3 (which I don't think can be avoided anyway), but it would require the library to first pull the whole object down before it could operate on it. In the case of very large objects (a few TB) this would result in quite significant slowdowns. (There might be ways to avoid this by using the ETag provided by AWS, but that's not really the point of this thread.)

So - how does this sound to people? Is there something missing here? Are there already examples that have tackled this question at scale?

@pwinckles

@marcolarosa Have you successfully created large numbers of buckets in the past? My understanding is that by default an AWS account may only have up to 100 buckets, and that this cap may be increased to a maximum of 1,000 (AWS reference).

You shouldn't need to download the entire object in order to update it. I wouldn't expect that you'd need anything other than a copy of the most recent inventory.
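To illustrate why the most recent inventory can be enough, here is a hedged sketch that plans a new version from the inventory alone. The field names (`head`, `manifest`, `versions`, `state`) follow the OCFL inventory structure; the `planNextVersion` function, the short sample digests and the unpadded version naming are assumptions made for the example, and writing the inventory and its sidecar back to storage is left out.

```javascript
// Plan the next version of an object from its current inventory alone.
// newState maps digest -> [logical paths] for the complete new version;
// the caller is assumed to have digested the new local files already.
function planNextVersion(inventory, newState, created) {
  // Assumption: unpadded version names (v1, v2, ...), not zero-padded ones.
  const next = `v${parseInt(inventory.head.slice(1), 10) + 1}`;
  const manifest = { ...inventory.manifest };
  const uploads = [];
  for (const [digest, logicalPaths] of Object.entries(newState)) {
    // Only content whose digest is not yet in the manifest needs uploading.
    if (!Object.prototype.hasOwnProperty.call(manifest, digest)) {
      const contentPath = `${next}/content/${logicalPaths[0]}`;
      manifest[digest] = [contentPath];
      uploads.push({ digest, contentPath });
    }
  }
  return {
    uploads,
    inventory: {
      ...inventory,
      head: next,
      manifest,
      versions: { ...inventory.versions, [next]: { created, state: newState } },
    },
  };
}

const current = {
  head: "v1",
  manifest: { aaa: ["v1/content/a.txt"] },
  versions: { v1: { created: "2020-11-04T00:00:00Z", state: { aaa: ["a.txt"] } } },
};
const plan = planNextVersion(
  current,
  { aaa: ["a.txt"], bbb: ["b.txt"] },
  "2020-11-05T00:00:00Z"
);
console.log(plan.uploads.length); // 1
console.log(plan.inventory.head); // v2
```

In this sketch the unchanged file `a.txt` is never read or transferred; only the payload for digest `bbb` and the new inventory would need to be written.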

@marcolarosa
Author

@pwinckles Actually, I didn't know there were limits to how many buckets one could create! And looking at the reference page you linked, the SHA512 id as a bucket name is also not allowed as it's too long.
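The length problem is easy to check: a hex-encoded SHA512 digest is 128 characters, while S3 bucket names must be between 3 and 63 characters. The sketch below covers only part of the naming rules (length, allowed characters, first and last character), and `isPlausibleBucketName` and the sample names are hypothetical.

```javascript
const { createHash } = require("node:crypto");

// Partial check only: 3-63 characters, lowercase letters, digits, dots and
// hyphens, starting and ending with a letter or digit. The full S3 rules
// have further restrictions that are not covered here.
function isPlausibleBucketName(name) {
  return /^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$/.test(name);
}

const sha512Id = createHash("sha512")
  .update("/example.org/collection-1/item-42")
  .digest("hex");

console.log(sha512Id.length); // 128
console.log(isPlausibleBucketName(sha512Id)); // false
console.log(isPlausibleBucketName("my-ocfl-repo")); // true
```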

So I guess we need to think this through some more!

Has anyone else tried using S3 as a backend? What were the design decisions and why were they taken?

@marcolarosa
Author

Actually, this might be workable...

There are only 256 top-level folders when pairtreeing a SHA512 id (00 - ff), so it is partially workable with a limit increase request and a naming convention for bucket names so that one could map from a SHA512 id to the correct AWS bucket:

e.g. 00c7b262.... => pairtree: 00/c7/b2/62 --> my.reverse.domain.ocfl.00 and the rest of the path exists inside the bucket...
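The mapping in the example above could be sketched like this. The `locateObject` function is hypothetical; the bucket prefix `my.reverse.domain.ocfl` comes from the example, and the two-characters-per-level split is assumed.

```javascript
// Map a hex SHA512 id to one of 256 buckets plus a key prefix inside it.
function locateObject(sha512Hex, bucketPrefix) {
  const pairs = sha512Hex.match(/.{1,2}/g);
  return {
    // The first pair (00 - ff) selects the bucket.
    bucket: `${bucketPrefix}.${pairs[0]}`,
    // The rest of the pairtree becomes the key prefix within that bucket.
    keyPrefix: pairs.slice(1).join("/"),
  };
}

const location = locateObject("00c7b262", "my.reverse.domain.ocfl");
console.log(location.bucket); // my.reverse.domain.ocfl.00
console.log(location.keyPrefix); // c7/b2/62
```

With a full 128-character id, the bucket name stays well under the 63-character limit as long as the prefix itself is short enough.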

@pwinckles

There is a working implementation in ocfl-java. I would not say that it's doing anything particularly clever or unexpected. It basically just maps storage paths directly to keys within a bucket, so what you see in a bucket is essentially the same as if it was on the filesystem, and it uses a DB for locking and resolving eventual consistency issues.

As for performance, it is very slow compared to using the local filesystem, but perhaps no slower than the server is able to transfer data to S3. It has not been tested "at scale" yet.

Sharding across buckets is an interesting idea, though it does feel a bit like a premature optimization unless you're confident it's going to be an issue. S3 scales buckets based on their load and as long as your keys are well distributed, which I would expect them to be because they're prefixed with a hash, I would expect it to be able to cope.
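As a quick sanity check on the "well distributed" assumption, the sketch below hashes 30,000 made-up identifiers and counts how many land under each two-character prefix. The identifiers and the `shardCounts` function are hypothetical, and this only shows how evenly the hash spreads keys, not how S3 itself behaves under load.

```javascript
const { createHash } = require("node:crypto");

// Count how many ids fall under each two-character hash prefix (00 - ff).
function shardCounts(ids) {
  const counts = new Map();
  for (const id of ids) {
    const shard = createHash("sha512").update(id).digest("hex").slice(0, 2);
    counts.set(shard, (counts.get(shard) || 0) + 1);
  }
  return counts;
}

const ids = Array.from(
  { length: 30000 },
  (_, i) => `/example.org/collection-1/item-${i}`
);
const counts = shardCounts(ids);
const values = [...counts.values()];
// Prints: number of distinct prefixes, smallest shard, largest shard.
console.log(counts.size, Math.min(...values), Math.max(...values));
```

With a uniform hash one would expect roughly 30000 / 256, or about 117, objects per prefix, with modest variation around that figure.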

@marcolarosa
Author

"though it does feel a bit like a premature optimization unless you're confident it's going to be an issue"

I have no idea, which is why I wanted to reach out here and see what experience others have had when working with S3 at scale. My experience of S3 is limited to very simple usage with not a lot of data. However, now I'm potentially uploading 30,000+ OCFL objects containing 106TB of data. It's been a hard slog getting 70TB out of the backup system onto a disk, so if there's a best practice out there I'd like to start from there.

@ptsefton

ptsefton commented Feb 10, 2021

I have had a couple of thoughts about OCFL on S3 - these are just musings.

When @marcolarosa was exploring this and discussing the limits of S3, such as the number of buckets and path length, he mentioned that one of the things that you would lose with an S3 implementation would be the ability to inspect a filesystem to see what's where. I wonder if there could be a hybrid implementation mode where you keep the OCFL structure as a "skeleton" on a standard filesystem but the payload files actually contain URIs or similar that point to the content. So open up somefile.mp4 or somefile.mp4.oclf_link and it would have in it text contents like http://example.com/my-repo/some_hash (I don't know what a URI or ID for S3 might look like).

Bit of a hack, yes, but it would (a) preserve the ability for naive users to find files and where they are meant to be and (b) would allow for storage-by-hash in a remote service, giving you repository-wide de-duplication (rather than the object level you get at the moment).
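A small sketch of the storage-by-hash part of this idea, using an in-memory `Map` as a stand-in for the remote service. The `depositPayload` function and the choice of SHA512 are assumptions for illustration; the base URI is the one from the example above. The returned string is what a link file in the skeleton would contain.

```javascript
const { createHash } = require("node:crypto");

// Store content under its digest and return the URI a link file would hold.
function depositPayload(remoteStore, baseUri, content) {
  const digest = createHash("sha512").update(content).digest("hex");
  // Identical bytes always hash to the same key, so they are stored once.
  if (!remoteStore.has(digest)) remoteStore.set(digest, content);
  return `${baseUri}/${digest}`;
}

const remoteStore = new Map();
const base = "http://example.com/my-repo";
// The same payload deposited from two different OCFL objects.
const linkA = depositPayload(remoteStore, base, Buffer.from("same bytes"));
const linkB = depositPayload(remoteStore, base, Buffer.from("same bytes"));
console.log(linkA === linkB, remoteStore.size); // true 1
```

Both skeleton files end up pointing at the same stored payload, which is the repository-wide de-duplication described in (b).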

@neilsjefferies
Member

Merged into #372
