
Fixed an issue that could cause checksum mismatch errors in S3 uploads. #5836

Merged
5 commits merged into master on Jan 30, 2025

Conversation

@millems (Contributor) commented Jan 29, 2025

This error is sometimes encountered when a customer uses:

  1. The AsyncS3Client
  2. A ChecksumAlgorithm of SHA1 or SHA256 (instead of the default CRC32)
  3. Parallel uploads

The root cause was the SDK using thread locals to cache the SHA1 and SHA256 message digest implementations. This meant that if a single event loop thread was processing multiple requests, those requests would share the same digest instance when calculating their checksums.

This PR updates the SHA1 and SHA256 (and MD5, though it's not used by S3) digest caching to use a LIFO cache instead.
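
To make the failure mode concrete, here is a minimal, self-contained sketch (not SDK code) of why a thread-local digest breaks under the async client: two requests multiplexed onto the same event loop thread receive the same MessageDigest instance, so their update() calls interleave and the resulting checksum matches neither payload.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Minimal sketch of the hazard: a thread-local digest shared by two logical requests
// that happen to run on the same event loop thread.
public class SharedDigestHazard {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        ThreadLocal<MessageDigest> cachedDigest = ThreadLocal.withInitial(() -> {
            try {
                return MessageDigest.getInstance("SHA-256");
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e);
            }
        });

        byte[] requestABody = "request-A-body".getBytes(StandardCharsets.UTF_8);
        byte[] requestBBody = "request-B-body".getBytes(StandardCharsets.UTF_8);

        // Both "requests" run on this one thread, so they get the same cached instance.
        MessageDigest digestForA = cachedDigest.get();
        MessageDigest digestForB = cachedDigest.get();
        digestForA.update(requestABody);
        digestForB.update(requestBBody); // silently mixed into request A's digest state

        byte[] mixedChecksum = digestForA.digest();
        byte[] expectedChecksum = MessageDigest.getInstance("SHA-256").digest(requestABody);
        System.out.println(Arrays.equals(mixedChecksum, expectedChecksum)); // prints false
    }
}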

@millems millems requested a review from a team as a code owner January 29, 2025 00:43

// Avoid over-caching after large traffic bursts. The maximum chosen here is arbitrary. It's also not strictly
// enforced, since these statements aren't synchronized.
if (digestCache.size() <= MAX_CACHED_DIGESTS) {
Contributor
size() may be expensive; should we track the size using an atomic integer instead?


How would you coordinate the atomic size and the concurrent deque?

Contributor Author

I'll switch to a linked blocking deque. It uses locking, but it's still likely faster than creating new message digests, and it has a constant-time size() method. We can benchmark later to verify this is the fastest approach.

Is that reasonable?
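
For illustration, here is a rough sketch of the linked-blocking-deque approach described above. The class and constant names (DigestCache, MAX_CACHED_DIGESTS) are illustrative rather than the SDK's actual identifiers; the point is that a capacity-bounded LinkedBlockingDeque enforces the maximum itself via offerFirst(), and its size() is constant time.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.LinkedBlockingDeque;

// Sketch of a LIFO digest cache backed by a capacity-bounded LinkedBlockingDeque.
final class DigestCache {
    private static final int MAX_CACHED_DIGESTS = 100; // arbitrary bound, as noted in the diff comment

    private final LinkedBlockingDeque<MessageDigest> cache = new LinkedBlockingDeque<>(MAX_CACHED_DIGESTS);
    private final String algorithmName;

    DigestCache(String algorithmName) {
        this.algorithmName = algorithmName;
    }

    MessageDigest acquire() {
        MessageDigest digest = cache.pollFirst(); // LIFO: reuse the most recently released digest
        if (digest != null) {
            digest.reset(); // make sure no state from a previous request leaks into this one
            return digest;
        }
        try {
            return MessageDigest.getInstance(algorithmName);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("Unable to fetch message digest instance for Algorithm "
                                       + algorithmName + ": " + e.getMessage(), e);
        }
    }

    void release(MessageDigest digest) {
        // offerFirst drops the digest when the deque is already at capacity, so the bound is
        // enforced by the deque itself rather than by an unsynchronized size() check.
        cache.offerFirst(digest);
    }
}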

DigestThreadLocal(String algorithmName) {
this.algorithmName = algorithmName;
/**
* Retrieve the message digest bytes. This will close the message digest when invoked. This is because the underlying
Contributor

Question: where do we close messageDigest in this method?

Contributor Author

Good catch, that needs a test added.
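
As context for the javadoc quoted above, here is a hedged sketch of a wrapper whose digest() call also closes it. The SDK's real CloseableMessageDigest is constructed from a cloned digest and its internals (and the truncated rationale in the javadoc) may differ; passing the cache into the constructor here is only to keep the sketch self-contained.

import java.security.MessageDigest;
import java.util.Deque;

// Sketch only: a message digest wrapper that returns its delegate to a cache when closed.
// digest() closes the wrapper as well, since MessageDigest.digest() resets the delegate's
// state and it cannot continue the same hash afterwards anyway.
final class CloseableMessageDigest implements AutoCloseable {
    private final MessageDigest delegate;
    private final Deque<MessageDigest> cache;
    private byte[] result;
    private boolean closed;

    CloseableMessageDigest(MessageDigest delegate, Deque<MessageDigest> cache) {
        this.delegate = delegate;
        this.cache = cache;
    }

    void update(byte[] bytes) {
        delegate.update(bytes);
    }

    byte[] digest() {
        if (result == null) {
            result = delegate.digest();
            close(); // hand the delegate back to the cache as soon as the bytes are retrieved
        }
        return result;
    }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            cache.offerFirst(delegate);
        }
    }
}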


throw new RuntimeException("Unable to fetch message digest instance for Algorithm "
+ algorithmName + ": " + e.getMessage(), e);
return new CloseableMessageDigest((MessageDigest) digest.get().clone());
} catch (CloneNotSupportedException e) { // should never occur

Any time a comment says "should never occur", it seems to happen. Why can't this method declare that it throws CloneNotSupportedException instead?

Contributor Author

CloneNotSupportedException is a checked exception. Because the SDKs don't throw checked exceptions, if this method declared it, the callers would need to wrap it in an unchecked exception themselves.

Contributor Author

I've improved the exception message if this scenario does happen.
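
A minimal sketch of the wrapping pattern being described, with invented names (DigestCloner, cloneOf): the checked CloneNotSupportedException is converted into an unchecked exception whose message names the algorithm, so SDK callers never have to handle the checked type.

import java.security.MessageDigest;

// Sketch only: wrap the checked CloneNotSupportedException in an unchecked exception.
final class DigestCloner {
    static MessageDigest cloneOf(MessageDigest prototype) {
        try {
            // Clone the cached digest rather than looking one up via getInstance for every request.
            return (MessageDigest) prototype.clone();
        } catch (CloneNotSupportedException e) {
            // The JDK's standard SHA-1/SHA-256/MD5 digests support clone(), so reaching this
            // branch suggests an unusual provider; say which algorithm was involved.
            throw new IllegalStateException("MessageDigest for " + prototype.getAlgorithm()
                                            + " does not support cloning.", e);
        }
    }
}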

Comment on lines 106 to 107
// Avoid over-caching after large traffic bursts. The maximum chosen here is arbitrary. It's also not strictly
// enforced, since these statements aren't synchronized.
Contributor

Would it be possible to have a "prefilled" cache of lazy loaded digests, and then when we can't use a cached digest, we have a different instance that just closes and doesn't need to interact with the cache?

Contributor Author

That's an option. It might be trickier to decide on the size of the cache, and it means that failing to release one of those special "cached" digests back to the cache would have long-term performance implications. The advantage of this implementation is that the odd error that fails to release a digest back to the cache doesn't really hurt.

digestCache.addFirst(digest.get());
}
// Drop this digest if the cache is full.
digestCache.offerFirst(digest.get());
Contributor

Nice

@millems millems added this pull request to the merge queue Jan 29, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 29, 2025
@millems millems added this pull request to the merge queue Jan 29, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 29, 2025
@millems millems added this pull request to the merge queue Jan 30, 2025
Merged via the queue into master with commit c137348 Jan 30, 2025
18 checks passed