Fixed an issue that could cause checksum mismatch errors in S3 uploads. #5836
Conversation
This error is sometimes encountered when a customer uses:
1. The AsyncS3Client
2. A ChecksumAlgorithm of SHA1 or SHA256 (instead of the default CRC32)
3. Parallel uploads

The root cause was the SDK using thread locals to cache the SHA1 or SHA256 message digest implementations. This meant that if a single event loop thread was processing multiple requests, those requests would use the same digest implementation to calculate the checksum.
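As a rough illustration of that failure mode (this is not the SDK's actual code; the class and variable names below are made up), two requests whose updates interleave on the same event loop thread end up feeding one thread-local digest, so neither request's checksum matches its own payload:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustration only: a thread-local SHA-256 digest shared by two logical requests
// running on the same event loop thread.
public class SharedDigestIllustration {
    private static final ThreadLocal<MessageDigest> SHA256 = ThreadLocal.withInitial(() -> {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    });

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] request1Payload = "payload-1".getBytes();
        byte[] request2Payload = "payload-2".getBytes();

        // Both requests resolve to the same MessageDigest instance because they run
        // on the same thread, so their updates interleave.
        MessageDigest shared = SHA256.get();
        shared.update(request1Payload);   // request 1
        shared.update(request2Payload);   // request 2 pollutes request 1's state
        byte[] request1Checksum = shared.digest();

        byte[] expected = MessageDigest.getInstance("SHA-256").digest(request1Payload);
        // Prints false: the uploaded checksum no longer matches the payload.
        System.out.println(MessageDigest.isEqual(request1Checksum, expected));
    }
}
```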
// Avoid over-caching after large traffic bursts. The maximum chosen here is arbitrary. It's also not strictly
// enforced, since these statements aren't synchronized.
if (digestCache.size() <= MAX_CACHED_DIGESTS) {
size() may be expensive; should we track the size with an AtomicInteger instead?
How would you coordinate the atomic size and the concurrent deque?
I'll switch to a LinkedBlockingDeque. It uses locking, but it's still likely faster than creating new checksums, and it has a constant-time size() method. We can benchmark later to verify this is the fastest approach.
Is that reasonable?
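A minimal sketch of what that could look like (illustrative names, not the SDK's implementation): a capacity-bounded LinkedBlockingDeque keeps both the size bound and the constant-time size accounting inside the deque itself, so no separate size() check or atomic counter is needed.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.LinkedBlockingDeque;

// Sketch only: names are illustrative, not the SDK's. A capacity-bounded deque keeps
// the cache-size enforcement and the constant-time size inside the deque itself.
final class BoundedDigestCache {
    private static final int MAX_CACHED_DIGESTS = 1_000; // arbitrary cap, as in the review discussion

    private final LinkedBlockingDeque<MessageDigest> cache = new LinkedBlockingDeque<>(MAX_CACHED_DIGESTS);
    private final String algorithmName;

    BoundedDigestCache(String algorithmName) {
        this.algorithmName = algorithmName;
    }

    MessageDigest borrow() throws NoSuchAlgorithmException {
        // LIFO: reuse the most recently released digest, or create a new one on a miss.
        MessageDigest cached = cache.pollFirst();
        return cached != null ? cached : MessageDigest.getInstance(algorithmName);
    }

    void release(MessageDigest digest) {
        digest.reset();
        cache.offerFirst(digest); // returns false and drops the digest when the cache is full
    }
}
```

With a bounded deque, offerFirst fails quietly when the cap is reached, which lines up with the "drop this digest if the cache is full" comment further down the diff.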
DigestThreadLocal(String algorithmName) {
    this.algorithmName = algorithmName;
/**
 * Retrieve the message digest bytes. This will close the message digest when invoked. This is because the underlying
Question: where do we close messageDigest in this method?
Good catch, that needs a test added.
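For context, here is a rough sketch of the "close on digest()" behavior described by the Javadoc quoted above. The names here are hypothetical, not the SDK's actual CloseableMessageDigest; the point is that retrieving the bytes releases the digest back to a LIFO cache, and close() stays safe to call more than once.

```java
import java.security.MessageDigest;
import java.util.Deque;

// Sketch only (hypothetical names): a wrapper that releases its digest back to a
// per-algorithm LIFO cache when the digest bytes are retrieved, and is safe to
// close more than once.
final class CloseableDigest implements AutoCloseable {
    private final MessageDigest delegate;
    private final Deque<MessageDigest> cache;
    private boolean released;

    CloseableDigest(MessageDigest delegate, Deque<MessageDigest> cache) {
        this.delegate = delegate;
        this.cache = cache;
    }

    void update(byte[] data) {
        delegate.update(data);
    }

    byte[] digest() {
        byte[] result = delegate.digest();
        close(); // "close on digest()", as described in the Javadoc quoted above
        return result;
    }

    @Override
    public void close() {
        if (!released) {
            released = true;
            delegate.reset();
            cache.offerFirst(delegate); // silently dropped if the bounded cache is full
        }
    }
}
```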
|
||
// Avoid over-caching after large traffic bursts. The maximum chosen here is arbitrary. It's also not strictly | ||
// enforced, since these statements aren't synchronized. | ||
if (digestCache.size() <= MAX_CACHED_DIGESTS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you coordinate the atomic size and the concurrent deque?
throw new RuntimeException("Unable to fetch message digest instance for Algorithm "
                           + algorithmName + ": " + e.getMessage(), e);
return new CloseableMessageDigest((MessageDigest) digest.get().clone());
} catch (CloneNotSupportedException e) { // should never occur
Any time a comment says "should never occur", it seems to happen. Why can't this method declare that it throws CloneNotSupportedException instead?
CloneNotSupportedException is a checked exception. Because the SDKs don't throw checked exceptions, the callers would need to wrap it in an unchecked exception themselves.
I've improved the exception message if this scenario does happen.
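For illustration, one way to hide the checked exception behind an unchecked one with a descriptive message (a sketch with made-up names, not the SDK's code):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only (made-up names): cloning a prototype digest and translating the checked
// CloneNotSupportedException into an unchecked exception with a descriptive message.
final class DigestCloner {
    private final MessageDigest prototype;

    DigestCloner(String algorithmName) throws NoSuchAlgorithmException {
        this.prototype = MessageDigest.getInstance(algorithmName);
    }

    MessageDigest newDigest() {
        try {
            // Most JDK MessageDigest implementations are Cloneable, but the spec does not require it.
            return (MessageDigest) prototype.clone();
        } catch (CloneNotSupportedException e) {
            throw new IllegalStateException("The MessageDigest implementation for "
                                            + prototype.getAlgorithm()
                                            + " does not support clone(), so it cannot be cached and copied.", e);
        }
    }
}
```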
// Avoid over-caching after large traffic bursts. The maximum chosen here is arbitrary. It's also not strictly
// enforced, since these statements aren't synchronized.
Would it be possible to have a "prefilled" cache of lazy loaded digests, and then when we can't use a cached digest, we have a different instance that just closes and doesn't need to interact with the cache?
That's an option. It might be trickier to decide on the size of the cache, and it means that failing to release a message digest back to the cache would have long-term performance implications if it's one of those special "cached" digests. The advantage of this implementation is that the odd error that fails to release a digest back to the cache doesn't really hurt.
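A rough sketch of that alternative, with made-up names and eager initialization instead of lazy loading for brevity: a fixed pool filled up front, plus throwaway digests once the pool is empty. As noted above, a pooled digest that is never released permanently shrinks the pool.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch only (made-up names, eagerly filled rather than lazily loaded, for brevity):
// a fixed pool of digests plus throwaway instances once the pool is empty. A fuller
// version would tag pooled digests so one-off instances never touch the pool.
final class PrefilledDigestPool {
    private final BlockingQueue<MessageDigest> pool;
    private final String algorithmName;

    PrefilledDigestPool(String algorithmName, int poolSize) throws NoSuchAlgorithmException {
        this.algorithmName = algorithmName;
        this.pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(MessageDigest.getInstance(algorithmName));
        }
    }

    MessageDigest acquire() throws NoSuchAlgorithmException {
        MessageDigest pooled = pool.poll();
        // Pool exhausted: hand out a one-off digest instead of blocking.
        return pooled != null ? pooled : MessageDigest.getInstance(algorithmName);
    }

    void release(MessageDigest digest) {
        digest.reset();
        pool.offer(digest); // anything beyond the pool's capacity is dropped
    }
}
```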
digestCache.addFirst(digest.get());
}
// Drop this digest if the cache is full.
digestCache.offerFirst(digest.get());
Nice
This PR updates the SHA1 and SHA256 (and MD5, though it's not used by S3) checksum implementations to use a LIFO cache instead of thread locals.