-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metrics to express data cache #1146
Conversation
if result.is_err() { | ||
Err(parse_get_object_error(result).map(ObjectClientError::ServiceError)) | ||
let err = parse_get_object_error(result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
} | ||
|
||
fn emit_failure_metric(&self, reason: &'static str, type_: MetricsType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest inlining the metrics calls. This function makes it less clear what you are tracking.
@@ -113,11 +137,15 @@ where | |||
block_offset: u64, | |||
object_size: usize, | |||
) -> DataCacheResult<Option<ChecksummedBytes>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too many points of failure.
Shall we factor out the logic into an inner method and only handle the metrics here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully, we won't need handle_get_object_err
anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not planning on refactoring into an inner method, as that makes the reasons harder to propagate. Did remove the handle_get_object_err
function though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative could be to still split into an inner function which records the failure (and reason), but move the hit/miss metric to the wrapper.
Happy to review the approach later, though.
.get_object_metadata() | ||
.await | ||
.map_err(|err| DataCacheError::IoFailure(err.into()))?; | ||
let object_metadata = match result.get_object_metadata().await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a workaround for the get_object
errors above, why not get the data first here? That will guarantee to return the correct error before trying to get metadata or checksums.
339d2ce
to
31877fb
Compare
31877fb
to
e6efbf1
Compare
Signed-off-by: Simon Beal <[email protected]>
e6efbf1
to
95eb91b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, but completely optional.
} | ||
|
||
#[inline] | ||
fn emit_failure_metric_write(reason: &'static str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, non-blocking: we could just inline
@@ -113,11 +137,15 @@ where | |||
block_offset: u64, | |||
object_size: usize, | |||
) -> DataCacheResult<Option<ChecksummedBytes>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative could be to still split into an inner function which records the failure (and reason), but move the hit/miss metric to the wrapper.
Happy to review the approach later, though.
…wslabs#1162) Adds additional duration metrics to the shared cache for cache hits, misses, and errors. Example metrics: ``` 2024-12-06T14:11:43.012775Z INFO mountpoint_s3::metrics: express_data_cache.block_err[reason=invalid_block_offset,type=read]: 189 (n=189) 2024-12-06T14:11:43.012802Z INFO mountpoint_s3::metrics: express_data_cache.block_hit: 0 (n=189) 2024-12-06T14:11:43.012817Z INFO mountpoint_s3::metrics: express_data_cache.read_duration_us[type=error]: n=189: min=3 p10=3 p50=4 avg=3.87 p90=5 p99=5 p99.9=6 max=6 2024-12-06T14:11:43.012831Z INFO mountpoint_s3::metrics: express_data_cache.total_bytes[type=write]: 380 (n=190) 2024-12-06T14:11:43.012844Z INFO mountpoint_s3::metrics: express_data_cache.write_duration_us[type=ok]: n=190: min=8256 p10=8511 p50=8895 avg=8882.19 p90=9343 p99=9535 p99.9=9663 max=9663 ``` And ``` 2024-12-06T16:06:14.462602Z INFO mountpoint_s3::metrics: express_data_cache.block_hit: 98 (n=100) 2024-12-06T16:06:14.462628Z INFO mountpoint_s3::metrics: express_data_cache.read_duration_us[type=miss]: n=2: min=21120 p10=21247 p50=21247 avg=21824.00 p90=22527 p99=22527 p99.9=22527 max=22527 2024-12-06T16:06:14.462641Z INFO mountpoint_s3::metrics: express_data_cache.read_duration_us[type=ok]: n=98: min=5888 p10=6015 p50=6271 avg=6378.94 p90=6559 p99=14079 p99.9=14079 max=14079 2024-12-06T16:06:14.462652Z INFO mountpoint_s3::metrics: express_data_cache.total_bytes[type=read]: 196 (n=98) 2024-12-06T16:06:14.462663Z INFO mountpoint_s3::metrics: express_data_cache.total_bytes[type=write]: 4 (n=2) 2024-12-06T16:06:14.462673Z INFO mountpoint_s3::metrics: express_data_cache.write_duration_us[type=ok]: n=2: min=9408 p10=9471 p50=9471 avg=19280.00 p90=29183 p99=29183 p99.9=29183 max=29183 ``` Additionally refactors the cache in response to comments in awslabs#1146 ### Does this change impact existing behavior? Yes, the `express_data_cache.read_duration_us` metric now has a type associated with if it was a cache hit or not. ### Does this change need a changelog entry? No, changes to metrics don't need changelog entries. --- By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the [Developer Certificate of Origin (DCO)](https://developercertificate.org/). --------- Signed-off-by: Simon Beal <[email protected]>
Description of change
Adds metrics to express data cache
Fixes a bug where getting a cache miss would be reported as an error rather than a cache miss
Relevant issues: N/A
Does this change impact existing behavior?
Adds metrics, no user facing functionality changes.
Does this change need a changelog entry in any of the crates?
No
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).