Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics to express data cache #1146

Merged
merged 1 commit into from
Nov 20, 2024

Conversation

muddyfish
Copy link
Contributor

Description of change

Adds metrics to express data cache
Fixes a bug where getting a cache miss would be reported as an error rather than a cache miss

Relevant issues: N/A

Does this change impact existing behavior?

Adds metrics, no user facing functionality changes.

Does this change need a changelog entry in any of the crates?

No


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).

@muddyfish muddyfish requested review from passaro and vladem November 19, 2024 16:35
@muddyfish muddyfish temporarily deployed to PR integration tests November 19, 2024 16:35 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 19, 2024 16:35 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 19, 2024 16:35 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 19, 2024 16:35 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 19, 2024 16:35 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 19, 2024 16:35 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 19, 2024 16:35 — with GitHub Actions Inactive
if result.is_err() {
Err(parse_get_object_error(result).map(ObjectClientError::ServiceError))
let err = parse_get_object_error(result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is problematic: the result has not been processed yet - see these comments.

I think this is the same mistake that causes issues for PUT: #1007.

Unless we review the callbacks, the correct thing to do would be to wait for the request to finish and return the fully parsed error.

}
}

fn emit_failure_metric(&self, reason: &'static str, type_: MetricsType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest inlining the metrics calls. This function makes it less clear what you are tracking.

@@ -113,11 +137,15 @@ where
block_offset: u64,
object_size: usize,
) -> DataCacheResult<Option<ChecksummedBytes>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many points of failure.
Shall we factor out the logic into an inner method and only handle the metrics here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully, we won't need handle_get_object_err anymore.

Copy link
Contributor Author

@muddyfish muddyfish Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not planning on refactoring into an inner method, as that makes the reasons harder to propagate. Did remove the handle_get_object_err function though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative could be to still split into an inner function which records the failure (and reason), but move the hit/miss metric to the wrapper.

Happy to review the approach later, though.

.get_object_metadata()
.await
.map_err(|err| DataCacheError::IoFailure(err.into()))?;
let object_metadata = match result.get_object_metadata().await {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a workaround for the get_object errors above, why not get the data first here? That will guarantee to return the correct error before trying to get metadata or checksums.

@muddyfish muddyfish force-pushed the express-cache-metrics branch from 339d2ce to 31877fb Compare November 20, 2024 16:40
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:40 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:40 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:40 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:40 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:40 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:40 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:40 — with GitHub Actions Failure
@muddyfish muddyfish force-pushed the express-cache-metrics branch from 31877fb to e6efbf1 Compare November 20, 2024 16:43
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:43 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:43 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:43 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:43 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:43 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:43 — with GitHub Actions Failure
@muddyfish muddyfish had a problem deploying to PR integration tests November 20, 2024 16:43 — with GitHub Actions Failure
@muddyfish muddyfish force-pushed the express-cache-metrics branch from e6efbf1 to 95eb91b Compare November 20, 2024 16:44
@muddyfish muddyfish temporarily deployed to PR integration tests November 20, 2024 16:44 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 20, 2024 16:44 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 20, 2024 16:44 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 20, 2024 16:44 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 20, 2024 16:44 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 20, 2024 16:44 — with GitHub Actions Inactive
@muddyfish muddyfish temporarily deployed to PR integration tests November 20, 2024 16:44 — with GitHub Actions Inactive
@muddyfish muddyfish requested a review from passaro November 20, 2024 16:45
Copy link
Contributor

@passaro passaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, but completely optional.

}

#[inline]
fn emit_failure_metric_write(reason: &'static str) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, non-blocking: we could just inline

@@ -113,11 +137,15 @@ where
block_offset: u64,
object_size: usize,
) -> DataCacheResult<Option<ChecksummedBytes>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative could be to still split into an inner function which records the failure (and reason), but move the hit/miss metric to the wrapper.

Happy to review the approach later, though.

@muddyfish muddyfish enabled auto-merge November 20, 2024 17:07
@muddyfish muddyfish added this pull request to the merge queue Nov 20, 2024
Merged via the queue into awslabs:main with commit f7b4524 Nov 20, 2024
23 checks passed
@muddyfish muddyfish deleted the express-cache-metrics branch November 20, 2024 17:23
sthagen pushed a commit to sthagen/awslabs-mountpoint-s3 that referenced this pull request Dec 21, 2024
…wslabs#1162)

Adds additional duration metrics to the shared cache for cache hits,
misses, and errors.

Example metrics:
```
2024-12-06T14:11:43.012775Z  INFO mountpoint_s3::metrics: express_data_cache.block_err[reason=invalid_block_offset,type=read]: 189 (n=189)
2024-12-06T14:11:43.012802Z  INFO mountpoint_s3::metrics: express_data_cache.block_hit: 0 (n=189)
2024-12-06T14:11:43.012817Z  INFO mountpoint_s3::metrics: express_data_cache.read_duration_us[type=error]: n=189: min=3 p10=3 p50=4 avg=3.87 p90=5 p99=5 p99.9=6 max=6
2024-12-06T14:11:43.012831Z  INFO mountpoint_s3::metrics: express_data_cache.total_bytes[type=write]: 380 (n=190)
2024-12-06T14:11:43.012844Z  INFO mountpoint_s3::metrics: express_data_cache.write_duration_us[type=ok]: n=190: min=8256 p10=8511 p50=8895 avg=8882.19 p90=9343 p99=9535 p99.9=9663 max=9663
```

And
```
2024-12-06T16:06:14.462602Z  INFO mountpoint_s3::metrics: express_data_cache.block_hit: 98 (n=100)
2024-12-06T16:06:14.462628Z  INFO mountpoint_s3::metrics: express_data_cache.read_duration_us[type=miss]: n=2: min=21120 p10=21247 p50=21247 avg=21824.00 p90=22527 p99=22527 p99.9=22527 max=22527
2024-12-06T16:06:14.462641Z  INFO mountpoint_s3::metrics: express_data_cache.read_duration_us[type=ok]: n=98: min=5888 p10=6015 p50=6271 avg=6378.94 p90=6559 p99=14079 p99.9=14079 max=14079
2024-12-06T16:06:14.462652Z  INFO mountpoint_s3::metrics: express_data_cache.total_bytes[type=read]: 196 (n=98)
2024-12-06T16:06:14.462663Z  INFO mountpoint_s3::metrics: express_data_cache.total_bytes[type=write]: 4 (n=2)
2024-12-06T16:06:14.462673Z  INFO mountpoint_s3::metrics: express_data_cache.write_duration_us[type=ok]: n=2: min=9408 p10=9471 p50=9471 avg=19280.00 p90=29183 p99=29183 p99.9=29183 max=29183

```

Additionally refactors the cache in response to comments in
awslabs#1146

### Does this change impact existing behavior?

Yes, the `express_data_cache.read_duration_us` metric now has a type
associated with if it was a cache hit or not.

### Does this change need a changelog entry?

No, changes to metrics don't need changelog entries.

---

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license and I agree to the terms of
the [Developer Certificate of Origin
(DCO)](https://developercertificate.org/).

---------

Signed-off-by: Simon Beal <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants