Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AggregationSelector is not needed anymore #2085

Merged

Conversation

fraillt
Copy link
Contributor

@fraillt fraillt commented Sep 8, 2024

Hello,

I'm not sure if this is a right way to approach this, but basically I think that AggregationSelector trait, and many with_aggregation_selector configuration methods on various public SDK types is not needed anymore, because newly introduced Views feature fully replaces this trait and removing AggregationSelector only makes things simpler/more understandable.

I created this PR as a proof to show that we can simply remove it without loosing any current functionality.

The only issue is this sentence in SDK:

If the MeterProvider has no View registered, take the Instrument and apply the default Aggregation on the basis of instrument kind according to the MetricReader instance’s aggregation property.

If I'm not missing anything, I think it could be changed to:

If the MeterProvider has no View registered, take the Instrument and apply the default aggregation.

What do you think, do you agree, or I'm missing something important here?

Changes

  • Remove AggregationSelector trait
  • Remove with_aggregation_selector configuration function from many public types.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@fraillt fraillt requested a review from a team September 8, 2024 17:27
Copy link

linux-foundation-easycla bot commented Sep 8, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: fraillt / name: Mindaugas Vinkelis (0f016ce)

@cijothomas
Copy link
Member

cijothomas commented Sep 8, 2024

This looks okay to me, thanks. We need have CLA agreement signed (one-time) before accepting any changes to the repo, as this is the mandatory requirement from OpenTelemetry/Linux-Foundation organization which hosts the project.

Though Views allow one to change default aggregation, I haven't observed demand for that capability yet!. That capability is not exposed in OTel .NET Metrics, and no one has ever asked for it so far (its been GA since late 2021). We could even remove that capability from OTel Rust too for first stable release to keep public API surface minimal, but this can be separately discussed.

@fraillt fraillt force-pushed the aggregation-selector-not-needed branch from 465a021 to 9a02aab Compare September 8, 2024 17:49
@fraillt
Copy link
Contributor Author

fraillt commented Sep 9, 2024

Thanks for guidance and your input!

For my use case it was enough when AggregatorSelector had signature of fn aggregator_for(&self, descriptor: &Descriptor) -> Option<Arc<dyn Aggregator + Send + Sync>>. It wasn't super convenient to use, but having instrument name and kind was enough to set proper histogram bucket sizes.
Views are more powerful and I like to have a power, but in my opinion they are too powerful, especially considering that they are not requested that much :). Regarding .NET i'm a bit surprised, maybe they have non-compatible (with SDK) ways to achieve same things? (worst case you implement your own MetricReader :))

In my opinion, supporting 0-N number of views was a mistake, because it's not obvious how things behave without reading very carefully SDK (or even better source code), and what is worse, you'll get warnings if you use * to change default.
If it would be exactly 1 view, things could be much simpler. I would be happy to have a simple interface with a fn change_view(&self, stream: StreamView) -> StreamView.

  • Default implementation would be stream
  • Drop everything would be stream.aggregation(Drop)
  • For more complicated scenarios StreamView could have some helper functions like matches_name that could use * and ?, but for the most part you would write simple code.
if stream.matches_name("histogram*") {
  return stream.aggregation(Sum)
} 
if stream.scope.name =="mylib" {
  return stream.allowed_attributes([Key::new("size")])
} 
if stream.scope.name == "otherlib" && if stream.name == "my_metric" {
  return stream.description("even nicer description for other lib")
    .unit("X")
}
stream // leave unchanged default
// or .stream.aggregation(Drop) to drop the rest

Anyway, thanks again, and I'll try to get CLA signed.

@fraillt fraillt force-pushed the aggregation-selector-not-needed branch from 9a02aab to 35f1c8c Compare September 9, 2024 08:12
@cijothomas
Copy link
Member

Regarding .NET i'm a bit surprised, maybe they have non-compatible (with SDK) ways to achieve same things? (worst case you implement your own MetricReader :))

No there is no other way. OTel .NET simply don't allow one to change aggregation. Ofcourse it does allow one to change HistogramBuckets, and also flip between Explicit vs Exponential Histogram, but thats about it. One cannot take a Counter and ask it to be aggregated as Histogram and things like this.

@cijothomas
Copy link
Member

@fraillt Your concerns are valid. Views does have valid use cases, but it is also true that its somewhat easy to mis-use and cause issues. My original plan was to remove Views completely from 1st stable release.. It is entirely possible we do that, or possible that we trim it down.

The most common use case for Views is to change Histogram Buckets - @utpilla is looking at fixing #1241 so one can change Histogram Buckets without having to learn about Views.

Is your scenario also just about customizing histogram buckets?

Irrespective of this, the direction of this PR is good.

@fraillt
Copy link
Contributor Author

fraillt commented Sep 9, 2024

Yes, currently I need to customize histogram buckets, but I actually I might want to use rename or aggregation drop functionality as well... :)
Don't get me wrong, I actually think that views are very good addition, and given what they provide I think they have good design. It's just it might be too powerful and that power adds a costs in terms of user experience.

Anyway, what should I do to finish my work? Is updating appropriate CHANGELOG.md files would be enough?

@cijothomas
Copy link
Member

Anyway, what should I do to finish my work? Is updating appropriate CHANGELOG.md files would be enough?

Please pick latest main, and see if the CI turns green. I'll get back to reviewing this after the #2088 release is done! Thanks for waiting!

@fraillt fraillt force-pushed the aggregation-selector-not-needed branch from 35f1c8c to 2ec0110 Compare September 9, 2024 19:58
Copy link

codecov bot commented Sep 10, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 7 lines in your changes missing coverage. Please review.

Project coverage is 78.2%. Comparing base (5269660) to head (0f016ce).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
opentelemetry-otlp/src/metric.rs 0.0% 2 Missing ⚠️
opentelemetry-sdk/src/metrics/pipeline.rs 87.5% 2 Missing ⚠️
opentelemetry-otlp/src/exporter/http/mod.rs 0.0% 1 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/mod.rs 0.0% 1 Missing ⚠️
opentelemetry-sdk/src/metrics/manual_reader.rs 0.0% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2085     +/-   ##
=======================================
+ Coverage   78.0%   78.2%   +0.2%     
=======================================
  Files        121     121             
  Lines      20936   20849     -87     
=======================================
- Hits       16332   16314     -18     
+ Misses      4604    4535     -69     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@fraillt
Copy link
Contributor Author

fraillt commented Sep 10, 2024

This test is flaky.

 ---- prometheus_exporter_integration stdout ----
thread 'prometheus_exporter_integration' panicked at opentelemetry-prometheus/tests/integration_test.rs:412:5:
assertion `left == right` failed: sanitized attributes to labels

Do you know if anyone is working on it? What should I do?

How to reproduce:

  1. Apply this patch on main
diff --git a/opentelemetry-prometheus/Cargo.toml b/opentelemetry-prometheus/Cargo.toml
index 3bbe65e3..22a33737 100644
--- a/opentelemetry-prometheus/Cargo.toml
+++ b/opentelemetry-prometheus/Cargo.toml
@@ -21,13 +21,13 @@ rustdoc-args = ["--cfg", "docsrs"]
 
 [dependencies]
 once_cell = { workspace = true }
-opentelemetry = { version = "0.24", default-features = false, features = ["metrics"] }
-opentelemetry_sdk = { version = "0.24", default-features = false, features = ["metrics"] }
+opentelemetry = { default-features = false, features = ["metrics"], path = "../opentelemetry" }
+opentelemetry_sdk = { default-features = false, features = ["metrics"], path = "../opentelemetry-sdk" }
 prometheus = "0.13"
 protobuf = "2.14"
 
 [dev-dependencies]
-opentelemetry-semantic-conventions = { version = "0.16" }
+opentelemetry-semantic-conventions = { path = "../opentelemetry-semantic-conventions" }
 http-body-util = { workspace = true }
 hyper = { workspace = true, features = ["full"] }
 hyper-util = { workspace = true, features = ["full"] }
  1. run a test several times cargo test --package opentelemetry-prometheus --test integration_test -- prometheus_exporter_integration --exact --show-output

After investigation with git bisect here's the results

528e0a6c1403f7642a9fd6cfbaf3fd1d7ae8acab is the first bad commit
commit 528e0a6c1403f7642a9fd6cfbaf3fd1d7ae8acab
Author: Cijo Thomas <[email protected]>
Date:   Tue Aug 6 17:29:02 2024 -0700

    Metric refactor - 2x perf and allocation free (#1989)
    
    Co-authored-by: Utkarsh Umesan Pillai <[email protected]>

 opentelemetry-sdk/benches/metric_counter.rs   |   6 +-
 opentelemetry-sdk/src/metrics/internal/sum.rs | 177 ++++++++++++++------------
 opentelemetry-sdk/src/metrics/mod.rs          |  56 ++++++++
 stress/src/metrics_counter.rs                 |   3 +
 4 files changed, 160 insertions(+), 82 deletions(-)

Copy link
Member

@cijothomas cijothomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good. Thank you for your contirbution!

Requesting changes just to remove the Prometheus exporter changes from this PR. Please submit that as a separate PR, and we can get it in, and do a release of Prometheus as well.

@fraillt fraillt force-pushed the aggregation-selector-not-needed branch from 3fa603b to 299c1ba Compare September 11, 2024 13:01
@fraillt fraillt force-pushed the aggregation-selector-not-needed branch from 299c1ba to 0f016ce Compare September 11, 2024 19:48
Copy link
Member

@cijothomas cijothomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your contributions! Would love to see more contributions.

@cijothomas cijothomas merged commit 5873cae into open-telemetry:main Sep 11, 2024
24 of 25 checks passed
@fraillt fraillt deleted the aggregation-selector-not-needed branch September 12, 2024 05:05
@cijothomas cijothomas mentioned this pull request Sep 12, 2024
4 tasks
@domenicquirl
Copy link

domenicquirl commented Oct 15, 2024

Hi folks, I just had a miserable time upgrading our opentelemetry versions because we make use of custom histogram buckets and after bumping opentelemetry-otlp the AggregationSelector was just gone. I'm still severely confused as to how I am now and will be in the future supposed to do this, so I wanted to point out a few things I ran into in hopes it helps improving them:

This change made in into the changelog for opentelemetry-sdk (v0.26.0), but there is nothing in the corresponding entry for opentelemetry-otlp, even though this PR removed with_aggregation_selector from the OtlpMetricPipeline. This should probably be listed as a breaking change.

The changelog entry that does exist for opentelemetry-sdk just says that the same functionality "should" be achievable by using views. It does not say whether this is actually possible, nor how one would do so.

Views seem to be under-documented in general. While #1060 exists and examples were added for usage from opentelemetry-sdk in #1170, I was not able to find any official documentation on how to use views at all with opentelemetry-otlp. As far as I was able to find out, there is no way to actually do so directly through the OtlpMetricsPipeline returned from opentelemetry_otlp::new_pipeline().metrics(), which is how meter providers are constructed in the crate documentation. I did find #1241, but was again left wondering how I was supposed to make use of the advisory in this context.

In the end, I think I made it work without using an OtlpMetricsPipeline by constructing exporter, reader and provider individually following #1174 (comment), which is also out of date already but pointed me in the right direction.

When reading about views in #1849, I stumbled across #1849 (comment), which seems to indicate that this method of customization will stop working in upcoming releases, maybe being re-added in the future. You again suggest using #1241 instead, but at this point I am just lost as to what plugs in where.

Apologies for highjacking this PR for my comments, but it does feel like the OTLP crate got the short end of the stick with these changes. Putting aside my confusion, imo at minimum a changelog entry should be added for the breaking change in opentelemetry-otlp and it would be nice if that showed or linked to how to replace an aggregation selector with a view or however else is supposed to be the "correct" way of customizing things here going forward.

@fraillt
Copy link
Contributor Author

fraillt commented Oct 15, 2024

Hi,
I'm not deeply involved in all the details, but I think that Views are the only option at the moment...
The reason why AggregationSelector was removed by this PR, is because at some point it became basically useless. It changed signature from Descriptor -> Aggregator to InstrumentKind -> Aggregation.
IMHO, views will not be removed, (because there's no replacement for them yet), and there are at least two people (you and me) who needs this functionality. :) at the same time I agree, its not trivial to use them, (I actually found it easier to just go source code and figure out how they work...) so I think there will still be some changes.

@utpilla
Copy link
Contributor

utpilla commented Oct 15, 2024

@domenicquirl Apologies for the confusion! Since this was more of an API/SDK feature than OTLP exporter, we didn't update the Changelog for OTLP Exporter. However, as you pointed out, we could have done a better job at updating the SDK changelog.

The best way to do customize histogram buckets is through the with_boundaries Hint API. This API was added in #2135 and it is independent of the exporter being used. You can find the usage in the metrics-basic example:

// Create a Histogram Instrument.
let histogram = meter
.f64_histogram("my_histogram")
.with_description("My histogram example description")
// Setting boundaries is optional. By default, the boundaries are set to
// [0.0, 5.0, 10.0, 25.0, 50.0, 75.0, 100.0, 250.0, 500.0, 750.0, 1000.0, 2500.0, 5000.0, 7500.0, 10000.0]
.with_boundaries(vec![0.0, 5.0, 10.0, 15.0, 20.0, 25.0])
.init();

Just FYI Histogram Hint API only covers customization of buckets. Other aggregation customizations would still be covered by Views. We would be focusing on improving and fixing Views after the stable release.

@domenicquirl
Copy link

Thanks for following up! Very much appreciate your acknowledgement regarding the changelog.

The best way to do customize histogram buckets is through the with_boundaries Hint API. This API was added in #2135 and it is independent of the exporter being used. You can find the usage in the metrics-basic example:

@utpilla I've seen both the hint / advisory API and the example by now, but I'm actually using tracing_opentelemetry with opentelemetry_otlp, so there's two more layers between my code and anything from opentelemetry_sdk. I can set up a MeterProvider which I pass to the tracing layer, but it is the MetricsLayer which is responsible for creating the histograms and which would thus need to pass boundaries for them. At the moment, it doesn't, and there isn't really a way to hook into its event processing that I know of that would allow me to do this.

It might be that this is fine and tracing_opentelemetry just needs time to catch up. There's an open issue on the repo (tokio-rs/tracing-opentelemetry#108) about supporting custom histogram buckets using the advisory API, but it hasn't seen much action.

That said, I think even for users of just opentelemetry_otlp without tracing it would be nice if there was a way to achieve this via the OtlpMetricsPipeline. It's reasonably more complicated having to drop down to the individual components and it's not just required for histograms: the pipeline currently doesn't offer any functionality for views (perhaps because of the temporary removal that was considered?).

@cijothomas
Copy link
Member

@domenicquirl Thanks for taking time reporting the issue (very well written!).

Couple of follow ups:

  1. The OtlpMetricsPipeline has a questionable future, depending on OTLPExporter Pipeline issues #1810. Configuring SDK (i.e MeterProvider) should be done in an exporter-neutral way, but OtlpMetricsPipeline breaks that by putting too much responsibility to an exporter.
  2. Can you share your use-cases of needing customizing AggregationSelector. We are quite likely to remove View from 1.0 scope, but want to understand common use-cases for needing Views. (The most common use case I am seeing is the histogram buckets, then cardinality capping/dropping-attributes.)

@domenicquirl
Copy link

@cijothomas sure thing! We are only using the selector (or now a view that does the same thing) to customize histogram buckets, but this is pretty much required since the time intervals we're measuring are expected to be quite short. This is true in our application for both the things that are measured in seconds and the ones measured in ms. So while we could scale up the values in seconds to ms to better match the default buckets, those go like 0, 5, 10, and then up to 10000 which really doesn't fit small timescales.

To give a concrete example, we use https://crates.io/crates/tokio-metrics which tracks metrics like "time to first poll", "time to start running again after being woken up", and "time taken when polled" for tokio tasks. These measurements are often in the nanosecond range if everything is running smoothly and only go up into the default bucket range of milliseconds during high load. So for these metrics, 10ms (the third default bucket) is really on the upper end of the scale (it's our > 10th bucket currently).

Besides that, as I mentioned previously we are recording the metrics through tracing_opentelemetry, which currently does not expose a way to customize the meter provider / the created instruments. The proposed extension in tokio-rs/tracing-opentelemetry#108 would technically provide such a way, it seems like this would need to be configured on a per-histogram basis, which would be really tedious if you need to customize basically all of your histograms, as is currently the case for us.

Thus, I think our primary motivation here could be summed up by saying that customizing the AggregationSelector and now customizing the aggregation behaviour using views provides a mechanism to specify a default aggregation for a set of (possibly related) metrics in one place (either all of them or using a more granular filter that views allow you to write) instead of manually customizing N individual histograms. Then, secondarily, that it's currently the only way to do this at all, until third-party crates catch up with the advisory API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants