Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics SDK improvements #1740

Open
cijothomas opened this issue May 11, 2024 · 1 comment
Open

Metrics SDK improvements #1740

cijothomas opened this issue May 11, 2024 · 1 comment
Assignees

Comments

@cijothomas
Copy link
Member

cijothomas commented May 11, 2024

Opening a parent issue to track Metrics SDK improvements for Stable release.

Background
The primary function of the Metrics SDK is to accept a number <T> along with a slice of KeyValue pairs <T>, &[KeyValue], aggregating these measurements in memory and exporting the aggregated values to Readers/Exporters as needed. Our main goals are ensuring correctness, thread-safety, memory-efficiency and high performance, particularly on the "hot path" where measurement reporting occurs, as this demands the utmost efficiency. Correctness/thread-safety/memory-efficiency requires extensive testing via unit tests and stress testing.

Performance issues

  1. Cloning and allocation on hot path - A significant portion of the overhead involves cloning the incoming slice to prepare AttributeSet. We can avoid this in many cases by using a thread-local Vec, which would reduce memory allocations, but still requires copy. Copy as well can be avoided by carefully designing temp data structures to hold references only.
  2. Sorting of the Keys - This is a "identity" requirement, so cannot be avoided entirely. However, it is possible to avoid this in the common case, by storing both sorted and original orders for quick lookups.
  3. De-deduplication of Attributes with same Keys - This is another "identify" requirement, but similar idea as above can be used to avoid this in the common case.
  4. Contention - The use of a Mutex around the HashMap for aggregations leads to heavy contention. Replacing it with a RwLock and applying interior mutability could lessen this issue, though sharding may be necessary for further scalability improvements as demonstrated here.
  5. Memory efficiency, mostly affecting delta - Metrics - Delta aggregation should not export unless new measurements are reported in current cycle #1598 . This is relatively easy to fix.
  6. Memory issue - Memory optimization for metric datapoints #1566. This is also relatively easy to fix.

Some issues like calculating hash outside of lock, special casing 0-attributes etc. were addressed already. Also, a lot of ideas were discussed in the past (Community meetings, PRs, issues). I have attempted prototyping several of them here: https://github.com/cijothomas/metrics-mini/tree/main/metrics/src. A lot of the issues from 1,2,3, part of 4 has been addressed in the prototype, giving huge performance improvements. I plan to incorporate them to this repo soon.

It is unlikely that we fix all performance issues for 1.0, but the goal is to ensure that the fixes can be continued even after 1.0 without any breaking changes. This requires trimming off unnecessary public APIs, and also to avoid exposing any internals to readers/exporters.

Correctness issues:

  1. Lack of adequate testing - The existing test suite does not sufficiently confirm the accuracy of aggregations. Although a few tests have been introduced to demonstrate known issues (see this and this), a lot more thorough testing is required.
  2. There are virtually no tests in multi-thread setup. While Rust compiler protects from some issues, it cannot ensure correctness in anyway, and those require carefully orchestrated tests.
  3. For memory efficiency tests also, stress tests should be leveraged.

Most of the correctness issues can fixed via better test coverage. One thing to note is that "Views" feature expands the testing matrix significantly due to its capability to alter aggregations/attributes and produce multiple metrics streams from a single measurement. This is the main reason to remove "Views" from the scope of 1st stable release.

@utpilla
Copy link
Contributor

utpilla commented Aug 24, 2024

Compared the benchmark to emit four attributes with OpenTelemetry .NET and we are comparable to .NET now:

Machine Information OS: Ubuntu 22.04.4 LTS (5.15.153.1-microsoft-standard-WSL2) Hardware: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz, 16vCPUs, RAM: 64.0 GB
Counter_Add_With_Four_Random_Attributes Average time
OpenTelemetry .NET 180.2 ns
OpenTelemetry Rust 186.41 ns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants