-
Notifications
You must be signed in to change notification settings - Fork 288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Off CPU profiling #196
Off CPU profiling #196
Conversation
@@ -4,14 +4,6 @@ | |||
#include "tracemgmt.h" | |||
#include "stackdeltatypes.h" | |||
|
|||
#ifndef __USER32_CS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This got moved to tracemgmt.h
to make it available to other entry points.
@@ -602,151 +594,6 @@ static ErrorCode unwind_one_frame(u64 pid, u32 frame_idx, struct UnwindState *st | |||
#error unsupported architecture | |||
#endif | |||
|
|||
// Initialize state from pt_regs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This got moved to tracemgmt.h
to make it available to other entry points.
support/ebpf/tracemgmt.h
Outdated
#endif // TESTING_COREDUMP | ||
|
||
static inline | ||
int collect_trace(struct pt_regs *ctx, TraceOrigin origin, u32 pid, u32 tid, u64 off_cpu_time) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With moving collect_trace()
to tracemgmt.h
its arguments changed. origin
was added to further keep track what triggered the stack unwinding. pid
and tid
also got added as argument, as the entry point for off CPU profiling does some filtering on these parameters, so it didn't make sense to call the helper for this information multiple times. And finally off_cpu_time
got added as argument, to forward the information for how long a trace was off CPU and store the information in the Trace
struct.
func initializeMapsAndPrograms(includeTracers types.IncludedTracers, | ||
kernelSymbols *libpf.SymbolMap, filterErrorFrames bool, mapScaleFactor int, | ||
kernelVersionCheck bool, debugTracer bool, bpfVerifierLogLevel uint32) ( | ||
func initializeMapsAndPrograms(kernelSymbols *libpf.SymbolMap, cfg *Config) ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With adding an additional argument from *Config
to initializeMapsAndPrograms()
it made sense to just pass *Config
rather than every single argument by itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll give my thought despite not being a maintainer.
support/ebpf/off_cpu.ebpf.c
Outdated
if (bpf_get_prandom_u32()%OFF_CPU_THRESHOLD_MAX > syscfg->off_cpu_threshold) { | ||
return 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the numbers will be statistically valid given this check. Shouldn't the probability of sampling be based on the amount of time spent? (Similar to how e.g. the probability of heap allocation samplers recording a stack is determined by the amount of bytes allocated).
Imagine if the off-CPU time of some process is dominated by one call to read
that hangs forever (e.g. due to networking misconfiguration). Unless you get lucky enough that you hit this threshold on that call, you will not see it reflected in the profile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imagine if the off-CPU time of some process is dominated by one call to read that hangs forever (e.g. due to networking misconfiguration). Unless you get lucky enough that you hit this threshold on that call, you will not see it reflected in the profile.
I think you describe a good example of the general risk of a sampling approach. In #144 a sampling approach is proposed, as handling every scheduling event in both hooks is a too high risk to overload the system. If the sampling should be done on the off-CPU time, then there is also a management overhead to correctly size the eBPF map that communicates the start of the off-CPU time to the end of the off-CPU time. So yes, using a sampling approach to drop some events in the first hook will miss out on some events.
But similar to regular sampling based on-CPU profiling, if something is misconfigured, then there will be not a single event but multiple ones and so the issue will be visible to the profiler. The same applies to off-CPU profiling, I think. Overall, sampling based profiling provides valuable insights into the hot paths of the system for (expected or unexpected) things that happen often. Sampling is not an approach that catches every event, otherwise it would qualify as debugging or security utility - but sampling does not satisfy this requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely agree that the feature must use sampling. However the strategy used for sampling makes a difference.
E.g., see here where a similar analysis was done for jemalloc: https://github.com/jemalloc/jemalloc/blob/2a693b83d2d1631b6a856d178125e1c47c12add9/doc_internal/PROFILING_INTERNALS.md#L4
The authors concluded that they needed to sample per-byte, rather than per-allocation, because sampling per-allocation increases the variance significantly in a scenario where the allocations can have significantly different sizes.
The corresponding approach here would be to sample per-nanosecond, rather than per-event.
The downside is then you would have to call bpf_ktime_get_ns
on the begin and end unconditionally (to know how long the task was parked for and then do your sampling logic), whereas now we skip calling bpf_ktime_get_ns
when the sampling doesn't hit. I'm not sure what the overhead of that would be.
I will try to put together a test based on your code to see how much this affects profiling variance in the real world.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the link to the research that got into the sampling strategy for jemalloc.
From my perspective, there are major differences in the purpose of sampling off CPU profiling vs jemalloc that lead to the differences in the sampling strategies.
jemalloc considers small allocations as “cheaper” whereas larger allocations result in more effort. For off CPU profiling the effort to take computing resources from a task and later reassign computing resources are always the same and the time a task is off CPU does not make a difference. So the sampling strategy of jemalloc takes the effort into account while for off CPU profiling the effort is constant.
The general event based sampling approach also comes with the advantage that it lets users predict the direct storage impact that off CPU profiling will have. The storage requirements are linearly correlated to configuration of the CLI flag -off-cpu-threshold
. Changing the sampling strategy will make the storage predictions more complex and harder.
We can also not make assumptions about the environment, where off CPU profiling will be used. Will the environment be dominated by short interrupts, with small off CPU times for tasks, or will there be more larger off CPU times, e.g. if spinning disks are used instead of other faster memory solutions. The general event based sampling approach does not make a difference between these two kinds.
Switching to sampling on the off CPU time will also introduce a management overhead. First eBPF maps need to be scaled larger, resulting in a larger memory requirement for the eBPF Profiling agent and secondly there will be more computing overhead. To make a fair sampling decision on the off CPU time we need to make sure, that every tasked that is handled by the scheduler fits into the eBPF maps. The additional computing overhead will be on calling bpf_ktime_get_ns
twice to get the off CPU time for each task before making a sampling decision on it, and more eBPF map updates and lookups to forward the off CPU time information from the first to the second hook.
With #144 a general sampling approach was accepted, which is implemented in this current state of code proposal.
Maybe @open-telemetry/profiling-maintainers can provide their views on the different kinds of sampling strategies and provide guidance moving on on this topic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed response. Your points about the differences in constraints between jemalloc and opentelemetry-ebpf-profiler are compelling.
I still think in the future it is worth testing with different sampling strategies to see how much it affects variance in common scenarios and then making a call whether that's worth the downsides (e.g. less predictable storage usage).
But you've convinced me that this is a good enough default that it's worth starting with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
friendly ping for feedback @felixge - or can this conversation be resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
friendly ping for feedback @felixge - or can this conversation be resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the sampling issue has been discussed before, I must reiterate that the current head-based sampling method has some drawbacks. These include challenges in ensuring the integrity of the sampling chain and the occasional problem of high latency in user localization. A client-based tail sampling approach might be a better-balanced solution. You can refer to this paper for more details: The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Benefit of Hindsight is definitely an interesting read on retroactive sampling in distributed systems. But still I believe we should proceed with the current head sampling approach in this PR.
The primary focus here is implementing off-CPU profiling as a technique and functionality also enabling other use cases like on-demand profiling data using kprobes, and the current implementation offers practical benefits:
- Simpler visualization and understanding of the sampling data
- Lower CPU overhead (personal benchmarks showed retroactive sampling increasing CPU usage by ~8% and memory footprint by ~500MB)
- More straightforward implementation with less complexity
While retroactive sampling has its merits, defining precise conditions for it in advance is challenging. The current head sampling provides a general approach and additional filtering during data ingestion is still possible.
Rather than blocking this PR on sampling strategy discussions, I suggest we:
Move forward with the current head sampling implementation and consider sampling improvements and experiments as follow-up enhancements. This way we can get the core off-CPU profiling functionality in place while keeping the door open for future sampling strategy optimizations. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support moving forward with this approach. We can add additional off-CPU strategies in the future. Since every strategy will have its pros and cons depending on the context, it is IMO more important to move on and debate async on different strategies.
Additional future strategies for collecting off-CPU data allows users to decide which one is best for their use case.
Rebased proposed changes on current main and resolved merge conflicts. |
} TraceOrigin; | ||
|
||
// OFF_CPU_THRESHOLD_MAX defines the maximum threshold. | ||
#define OFF_CPU_THRESHOLD_MAX 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be 100 which is intuitive percentage-wise and also consistent with the probabilistic profiling setting:
#define OFF_CPU_THRESHOLD_MAX 1000 | |
#define OFF_CPU_THRESHOLD_MAX 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1000 is chosen over 100 to provide a more fine grained possibility to configure the off-CPU threshold. On a single core system, using regular percentages makes sense, I think. But the more cores are in use, the more data will be generated. And so I think it makes sense to provide the option to reduce the storage and do off-CPU profiling only for a smaller amount than 1% of all scheduler task switches.
Let me know, if 1% of all scheduler task switches is the minimum that should be allowed to configure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to check if hitting on less than 1% of all scheduler task switches provides value given the current sampling strategy by running some test workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting the minimum threshold to 1% results in 1% or less CPU usage on my system so I don't think we need to go lower as 1-100 is more intuitive, but not a blocker.
This comment was marked as outdated.
This comment was marked as outdated.
reporter/otlp_reporter.go
Outdated
@@ -226,20 +240,25 @@ func (r *OTLPReporter) ReportTraceEvent(trace *libpf.Trace, meta *TraceEventMeta | |||
containerID: containerID, | |||
} | |||
|
|||
if events, exists := (*traceEventsMap)[key]; exists { | |||
traceEventsMap := r.traceEvents.WLock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to hold the lock while looking up the Cgroupv2 ID and creating the key. So I moved fetching the lock to this later point.
Needed to force-push to the branch to resolve merge conflicts. Looking forward on getting feedback. |
Needed to force-push to the branch to resolve merge conflicts. Looking forward on getting feedback. |
friendly ping @open-telemetry/ebpf-profiler-approvers for feedback. |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
43c3ebe
to
35f4293
Compare
To help review this PR and get feedback: Small C program to demo bottlenecks to showcase off CPU profiling: https://gist.github.com/florianl/65cb53eb519531e6d8ee40cf2991bbf6 |
- Introduce a first-class Stack message type and lookup table. - Replace location index range based stack trace encoding on Sample with a single stack_index reference. - Remove the location_indices lookup table. The primary motivation is laying the ground work for [timestamp based profiling][timestamp proposal] where the same stack trace needs to be referenced much more frequently compared to aggregation based on low cardinality attributes. Timestamp based profiling is also expected to be used with the the upcoming [Off-CPU profiling][off-cpu pr] feature in the eBPF profiler. Off-CPU stack traces have a different distribution compared to CPU samples. In particular stack traces are much more repetitive because they only occur at call sites such as syscalls. For the same reason it is also uncommon to see a stack trace are a root-prefix of a previously observed stack trace. We might need to revisit the previous [previous benchmarks][benchmarks] to confirm these claims. The secondary motivation is simplicitly. Arguably the proposed change here will make it easier to write exporters, processors as well as receivers. It seems like we had rough consensus around this change in previous SIG meetings, and it seems like a good incremental step to make progress on the timestamp proposal. [timestamp proposal]: #594 [off-cpu pr]: open-telemetry/opentelemetry-ebpf-profiler#196 [benchmarks]: https://docs.google.com/spreadsheets/d/1Q-6MlegV8xLYdz5WD5iPxQU2tsfodX1-CDV1WeGzyQ0/edit?gid=2069300294#gid=2069300294
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
open-telemetry#196 (comment) Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Needed to force-push to the branch to resolve merge conflicts. Looking forward on getting feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving to get this in as it's quite large and unwieldy. We can further improve things in future iterations.
} TraceOrigin; | ||
|
||
// OFF_CPU_THRESHOLD_MAX defines the maximum threshold. | ||
#define OFF_CPU_THRESHOLD_MAX 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting the minimum threshold to 1% results in 1% or less CPU usage on my system so I don't think we need to go lower as 1-100 is more intuitive, but not a blocker.
@@ -97,8 +98,11 @@ func (*baseReporter) ReportMetrics(_ uint32, _ []uint32, _ []int64) {} | |||
func (*baseReporter) SupportsReportTraceEvent() bool { return true } | |||
|
|||
func (b *baseReporter) ReportTraceEvent(trace *libpf.Trace, meta *TraceEventMeta) { | |||
traceEventsMap := b.traceEvents.WLock() | |||
defer b.traceEvents.WUnlock(&traceEventsMap) | |||
if meta.Origin != support.TraceOriginSampling && meta.Origin != support.TraceOriginOffCPU { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to propagate a trace that we're going to reject all the way here, but should instead fast-fail as early as possible (e.g. in tracer/events.go
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check in loadBpfTrace()
with 1c7724a to skip handling of traces of unknown origin as early as possible in the user space side.
Besides that, I'd like to keep this check in baseReporter
as it makes sure that traces are assigned correctly to their respective kind.
Signed-off-by: Florian Lehner <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code LGTM. Thanks a lot for looking into this in your spare time! I played around a bit with it locally and it seems to be working as expected.
I agree with @christos68k: this touches a lot of different components and I imagine that constantly rebasing it is very painful. I'm OK with merging this as-is, as an experimental, opt-in feature.
* Bump golangci-lint to 1.63.4 (open-telemetry#311) Author: Tim Rühsen <[email protected]> * CI: reduce boiler plate of environment (open-telemetry#312) * Consistent format of the eBPF code (open-telemetry#310) * Off CPU profiling (open-telemetry#196) * CI: Check for differences in the eBPF binary blobs (open-telemetry#305) * README: drop make target docker-image as pre requirement (open-telemetry#304) Signed-off-by: Florian Lehner <[email protected]> * Setup collector receiver (open-telemetry#160) Author: Damien Mathieu <[email protected]> Co-authored-by: Tim Rühsen <[email protected]> * Send process name as process.executable.name (/proc/PID/comm) (open-telemetry#298) Author: Christos Kalkanis <[email protected]> Co-authored-by: Tim Rühsen <[email protected]> * Remove host metadata collector (open-telemetry#299) Author: Damien Mathieu <[email protected]> * apmint: return error if Attach() fails (open-telemetry#271) * docker: build and push docker builder image (open-telemetry#297) Signed-off-by: Florian Lehner <[email protected]> * Update golang.org/x/net to v0.33.0 (Fix CVE-2024-45338) (open-telemetry#294) Author: Tim Rühsen <[email protected]> * metrics: update package documentation (open-telemetry#295) Signed-off-by: Florian Lehner <[email protected]> * Fix pdata function table order (open-telemetry#286) Author: Tolya Korniltsev <[email protected]> * Fix cross-compilation (open-telemetry#282) (open-telemetry#290) Author: Christos Kalkanis <[email protected]> * Use clang-17 for building tracers (open-telemetry#270) Author: Mathieu Bressolle-Chataigner <[email protected]> Co-authored-by: Co-authored-by: Christos Kalkanis <[email protected]> * Make the CollAgentAddr config optional in controller (open-telemetry#279) Author: Damien Mathieu <[email protected]> Co-authored-by: Christos Kalkanis <[email protected]> Co-authored-by: Florian Lehner <[email protected]> * Remove unused type libpf.TraceAndCounts (open-telemetry#283) Author: Tim Rühsen <[email protected]> * Collector Reporter (open-telemetry#208) Author: Damien Mathieu <[email protected]> Co-authored-by: Tim Rühsen <[email protected]> Co-authored-by: Florian Lehner <[email protected]> Co-authored-by: Christos Kalkanis <[email protected]> * proc: handle nvidia kernel modules (open-telemetry#274) * README: provide new version of devfiler (open-telemetry#275) Author: Florian Lehner <[email protected]> * Extract reporter data generation, and use pdata (open-telemetry#245) Author: Damien Mathieu <[email protected]> Co-authored-by: Florian Lehner <[email protected]> Co-authored-by: Christos Kalkanis <[email protected]> * apmint: warn user about changes in the APM service (open-telemetry#269) Author: Florian Lehner <[email protected]> Co-authored-by: Joel Höner <[email protected]> * coredump: no need for full dump with bundled files (open-telemetry#213) Author: Timo Teräs <[email protected]> * Send executable path with every stack trace (open-telemetry#253) * Send thread name as ThreadNameKey not ProcessCommandKey (open-telemetry#265) Author: Christos Kalkanis <[email protected]> * proc: Fix parsing Kernel Module lines where refcount is - (open-telemetry#263) Author: Frederic Branczyk <[email protected]> * reporter: extend lifetime for cached elements (open-telemetry#260) Author: Florian Lehner <[email protected]> * Reporter: allow extending samples with extra attributes (open-telemetry#237) Author: Joel Höner <[email protected]> Co-authored-by: Tim Rühsen <[email protected]>
This is the code that backs #144. It can be reused to add features like requested in #33 and therefore can be an alternative to #192.
The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts.
This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in #144.
UPDATE:
To help review this PR a special version of devfiler was created that can differentiate between on- and off-CPU profiling information.
One can fetch this special devfiler version via:
UPDATE2:
Small C program to demo bottlenecks to showcase off CPU profiling: https://gist.github.com/florianl/65cb53eb519531e6d8ee40cf2991bbf6