Allow dynamically loaded benchmarks from a shared object #43
Comments
I'd suggest sticking to C here for more flexibility, plus not having to deal with C++ name mangling. That said, how much API do you need? AFAICT all you would need is something like

```c
typedef struct UarchGroup_ UarchGroup;

void uarch_group_set_description(const char* name);

void uarch_group_add_bench(UarchGroup* group,
                           const char* id,
                           const char* name,
                           long (* func)(uint64_t iterations),
                           size_t ops_per_loop);
```

or am I missing something? Then you could just have a single public symbol:

```c
#if defined(__cplusplus)
extern "C"
#endif
void uarch_register_benches(UarchGroup* group) {
  uarch_group_set_description("Population Count");
  if (__builtin_cpu_supports("popcnt"))
    uarch_group_add_bench(group, "popcount-native", "POPCNT", psnip_popcount_builtin_native, 1);
  uarch_group_add_bench(group, "popcount-builtin", "__builtin_popcount", psnip_popcount_builtin, 1);
  uarch_group_add_bench(group, "popcount-table", "Table-based popcount", psnip_popcount_table, 1);
  uarch_group_add_bench(group, "popcount-twiddle", "Bit-twiddling popcount", psnip_popcount_twiddle, 1);
}
```

The API should be simple enough that versioning shouldn't really be an issue, but if it is you can always switch to something like … As for enumeration, all you need to do is run through all shared libraries in a directory. If they have a `uarch_register_benches` symbol, load the benchmarks from it. |
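The enumeration step described in this comment could be sketched roughly as follows. The `uarch_register_benches` entry point and `UarchGroup` type come from the snippet above; `load_benches` and everything else are hypothetical names for illustration, not the real uarch-bench API:

```cpp
// Sketch: scan a directory for .so files and, for each one that exports a
// uarch_register_benches symbol, call it to register its benchmarks.
#include <dirent.h>
#include <dlfcn.h>
#include <cstdio>
#include <cstring>

typedef struct UarchGroup_ UarchGroup;
typedef void (*register_fn)(UarchGroup*);

// Returns the number of plugins successfully registered.
int load_benches(const char* dir_path, UarchGroup* group) {
    int loaded = 0;
    DIR* dir = opendir(dir_path);
    if (!dir) return 0;
    struct dirent* entry;
    while ((entry = readdir(dir)) != nullptr) {
        const char* name = entry->d_name;
        size_t len = strlen(name);
        if (len < 3 || strcmp(name + len - 3, ".so") != 0)
            continue;  // only consider shared objects
        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", dir_path, name);
        void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
        if (!handle) continue;  // not loadable, skip
        // Plugins are identified by a single well-known entry point.
        register_fn reg = (register_fn)dlsym(handle, "uarch_register_benches");
        if (reg) {
            reg(group);
            ++loaded;
        } else {
            dlclose(handle);  // no entry point: not a benchmark plugin
        }
    }
    closedir(dir);
    return loaded;
}
```

On Linux this would link against `libdl` (`-ldl`); directories without any matching plugin simply register nothing.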
@nemequ - yeah, using a C API is a good idea to avoid the versioning and other pitfalls of a C++ API. Yeah, I think the API could look something like that. The problem is that even those components have changed a few times already, so I'm a bit reluctant to lock things down with an API - although I suppose I could say that we just don't support backwards compatibility. The other problem though is how the benchmarks are actually generated. See a typical benchmark file like … So a C API that just passes a … |
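The tension described here is that benchmarks are generated through C++ template machinery for inlining, while a C API wants plain function pointers. One way to reconcile the two: each template instantiation is still an ordinary function, so it can be exported through a C ABI. This is only a sketch under assumed names (`bench_loop`, `popcount_bench`), not uarch-bench's actual generation macros:

```cpp
#include <cstdint>

// A benchmark kernel passed as a template parameter, so that when this is
// compiled into the main binary the whole loop can be inlined with no
// per-iteration call overhead.
template <long (*KERNEL)(uint64_t)>
long bench_loop(uint64_t iterations) {
    long sink = 0;
    for (uint64_t i = 0; i < iterations; i++)
        sink += KERNEL(i);  // "sink" the result so the optimizer keeps the work
    return sink;
}

static long popcount_kernel(uint64_t x) {
    return __builtin_popcountll(x);
}

// Each instantiation is an ordinary function, so it can also be exposed
// through a plain C symbol suitable for dlsym and function-pointer dispatch.
extern "C" long popcount_bench(uint64_t iterations) {
    return bench_loop<popcount_kernel>(iterations);
}
```

The shared-object path then pays only the one outer indirect call, while the inner loop stays fully inlined inside the plugin.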
I haven't dug into those macros yet, but if you can hide all that stuff you'll have a much easier time of maintaining the API even as the underlying implementation changes. I guess you structured it that way to avoid the overhead of an extra function call?
Yeah, the idea is that with the template-based generation mechanism, everything can potentially be inlined into the innermost benchmark and we can avoid the overhead of function calls, or any other junk that might leak into the "measured region". In practice, I've written most of my benchmarks in assembly, separately compiled by nasm, so that always implies a function call anyway; the savings aren't as big as for a benchmark written in C++ that can be completely inlined (in that case, though, you have to be careful to "sink" the result so the optimizer doesn't defeat your benchmark).

So I think in practice, for the API you could just make an indirect call (i.e., a call through a function pointer), which isn't much worse than the non-indirect call we are making today for asm benchmarks. Yes, you'll suffer a branch misprediction at least the first time through, but we do a few warmup runs to stabilize these effects. We also try to remove the overheads through delta measurements, which is a whole separate and interesting topic.

Right now I have a new "mode", not yet committed, called "one shot", which is very different from the existing strategy of (1) doing warmup iterations and (2) generally measuring with many iterations and taking the min/median of the results. Instead, one shot calls the function exactly once in the measured region, and it may not have any iterations internally; the times/counters are reported, and this can be repeated a few times. So it can capture cold effects, you can see how the 1st try differs from the second, and you can potentially measure transient effects. I'm using this to actually dig more into some CPU uarch details. I mention it mostly because here I'm actually finding that the function call to asm code is problematic, so I introduced another mode where you can inline the measurement code directly in the asm code.
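The existing measurement strategy described here (warmup runs, then many timed iterations with the minimum taken, reached through a function pointer) might look roughly like this sketch. All names and constants are illustrative assumptions, not uarch-bench's real harness:

```cpp
#include <chrono>
#include <cstdint>

using bench_fn = long (*)(uint64_t iterations);

// Returns the minimum nanoseconds per iteration over `trials` timed runs.
double measure_min(bench_fn fn, uint64_t iterations, int warmups, int trials) {
    // Warmup: indirect-call mispredictions and other cold effects settle here,
    // outside the measured region.
    for (int i = 0; i < warmups; i++)
        fn(iterations);
    double best = 1e300;
    for (int i = 0; i < trials; i++) {
        auto start = std::chrono::steady_clock::now();
        fn(iterations);  // measured region: one indirect call, loop runs inside fn
        auto end = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(end - start).count();
        double per_iter = ns / (double)iterations;
        if (per_iter < best) best = per_iter;  // min filters out noise spikes
    }
    return best;
}

// Trivial example benchmark, convertible to bench_fn.
long example_bench(uint64_t iterations) {
    long sink = 0;
    for (uint64_t i = 0; i < iterations; i++)
        sink += (long)i;
    return sink;
}
```

The "one shot" mode described above would instead skip the warmup loop and report each trial's raw time individually rather than the minimum.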
That approach is one that could work for benchmarks in a shared object: make the person writing the benchmark inline the measurement code.
Rather than compiling in benchmarks, it would be cool to allow benchmarks to be dynamically loaded from a shared object, allowing decoupling of the benchmark application and default benchmarks from other benchmarks.
This would need at least the following:
- A way to load a shared object (`dlopen`/`dlsym` and friends) and enumerate the contained benchmarks.
- An API the loaded benchmarks can use to register themselves with the `uarch-bench` code.