-
-
Notifications
You must be signed in to change notification settings - Fork 348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce performance measurements #604
Comments
Yeah, ok, you're right :-). I've had some bad experiences with benchmark-driven development before, and it's going to be difficult to figure out what benchmarks are actually useful, but speed is important and data is valuable. I would strongly prefer "macrobenchmarks" whenever possible, i.e. let's serve some webpages or transfer some large files or something that at least looks like solving a real problem. The two tools I know of for tracking benchmarks over time are airspeed velocity (used by e.g. SciPy) and codespeed (used by e.g. speed.python.org). There may be others. I haven't used either seriously myself. My impression is that asv is quicker to get started with, though for serious usage then any solution will need stable infrastructure to do the actual benchmark running. (Travis is not great for this.) Benchmarks should certainly live in the python-trio/ org. I don't know whether it's better to put them in the trio repo or make a new repo – there are technical tradeoffs because you generally want to run benchmarks across multiple versions of trio. Whatever tool we use may also have a preference. It would be nice if this infrastructure could be somehow shared or reused across projects. Maybe we want to track the speed of trio-asyncio? Both because we care about trio-asyncio, and because it's a way to get macrobenchmarks for trio itself. |
A Trio benchmark suite would probably want to include some thread benchmarks too, e.g. as discussed in #709. |
The discussion in this discourse thread convinced me that we have a fair amount of low-hanging fruit that even some simple microbenchmarks could help us flush out. (E.g., I'm pretty sure we're spending way too much time calling So I've been thinking about what could be a useful "first benchmark" for trio core stuff. Some desiderata:
So what I'm thinking is: an echo server benchmark client:
Possible further extensions:
This is a good talk on pitfalls with measuring latency: https://www.azul.com/presentation/how-not-to-measure-latency-4/ It seems like it would be difficult to implement this using Python, because the work the client has to do is similar to what the server has to do (or even more once you consider all the statistics recording). Probably we need to use some fast compiled language. Rust or Go are the most obvious options. Swift is plausible too. Not C. |
I think, complexity of implementation highly depends on acceptable complexity of setup. Not that I teach you how to benchmark, but let me share what I would do. Taking into account how complex is analysis you want to get, I do not think writing anything new in fast language is a good plan. There are lots of ways to make a mistake. I'd rather get some existing load test tool. Tool should be fast enough, but not smart. I am not sure about how complex are your desired load test scenarios, so can't recommend anything. There a lot's of lists of such tools, I'll post one for completeness https://gist.github.com/denji/8333630 Then I'd run N copies of that tool to make sure client side is not bottleneck. What should be N? For Python server and C client two copies are enough I think, but you may try three. I mean ten is too many, clients will be highly underloaded, but make sure client is never more than 50% loaded. For Python-Python I'd go with at least five. Also, you mentioned multiplatforming, so I'll make a quick Windows related comment. In Windows local connections are processed in different optimized way, so clients should not be local to get realistic numbers. Which means clients can be non-Windows. So, what I would do for analysis is to run tcpdump on client side and write a tool (in Python?) to process multiple tcpdump files from all clients afterwards to calculate all kinds of interesting numbers. Probably writing dump to tmpfs may give better performance. Given a dump file you can measure all kinds of latency observed by client, number of packets send, etc. See all effects of low level TCP optimizations. If you have graphite/carbon or similar tool installed and if you are familiar with this tool, then you probably can use it too. Speaking of carbon, network protocol is just a UDP packet with text of "metric-path metric-value metric-timestamp" format. Note metric-timestamp field. You can recover metrics from dump, send them to carbon and get nice graphs for free. |
I've looked at some lists like this, but all the ones I've found are focused on HTTP load testing. Which is certainly interesting, but it means that we'll largely be benchmarking our code for parsing and generating HTTP. I thought an echo server benchmarker would be nicer for isolating details of fairness, task scheduling, etc. But I haven't found any good pre-existing echo server benchmarking tools :-/.
That's an interesting idea! It does sound pretty complex to set up and run in a repeatable way – I'm not sure building a tool to do all this would be easier than writing the tool in Rust or something directly? There's also a subtlety with measuring latency on persistent connections, where you don't necessarily want to measure the latency from the first packet hitting the wire. This is discussed in the "how not to measure latency" talk I linked above. What happens is for a lot of HTTP benchmarkers in particular, each connection executes a loop like: while True:
start_time = now()
send_request(conn)
read_response(conn)
request_latency = now() - start_time Then the problem is that if the server is really slow to respond one time, then the client sits in I'm not sure how important this is for the kind of benchmarking I'm talking about with an echo server, but it's an interesting thing to think about. |
From one side HTTP is dominating protocol and what most people are talking about, no? I hardly remember anyone discussing SMTP load testing. From another side, yeah, parsing HTTP is not very fast, but is it very realistic. What's the point of benchmark without any protocol parsing? Higher numbers, but very synthetic conditions not correlating to any real life load. If you want to focus on network I/O and not parse HTTP in Python, then try httptools library. I use it, it is written in C/Cython and is pretty fast.
I totally agree that setup is relatively complex and relatively hardly repeatable. Relative to single command line. I'm pretty sure ugly bash script may do it, if you stick to particular Linux distro. Major advantage for me is that Trio is Python library and if you write your client in Rust, then probably fewer people will even understand how it works, to say nothing about actually helping. And if you postprocess with Python it seems more friendly to Python community you already have. You may post huge dumps taken in well-known controlled environment for people to review them, make decisions, even find bugs. Maybe I'm too optimistic, but I have good experience with this approach. I'd parse some dumps probably :-D. Also, it lets you gather new stats from old experiments.
Yes, and that's why I'd stick to existing async/threaded tools, which already address this and many other catches; and run multiple clients anyway, just in case.
So, if I got you correctly, just checking. Let's say we want to send one request per two second, five requests total. And usually request takes one second to process. So we:
And here suddenly server freezes for 3 seconds. So
With schedule (app view) average latency is 2.2s, without schedule (tcpdump view) average latency is 1.6s. Now, how realistic is this? What real applications need to send network requests (polling?) at specific intervals and suffer from latency. Web browsers? - no. |
The point is to isolate send+receive from protocol handling, so that we get a clearer picture of exactly which areas are affected when we have a regression. This is not about real-life load. Real-life loads are disparate enough, your real-life load might not correspond to mine in any meaningful way.
Forget the "specific intervals" part, that's an artefact of this being just one possible way of doing such a test. However, there is a material difference between "one request is stalled for 3sec but the server is otherwise responsive" and "the server is blocked and requests queue up". We want to design our tests so that we actually see that difference in the results: a lone 3-second spike is indicative of the former while a spike with a long(ish) trail of lower-latency requests behind it probably points to the latter. One possible application where this would matter: a file server. Or a streaming media server. Or, yes, games. Sure it's bidirectional with pub/sub, but so what? you still want an indication that the opposing RPC has been blown up immediately (or as immediate as feasible) after firing your BFG. |
I tend to group benchmarks into two categories: ones that focus on answering specific questions about specific known components of a system under precisely controlled conditions ("microbenchmarks"), and ones that try to estimate actual end-to-end performance in a specific real scenario ("macrobenchmarks"). It's very similar to the distinction between white box or "unit" testing versus black box or "integration" testing. It's not a perfect split, benchmarking is about compromises, but it's a useful one. The echo server benchmark idea is on the microbenchmark side – you're right that it doesn't correspond to any real scenario, but that's not the goal. The goal is to understand the performance characteristics of Trio's scheduling and socket layer, because the better we can understand them, the better job we can do optimizing them for all kinds of workloads. My concern about wedging some httptools kinda parser into a benchmark like this is that it doesn't seem like it fits well into either category. We don't care about understanding this specific parser's performance characteristics, so it's not helping us microbenchmark. But it also doesn't produce a realistic end-to-end macrobenchmark – for that we'd need to use a real HTTP server like Quart-Trio, and ideally a real web application and a realistic mix of request types.
If we did go the Rust/Go route, then both those languages make it very easy to distribute self-contained binaries that people can download and run on whatever system they have. And probably we'd make the client itself dump large quantities of raw-ish data, and then do any detailed analysis in Python afterwards, because you're right that just makes more sense :-)
[Note: I realized that the link to the Azul systems talk in my post above was pointing to the wrong version of the talk, that was missing most of the stuff I'm talking about here! I updated the link to point to the right version.] Well, here's why people started worrying about this: they noticed that for most popular benchmarking tools, if you run say a 10 second test, and your server glitches out so that it's completely frozen and non-responsive for 8 seconds of the test, then... the benchmarking tools say "99.99% of the time, the latency was under 1 ms". Which makes no sense at all. So the problem seems real. I'm not sure whether their proposed solution is the best one. I think the root issue is: we mostly care about how our HTTP server performs when there are lots of different clients each trying to make a small number of requests. But most benchmarking tools aren't set up to create lots of independent, simultaneous connections. Instead they try to fake it, by having a few connections, and then sending lots of requests on each connection. This works OK in some cases, but it really falls down when measuring tail latencies. Because with the small-connections-with-lots-of-requests approach, when one request exhibits a really long latency, then the benchmark tool just stops issuing requests entirely for a while. But this is unrealistic, because in a real system with lots of independent clients, if you're slow to respond to one client, that doesn't stop requests arriving from other clients. I think another way to solve the problem is for the benchmarking client to actually create lots and lots of independent connections. |
This is because describing everything with one number is what people on It often do, but also wrong; and percentile thing is just easy to understand oversimplification. Whatever I test, I get much more usable (and by usable I mean decision making) results with classic average and standard deviation. And laggish freezing performance significantly increases standard deviation. So since I learned this, I tell a range of [μ - 3σ, μ + 3σ] and not average. If and only if σ is orders of magnitude smaller than μ, average makes any sense. So even with some assumption of distribution it should be at least two numbers. Reference material: https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule If you're not familiar with concept, just try playing with AVERAGE and STDDEV.P in Excel or Google Sheets, you will see that you will not able to simulate unnoticeable lag. |
IME most good benchmarking tools report results as a set of quantiles or a histogram, which tells you much more about the shape of the distribution than mean/stddev. The 68-95-99.7 rule is definitely not applicable here – it only applies to data that follows a gaussian distribution, and latency distributions are extremely far from being gaussian. The idea of using quantiles/histograms makes sense; the problem is that they're being computed from bad data. Which isn't to say mean/stddev are useless. And it's true that in this case mean/stddev would show an anomaly, because they're extremely sensitive to small numbers of outliers. But they'll still wildly underestimate the true mean/stddev of your application's latency, because they're being computed from bad data too. For example, lets say our tool does 1000 requests in the first second, then gets stuck for 8 seconds, then does 1000 more requests. The regular requests take 1 ms, and during the stuck period it does one request that takes 8 seconds. Then our mean latency is 5 ms, with stddev of ~180 ms:
OTOH if our tool had kept up the 1000 requests/second across the whole 10 second measuring period, then we would have gotten a mean of ~3.2 seconds, and stddev of ~2.6 seconds:
So the results are still wildly incorrect. |
Histograms are too much information to me. More confusing, than helping. If I know that one option gives me 5000 on average and another option gives me 7000, that is oversimplification and I demand more information. If I know that one option gives me 4000-6000 and another option gives me 3000-10000, I have enough information to make a decision. I can think if I prioritize average or worst case. I can think if I will be able to utilize that lucky moments when performance is the best or not. This information is succinct, but also usable. If I see two histograms, I honestly get a drink, because comparing hundreds of numbers represented with coorditates and colors is not my thing. You are right about distribution, but hey, if you get mean of 5 with stddev of 180, it is red light. If stddev is not two orders of magnitude less than mean, I become highly suspicious. Either number of measurements is not enough, or system is so unstable it can't be helped (hello AWS EBS I/O). So while you are totally right, I'd prefer simplified, but not oversimplified model, to have something I can compare. |
I'm a fan of instrumenting portions of the user's actual code via means as lightweight as possible to get performance information. Mostly thinking about CPU performance here, which is quite relevant given that Python is effectively constrained to a single core. Sampling profilers perform well but are not useful for finding execution times, let alone tail durations-- and any real code is going to have multiple complex paths so having a histogram is important. I already have a little I have a fast, streaming, approximate histogram implementation (observation takes ~2 usec using numba implementation on my old laptop) which makes no assumption about the input distribution (i.e. no fretting about bin configuration). I've been meaning to open source it. |
Along the lines of
|
@belm0 Unfortunately, the only way to compute per-task CPU time (with or without child tasks) is to call I'd also be curious to hear more about what you would use this for... a sampling profiler is a good way to see where your CPU time hotspots are in general, and a regular wall-clock timer is good for measuring latency, but I don't (currently) know any reason to measure, e.g., tail CPU time. |
I think using I do use a sampling profiler, but it's a blunt tool. It will point out gross inefficiencies, but it's not going to help you with a death by 1,000 cuts situation. Specifically, it's not going to help you decide if an algorithm tweak or new implementation of some often-called small function is a win. That's going to be significant if we want guidance on optimizing Trio's scheduling internals. This is why staring at a sampling profiler for days yielded no useful suggestions for Trio optimizations (#943). I've found it very useful to measure execution time of a block of code of interest-- over the full session of an application, encompassing all inputs and their effect on code paths (and hence needing a histogram). It's a very sensitive measurement, immune to the fickleness of sampling, and lets you clearly see effects of program changes at any level in the instrumented code's scope. |
Here's a stab at a task timing function. It's based on perf_counter() rather than thread_time(), as the former is about 10x faster. def task_perf_counter():
"""
For the current Trio task, return the value (in fractional seconds) of a
performance counter, i.e. a clock with the highest available resolution to
measure a short duration. It includes time elapsed during time.sleep,
but not trio.sleep. The reference point of the returned value is
undefined, so that only the difference between the results of consecutive
calls is valid.
""" https://gist.github.com/belm0/4fdb51ce04c5c26f01d58913ae5c43da |
I put together a small HTTP server benchmark for asyncio, uvloop and trio: https://paste.zi.fi/p/ahbench.py/view Results on Macbook with i7-4870HQ: https://paste.zi.fi/p/ahbench-results.txt/view |
@Tronic huh, neat! It's obviously a very narrow/specific microbenchmark, but those results are pretty promising as a start (the trio socket-level implementation beats the stdlib on throughput!), and it already raises some interesting questions:
|
@Tronic would you be okay with stashing those somewhere more permanent, so people reading this thread later will know what we're talking about? |
@njsmith There you go https://github.com/Tronic/async-http-benchmark |
Interesting find about Trio latencies. I ran the tests again and it isn't just a random hiccup. I guess some tracing would be needed to find out whether it is unfair scheduling or something more severe that is causing it. Another interesting bit that one encounters implementing the same program against multiple async I/O libraries, is that from the outside and for the main logic they seem quite similar. The devil is in the details: proper cancellation, not getting stuck on some event that never comes, and sockets that Just Do The Right Thing, and this is where Trio really shines. I've recently worked with all three asyncio socket APIs, as well as curio and trio. Every problem I've encountered with the other modules seems to be taken care of in Trio, and usually briefly mentioned in its documentation or source code comments. Amazing work! |
It would be useful to show median latency given that variance. |
It turns out that Profiling summary at sanic-org/sanic#1662 (comment) I did not look too closely into what exactly There is also considerable overhead in high level streams, making it worth a look if there are any easy ways to make them faster but not less correct. The profiling was based on plain TCP streams that still have fairly simple logic, not SSL. The good thing is that, to the degree that cProfile could tell me, the core scheduling of Trio is not slowing things down. |
Interesting results! Did you try using a sampling profiler such as https://github.com/benfred/py-spy to make sure that cProfile isn't skewing your results too much? |
Nope but reducing the number of deadline-setting points per request led to clear improvement in benchmark scores, so the profiling results seem accurate in that regard, and the results didn't differ too much with and without profiler. I suppose that one could not use Trio deadlines and instead work this around in application level, having the requests update "last activity" timestamps whenever they do something (reading the clock is quite fast), and running a background task that sleeps 10 s and wakes up periodically to check if other tasks are past their deadlines, asking Trio to cancel those that are expired. But then, this would be reimplementing specifically the kind of thing that Trio is supposed to handle. |
It would be nice to see detailed profiles, so we could tell what part of the deadline code was actually taking up time :-).
Just to check, are you on trio 0.12? that rewrote a bunch of the machinery for adding/removing CancelScopes to make it more efficient. And maybe there are still some unnecessary inefficiencies we haven't noticed because we haven't looked at the right profile. Beyond that – deadline tracking is notorious source of bottlenecks in servers, so I guess it's not a surprise if it shows up here. The problem is that you need to be able to efficiently find expired deadlines, which requires some kind of sorted or heap-like structure, but you also need to be able to efficiently remove deadlines, which requires some kind of random access. It's very difficult to build a data structure that's super-fast at both of these things at once, and very easy to build one that's accidentally super-inefficient in some corner cases. So there are tons of different designs people use – search for e.g. "timer wheel" to see some examples. Right now, Trio uses a very simple-minded approach, by storing pending deadlines in a One simple approach that might work well would be: Use the stdlib It's also possible that the So there's definitely stuff we can do here... is this something you're interested in working on? I think the first step would be to make a PR with some benchmarks. I can speculate about random ideas for speeding things up, but as you can see it's a pretty complicated topic, so we definitely can't predict ahead of time what will work best. Reproducible benchmarks are great because they let us try stuff and see what works, and compare our results. |
It'll take a while (estimate two weeks) until I can get back to this but I'll see if I can optimize it and make a PR. This week I'm at movie shootings with no access to the machine where I did the profiling, but even the summary shows that too much time is spent both inside of |
PR #1234 |
Excuse me, maybe is not useful for your case, but do you know pyperf? It's used by py devs too. |
I accidentally nerd-sniped my friend @nelhage with @Tronic's HTTP benchmarks that he pasted up-thread (and in particular the nasty latency distribution, which was the part that bothered me the most). It turns out that most of the latency issue was an artifact of the benchmark setup: Tronic/async-http-benchmark#1 In the updated numbers there's definitely still some room to improve Trio's throughput and latency, but now it looks like the kind of thing I'd expect and that we can handle through regular profiling/optimization, not some mysterious totally different latency distribution. |
Thanks. The benchmark was originally intended only for throughput but since you found the latency results interesting as well, it's good to have that fixed (PR merged). |
plugging my perf-timer tool in case it could be useful somewhere
|
Just chiming in to express an interest in simple benchmarks of nursery creation/deletion overhead. |
any updates in such an area? |
Obligatory background: we all agree (or at least nodded our heads in agreement when this was mentioned many times in the past) that Trio has a number of priorities over speed of execution. Optimizing for speed or any other performance metric should never hinder simplicity, "humanness", correctness, maintainability, predictability, security, debuggability, or other design principles. Premature optimization might be the root of some evil, and focusing on efficiency carries the risk of getting in the way of innovation. Even just identifying what is performance critical takes a lot of effort and is a moving target.
Given all of this, I believe we should provide some tools to make sure that Trio doesn't add too much overhead and allows efficient user code to run efficiently. This is more complicated than "create a bunch of benchmarks" but I want to start by creating a bunch of benchmarks, at least to start thinking about this. I had found myself caring about performance in a few cases and I want to do more work in that direction. It's relevant to me and it might be to someone else so it would be nice to share.
When I say a bunch of benchmarks I think of something that should reproduce a realistic workload for a Trio application. I don't think we are in a position to actually understand what a "realistic workload" is. I have a couple examples that I can derive from things I do for my paid job, and we can think of some more more (including something from projects you are working on, porting asyncio benchmarks and examples, adding trio-asyncio examples, etc). We can also produce some microbenchmarks. Also "performance" is a broad term that can mean many different things. In an I/O application it can mean throughput, latency, bandwidth usage, fairness, CPU impact, memory impact, or anything I did not think about. Covering all of these aspects can be hard (although sometimes some is a good proxy for others, e.g. a microbenchmark for execution time of some operation may provide hints about CPU, latency and throughput).
In terms of process, I don't think that performance measurements should get in the way of approving a PR or anything like that. Yeah, it can be nice to have comparisons but, really, I don't think that we are anywhere close to ready to make performance part of the development process. Benchmarks are lies that can have a mild informative value that is often hard to interpret. Still, like a good test suite, a benchmark suite can sometimes make it obvious that something is a regression, or an improvement, and it can be improved over time to provide better coverage of real-world use cases.
Anything we can learn from other projects? Tools (benchmark runners and the such) we can borrow? General feedback?
@njsmith would you see the
trio
repo as the space for this or would it better live in a different project?The text was updated successfully, but these errors were encountered: