performance 2016 10 26

Initial semi-formal byteserver evaluation

The main goal of the byteserver is to provide a more (vertically) scalable ZEO server, mainly by using a fast language that isn't constrained by a GIL, and also by using object-level commit locks and better temporary-file management.

A secondary goal is to provide better performance even under light load.

For the initial implementation, we want to stick with the same basic architecture as existing ZEO servers:

  • Writing to and reading from a single file.
  • Asynchronous replication.

Assessing scalability of the server is complicated by the fact that it's easy to have results skewed by non-server-related factors such as:

  • Limited file I/O performance.
  • Saturated client CPU.
  • Saturated network connections.

For this reason, I separated clients and server and chose a server for high I/O capacity. The tests were run in AWS using EC2 instances in a single VPC and subnet.

The machines:

server
i2.8xlarge, 36 virtual cpus, 100K pystones/s, 244G ram, 10Gb network, instance ssd storage
client
m4.10xlarge, 40 virtual cpus, 140K pystones/s, 160G ram, 10Gb network, ping times to server 430 microseconds
big client
m4.16xlarge, 64 virtual cpus, 144K pystones/s, 256G ram, 20Gb network, ping times to server 400 microseconds
big client 2
m4.16xlarge, 64 virtual cpus, 143K pystones/s, 256G ram, 20Gb network, ping times to server 400 microseconds

I compared byteserver to ZEO. Clients ran Python 3.5 using uvloop. The ZEO server also used uvloop.

I performed several operations:

add
Add 3 objects in a transaction
update
Update 3 objects in a transaction
cold read
Read 3 objects per transaction
cold read with prefetch
Read 3 objects per transaction using prefetch

For each test, 1000 transactions were performed using the (non-big) client.
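To make the operations concrete, here's a rough sketch of what one transaction looks like in each test, assuming the ZEO 5 / ZODB 5 client API. The Record class, helper names, and the server address are illustrative, not taken from the actual benchmark script:

    # Hypothetical sketch of the per-transaction work, assuming the ZEO 5 /
    # ZODB 5 client API.  Names here are illustrative only.
    import persistent
    import transaction
    import ZEO

    class Record(persistent.Persistent):
        def __init__(self, data=b''):
            self.data = data

    db = ZEO.DB(('server-address', 8200))   # placeholder address
    conn = db.open()
    root = conn.root()

    def add(n):
        # "add": create 3 new objects in one transaction
        for i in range(3):
            root['rec-%s-%s' % (n, i)] = Record(b'x' * 100)
        transaction.commit()

    def update(keys):
        # "update": modify 3 existing objects in one transaction
        for key in keys[:3]:
            root[key].data = b'y' * 100
        transaction.commit()

    def cold_read(keys, prefetch=False):
        # "cold read": load 3 objects that aren't in the client cache.
        # With prefetch=True, ask the server for all 3 up front rather
        # than paying a separate round trip per object load.
        objs = [root[key] for key in keys[:3]]
        if prefetch:
            conn.prefetch(objs)             # ZODB 5 Connection.prefetch
        total = sum(len(o.data) for o in objs)
        transaction.abort()                 # read-only transaction
        return total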

Results shown are in microseconds per transaction. Columns are levels of concurrency. For concurrency > 1, the numbers are averages.

add

Server      c1    c2    c4    c8     c16    c32    c64
ZEO         5779  6483  8839  13397  24400  48930  105670
byteserver  5521  5580  5967  6144   8970   17804  40802

update

Server      c1    c2    c4    c8     c16    c32    c64
ZEO         5008  5188  8509  12529  23080  47730  107758
byteserver  4678  4917  5120  5845   8970   17667  39800

cold read

Server      c1    c2    c4    c8    c16   c32   c64
ZEO         3234  3366  3300  3440  4492  7500  13471
byteserver  3154  3092  3100  3112  3049  3503  4374

cold read with prefetch

Server      c1    c2    c4    c8    c16   c32   c64
ZEO         1640  1628  1618  1836  3031  5392  10380
byteserver  1367  1424  1463  1501  1583  1762  2297

With ZEO, write operations became more expensive as concurrency increased. For the byteserver, write times were relatively consistent up to a concurrency of 16, after which they increased linearly, likely because disk I/O was maxed out.

ZEO write performance degradation at low concurrency is likely due to the global commit lock.

It appears that ZEO write performance is limited by the global commit lock and the machine's disk I/O. A separate analysis of a ZEO server with experimental object-level commit locks showed server speed and the GIL to be the limiting factors (to the point that object-level commit locks provided no benefit).

The byteserver's write performance seems to be limited only by disk I/O.

ZEO read performance is relatively consistent up to a concurrency of 8, after which it degrades quickly. This appeared to be due to CPU saturation of the client and the server.

Byteserver read performance also degrades above a concurrency of 8, but much more slowly. This appeared to be due to CPU saturation of the client; the server was more or less idle.

Because reads are the most common operations, and because they shouldn't be as limited by disk I/O, I explored how hard I could stress ZEO and byteserver servers with read operations.

I first looked at byteserver. Using a single client, it appeared that I was saturating the network connection between the client and server. I added the 2 big clients and found that I started to saturate the network at around 24 clients per client machine (72 clients). At this rate, the CPU load on the server was high, but it wasn't pegged. 96 client processes pegged one of the server CPUs.

What's odd is that with the byteserver, load was spread over multiple CPUs, but at very high read load, it tended to be concentrated on one of the CPUs. This is a bit of a mystery. The only explanation I can think of offhand is that reads have to look up file locations in an index that's protected by a lock, so perhaps at very high loads, the index serializes requests and becomes a limiting factor. This is worth exploring later.

For ZEO, I was able to peg the server CPU (on a single core because of the GIL) with 10 concurrent clients, and %CPU was in the 90s with 8 clients.

Byteserver is much more scalable than ZEO servers. Write capacity is limited by the basic architecture (single file being written) and underlying disk I/O. Object-level locks provide a win until disk I/O is saturated.

For reads, byteserver scaled to more than 70 clients before request times started to degrade, while the ZEO server became saturated at 10 clients.

Note that at low concurrency, while byteserver was faster, request times were dominated by network round trip times, so requests weren't that much faster.

The existing byteserver implementation is very basic (and incomplete). It's likely that there's a lot of room to push performance further in the future.

Oct 30 2016

At the Plone conference, I'd noted that some tests with PyPy suggested that it sped things up by 2x. This was based on a conversation I had with Jason Madden last year. This was with respect to ZEO4. I looked back at those results and the speedups were for writes, not reads. With some help from Jason, I was able to reproduce those numbers. Note, however, that the numbers were for a client talking to a server on the same machine.

I re-ran the earlier analysis to include PyPy and, along the way, ran into some issues.

  • When I performed the original analyses, I underestimated the danger of saturating network connections between clients and servers in AWS, especially for reads. When redoing read tests, I used 5 ZEO client machines, which were m4.10xlarge machines. With these machines, it appeared I was just beginning to saturate the network at ~16 ZEO clients on a machine. The next larger instance could do a little bit more, but was quite a bit more expensive. I used 5 machines because asking for more would have required asking for an EC2 limit increase and I didn't want to wait. :)
  • In my earlier analysis, I didn't notice that client machines were created in a different availability zone, which made network connections much slower. This time, I was careful to make sure each client was in the same AZ as the server. As a result, ping times were .130 milliseconds, rather than ~.430.
  • I also upped my game in the benchmark script by having it save data files rather than reports, which in turn allows me to summarize results as graphs rather than tables; a rough sketch of this follows the list.
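A minimal sketch of the data-files-plus-graphs idea; the file layout and field names here are my own, not the actual script's:

    # Each benchmark run appends one JSON line; a separate step turns the
    # accumulated data into a graph.  Hypothetical field names.
    import collections
    import json

    def record_run(path, server, concurrency, usecs_per_txn):
        with open(path, 'a') as f:
            f.write(json.dumps(dict(server=server,
                                    concurrency=concurrency,
                                    usecs=usecs_per_txn)) + '\n')

    def plot(path, outfile='performance.png'):
        import matplotlib.pyplot as plt
        series = collections.defaultdict(list)
        with open(path) as f:
            for line in f:
                run = json.loads(line)
                series[run['server']].append((run['concurrency'], run['usecs']))
        for server, points in sorted(series.items()):
            points.sort()
            plt.plot([c for c, _ in points], [u for _, u in points], label=server)
        plt.xlabel('concurrent clients')
        plt.ylabel('microseconds / transaction')
        plt.legend()
        plt.savefig(outfile)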

I ran the analysis with Python 2.7 rather than 3.5 because PyPy doesn't support Python 3.4 yet. PyPy was used on the server, not the client. The clients were the same across servers, except that the byteserver client used msgpack rather than pickle for network serialization.
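For illustration, here are the two serializers applied to a simple request-like tuple; this is not the actual ZEO or byteserver message layout, just the encodings side by side:

    # Illustrative only: comparing the two wire encodings for a
    # request-like tuple; not the actual protocol message format.
    import pickle
    import msgpack

    message = (1, 'loadBefore', [b'\0' * 8, b'\xff' * 8])
    as_pickle = pickle.dumps(message, 2)
    as_msgpack = msgpack.packb(message, use_bin_type=True)
    print(len(as_pickle), len(as_msgpack))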

[graph: zeo-server-2.7-performance.png]

Some things to note:

  • PyPy is a big win for reads over CPython.
  • byteserver is still much faster than ZEO on PyPy.
  • PyPy is slower than CPython for writes. This is surprising.

Watching htop on the server, I saw the server process hit 100% CPU for CPython at around 8 concurrent read clients. For PyPy, the CPU was pegged at around 16 clients.

For byteserver, up until around 60 read clients, CPU percentages were low and spread over multiple CPUs. Above around 60 clients, one of the CPUs started to load up and became pegged at around 80 clients. My initial guess was that this was due to clients needing to synchronize to gain access to the index. I tried splitting the index into separate indexes by oid, but this made no difference. (It's possible that I screwed this up somehow, as it was a quick hack.) So the concentration of computation in a single processor remains a mystery.
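To make the sharding experiment concrete, here's the shape of the change in a conceptual Python sketch (the actual byteserver index is Rust, not Python; shard count and names are mine):

    # Conceptual sketch only: replace one lock-protected oid -> file-offset
    # index with N shards, each with its own lock, so concurrent readers
    # don't all contend on a single lock.
    import threading

    N_SHARDS = 16

    class ShardedIndex:

        def __init__(self):
            self._shards = [({}, threading.Lock()) for _ in range(N_SHARDS)]

        def _shard(self, oid):
            # oid is an 8-byte key; route it by its low-order byte
            return self._shards[ord(oid[-1:]) % N_SHARDS]

        def set(self, oid, offset):
            index, lock = self._shard(oid)
            with lock:
                index[oid] = offset

        def get(self, oid):
            index, lock = self._shard(oid)
            with lock:
                return index.get(oid)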