Evaluation of brozzler's scalability? #250
Hi, I'm new to Brozzler/warcprox and noticed something that could cause a slowdown at scale. This networked, pipelined system does a lot of reads and writes: HTTP, WARC file I/O, and RethinkDB over TCP. This setup will eventually become I/O-bound at scale and hit GIL thrashing pretty quickly. Digging into the code, I see some thread pool executor parallelization, but it probably won't help much when scaling, and in the worst case it could cause race conditions or unexpected behavior. There is limited use of asyncio and the modern concurrency features in later versions of Python (in fact, only the warcprox benchmark script uses them fully). IMO this framework needs to step up with modern concurrent Python patterns and replace these blocking I/O touchpoints:
I realize this is a big effort across multiple repos, but it would be fairly straightforward to add. Adding full async support to Python codebases is a big lift compared to Node.js, which was built for these use cases.
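To make the blocking-versus-async point concrete, here is a minimal sketch using only the standard library. It is not Brozzler or warcprox code: `blocking_fetch`, `async_fetch`, `SIMULATED_LATENCY`, and the example URLs are hypothetical stand-ins, and the I/O is simulated with sleeps rather than real HTTP, WARC, or RethinkDB calls.

```python
import asyncio
import time

# Assumption: a fixed delay stands in for one HTTP / WARC / RethinkDB round trip.
SIMULATED_LATENCY = 0.05  # seconds


def blocking_fetch(url: str) -> str:
    """Hypothetical blocking I/O call: the calling thread waits the whole time."""
    time.sleep(SIMULATED_LATENCY)
    return f"fetched {url}"


async def async_fetch(url: str) -> str:
    """Hypothetical non-blocking I/O call: yields to the event loop while waiting."""
    await asyncio.sleep(SIMULATED_LATENCY)
    return f"fetched {url}"


def crawl_sequential(urls):
    """Fetch one URL after another; total time grows linearly with len(urls)."""
    return [blocking_fetch(u) for u in urls]


async def crawl_concurrent(urls, max_in_flight=10):
    """Fetch URLs concurrently on one thread, capped at max_in_flight at a time."""
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(url):
        async with sem:
            return await async_fetch(url)

    # gather preserves the order of the input URLs in its result list.
    return await asyncio.gather(*(bounded(u) for u in urls))


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    _, t_seq = timed(lambda: crawl_sequential(urls))
    _, t_con = timed(lambda: asyncio.run(crawl_concurrent(urls)))
    print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
```

With simulated waits the concurrent version finishes in roughly one latency period instead of ten. Real gains depend on how much of the pipeline is actually waiting on I/O rather than on CPU work.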
RethinkDB supports asyncio OOTB:
@goelayu are you running one Brozzler worker and trying to scale up the browser pool? Or multiple Brozzler workers? I don't work on Brozzler, but my understanding is that to scale things up, you are supposed to run multiple workers.
I am curious whether there is any data reporting how well brozzler scales as the number of parallel browsers increases.
In my current (very limited) test bed, brozzler takes an extremely long time to crawl web pages and store the corresponding resources.
Attaching some results from crawling 20 random web pages with brozzler with the headless Chrome browser enabled.
Scalability results
I also tracked all the system resource usage (CPU, network, disk). I am currently running this experiment on a 32-core Linux server with a 1 Gbps NIC, storing data on an underlying HDD with read/write throughput of 150-200 MB/s.
As you can see, none of these resources is saturated, and yet brozzler takes ~40-50 s on average to crawl and store a single page. Furthermore, the low CPU usage is extremely concerning, since in my experience increasing the number of parallel browsers linearly increases the overall CPU usage of the system. Could this be due to the proxy server used by brozzler?
Also, when I crawl the same corpus of pages using an extremely lightweight, custom, Node.js-based crawler (written on top of Puppeteer), it is about 10x faster than the timings observed above.