-
FWIW, a few things I'd consider:

I'm not sure how you're planning to implement prioritisation, and I think that will affect things. Heritrix 3 (H3) prioritises URLs as they go into the frontier, which is just a key-value DB with sorted keys where the key order is the crawl order. I think Frontera's strategy workers do the same. This is very efficient in terms of database structure and speed, but it makes it hard to change prioritisation; I guess the priority key would need to be part of your API in this case(?).

I expect Elasticsearch prioritisation would be query-based (which would also be natural for an SQL implementation, I suppose). That means we're now managing secondary indexes, but the frontier is more decoupled from the prioritisation code. TBH, at UKWA we've not had much luck running things that need secondary indexes at high volume and speed (for us, that means a frontier with a few billion URLs, downloading c. 1 billion URLs a month). But that might not be a problem with these tools 😄
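To make the key-ordered approach concrete, here is a minimal Java sketch (purely illustrative, not code from any of these projects) of baking the priority into the key so that the store's sort order is the crawl order; the class name, key layout and the choice of a next-fetch timestamp as the priority are all assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/**
 * Illustrative only: encode the priority into the key so that the sorted
 * order of a key-value store is the crawl order. The key layout
 * [queue][0x00][nextFetchMillis][url] and the names used here are
 * assumptions, not an existing frontier API.
 */
public class SortableFrontierKey {

    /**
     * Builds a byte key whose unsigned lexicographic order sorts first by
     * queue, then by next-fetch time, then by URL.
     */
    public static byte[] build(String queue, long nextFetchMillis, String url) {
        byte[] q = queue.getBytes(StandardCharsets.UTF_8);
        byte[] u = url.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(q.length + 1 + Long.BYTES + u.length);
        buf.put(q);
        buf.put((byte) 0);            // separator keeps queues from interleaving
        buf.putLong(nextFetchMillis); // big-endian, so earlier dates sort first
        buf.put(u);
        return buf.array();
    }
}
```

Re-prioritising then means deleting and re-inserting keys, which is exactly the inflexibility described above, and the key layout effectively becomes part of the API.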
-
Vespa as an alternative to ES?
-
We've been using Pinot for a recent project and found it to be scalable and easy to use. It supports real-time ingestion and bulk loads, though real-time ingestion is currently only via Kafka. It now supports querying JSON, but if you wanted to order by metadata it might require a UDF for suitable performance at scale; otherwise you could just use Groovy scripts.
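As a rough illustration of the JSON-querying point (not tested against this project, and the table and column names below are made up), ordering candidates by a value stored in a JSON metadata column could use Pinot's JSON_EXTRACT_SCALAR function:

```java
/**
 * Illustrative Pinot SQL only: "frontier", "url", "queue", "nextFetchDate"
 * and the "metadata" JSON column are hypothetical names. At scale a
 * dedicated UDF or a materialised priority column may perform better,
 * as noted above.
 */
public class PinotFrontierQuery {
    public static final String NEXT_BATCH =
        "SELECT url, queue FROM frontier "
      + "WHERE nextFetchDate <= now() "
      + "ORDER BY JSON_EXTRACT_SCALAR(metadata, '$.priority', 'INT') DESC "
      + "LIMIT 500";
}
```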
-
Bubing has a Frontier implementation that could be reused and exposed through the API.
-
Hi @jnioche - I had a free hour or so and decided to take a stab at porting my old DRUM implementation to url-frontier. Just to give it a home, I created a "url-crawldb" repo under this (crawler-commons) organization on GitHub. But I didn't see an interface in url-frontier that a backend service would implement, if we just wanted to plug something into a layer that provides the common gRPC or HTTP API - did I miss something?
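To illustrate the kind of interface I mean - a hypothetical sketch only, not something that exists in url-frontier:

```java
import java.util.List;

/**
 * Hypothetical backend SPI, purely to illustrate the question above; this
 * interface does NOT exist in url-frontier. A thin layer exposing a common
 * gRPC or HTTP API would delegate to something like this.
 */
public interface FrontierBackend {

    /** Add or update a URL in the given queue, due at nextFetchMillis. */
    void put(String queue, String url, long nextFetchMillis);

    /** Return up to maxUrls URLs from the queue that are due for fetching. */
    List<String> getDue(String queue, int maxUrls);

    /** Remove a queue and all of its URLs. */
    void deleteQueue(String queue);
}
```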
-
Hi @kkrugler That's great! The reference implementation will be part of the URLFrontier project under the service module. Maybe don't describe the subproject you created as the reference implementation but as "an implementation"? I'm worried that people would get confused otherwise. Or maybe rename the project to reflect that it is a DRUM implementation? Maybe separate the DRUM part from its use as a Frontier backend, so that people can use it directly if they want to?

The interface is in the API module; it will be available as a Maven dependency soon, but for now you need to compile that module first. Have a look at the client and service modules, which use the API; the documentation in _API_ should also help. Let me know if I can help.
-
Have just merged an implementation of the service based on RocksDB. It seems to be doing the job very nicely, so we'll be using that as the reference implementation.
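To tie this back to the key-ordering discussion above, here is a heavily simplified sketch of how a RocksDB-backed frontier can work - not the actual reference implementation, just an illustration of the idea that sorted keys give you the crawl order:

```java
import java.nio.charset.StandardCharsets;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

/**
 * Heavily simplified sketch of a RocksDB-backed frontier, NOT the url-frontier
 * reference implementation: keys are stored in sorted byte order, so iterating
 * from the start of the DB yields URLs in crawl order.
 */
public class RocksDBFrontierSketch implements AutoCloseable {

    private final RocksDB db;

    public RocksDBFrontierSketch(String path) throws RocksDBException {
        RocksDB.loadLibrary();
        db = RocksDB.open(new Options().setCreateIfMissing(true), path);
    }

    /** Stores the URL under a key that sorts by queue, then by fetch time. */
    public void put(String queue, long nextFetchMillis, String url) throws RocksDBException {
        // zero-padded timestamp so lexicographic order matches numeric order
        String key = String.format("%s\u0000%019d\u0000%s", queue, nextFetchMillis, url);
        db.put(key.getBytes(StandardCharsets.UTF_8), url.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the next URL in crawl order, or null if the frontier is empty. */
    public String next() {
        try (RocksIterator it = db.newIterator()) {
            it.seekToFirst();
            return it.isValid() ? new String(it.value(), StandardCharsets.UTF_8) : null;
        }
    }

    @Override
    public void close() {
        db.close();
    }
}
```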
-
It is expected that a URL Frontier implementation will run as one (or possibly more) Docker container(s). We will have a simple memory-based implementation, as mentioned in #1. For something more scalable and robust we could store the URL info in Elasticsearch, as we do in StormCrawler; most of the selection logic could be borrowed from the StormCrawler spouts.
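As a rough sketch of what that selection logic boils down to - the query below is illustrative rather than the actual spout code, and the nextFetchDate field name simply mirrors StormCrawler's status index conventions - it selects URLs whose nextFetchDate has passed, oldest first:

```java
/**
 * Illustrative Elasticsearch query DSL, kept as a plain string. The
 * nextFetchDate field mirrors StormCrawler's status index conventions;
 * this is an assumption, not the actual spout implementation.
 */
public class ElasticFrontierQuery {
    public static final String NEXT_BATCH = """
        {
          "size": 100,
          "query": { "range": { "nextFetchDate": { "lte": "now" } } },
          "sort": [ { "nextFetchDate": { "order": "asc" } } ]
        }
        """;
}
```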
Here are a few relevant tools (feel free to add to the list):
WhirlPool URLFrontier - the Mercator scheme / rate-limiting / scheduling part of the whirlpool project; handles crawler priority and politeness. No explicit licensing.
Drum - to deduplicate incoming URLs
Frontera
Lucene instead of Elastic?