Suggestions for tools behind our reference implementation of the service #2

jnioche · 2020-12-11T16:45:26Z

jnioche
Dec 11, 2020
Maintainer

It is expected that a URL Frontier implementation will run as one (or possibly more) Docker container(s). We will have a simple memory-based implementation as mentioned in #1. For something more scalable and robust we could store the URL info in Elasticsearch as we do in StormCrawler; most of the selection & logic could be borrowed from the StormCrawler spouts.

Here are a few relevant tools: feel free to add to the list

WhirlPool URLFrontier - mercator scheme/rate-limiting/scheduling part of whirlpool project; handles crawler priority and politeness. No explicit licensing.
Drum - to deduplicate incoming URLs
Frontera
Lucene instead of Elastic?

anjackson · 2020-12-14T14:41:51Z

anjackson
Dec 14, 2020

FWIW, I'd consider:

a RockDB backend (for pretty-large-scale-but-still-one-DB).
a SQL backend compatible with CockroachDB (for a distributed/large-scale version).

I'm not sure how you're planning to implement prioritisation, and I think that'll affect things. H3 prioritises URLs as they go into the frontier, which is just a key-value DB with sorted keys and the key order is the crawl order. I think Frontera's strategy workers do the same. This is very efficient in terms of the database structure and speed, but makes it hard to change prioritisation. I guess the priority key would need to be part of your API in this case (?).

I expect ElasticSearch prioritisation is query-based? (This would also be natural for an SQL implementation I suppose). That means we're now managing secondary indexes, but the frontier is more decoupled from the prioritisation code.

TBH at UKWA we've not had much luck running things that need secondary indexes at high volume and speed (for us, that means a frontier with a few billions URLs, downloading c. 1 billion URLs a month). But that might not be a problem with the tools 😄 .

3 replies

jnioche Dec 15, 2020
Maintainer Author

thanks @anjackson

I need to describe some of the assumptions behind the API and prioritization is definitely one of them. What I have in mind is to have it determined by the nextFetchDate attached to the URLs when they are sent for storing by the crawlers. By default, the URLs will be sorted by that value so that the oldest URLs are to be fetched first. This is similar to what we do in StormCrawler with Elastic. It will be possible to let users specify a different primary sort, based on a value passed as metadata, maybe in a later version of the API; this way people will be able for instance to take the depth of crawl into account.

A key aspect of the URLFrontier is that we want to be able to control the scheduling of a URL so that it can be refetched in the future. This is useful for temporary fetch failures which will be retried later but also for continuous crawls where a URL such as a news feed is periodically refetched. If a URL is not to be refetched ever, then we can set a value in a distant future.

This is different from the view of a URLFrontier as a collection of buffers with a cache of URLs already seen in front. I think it will allow a wider range of use cases.

It also means that we have to store all the URLs and their associated metadata and, as you pointed out, tools behind the bonnet that provide more than just k/v storage.

Is this too ambitious do you think? Should we have different versions or tiers of the API for revisitable URLs vs one-off crawls?

anjackson Feb 9, 2021

Ah, I see you covered some of the things I talked about in #11 here. Sorry, I'm having trouble keeping up with everything what with one thing and another.

The 'ordered by next fetch date` queue prioritisation is not terribly appealing to me to be honest. I can see it working in the case where you pretty effectively get through the queues, but TBH we've moved away from scheduling future crawls directly in the frontier because it makes it difficult for change the schedules. i.e. we're using the H3 frontier to prioritise the backlog and manage crawl delay. We launch and relaunch crawls using external cron jobs that inject launch events into the crawler. This means if we want to stop recrawling (for example), we just don't request a recrawl. But perhaps there's a better way?

Now I'm confused as to where scoping and canonicalisation happen?!

jnioche Feb 10, 2021
Maintainer Author

Have created a specific discussion around prioritisation in #12

Now I'm confused as to where scoping and canonicalisation happen?!
What do you mean by these?

The filtering and normalisation of URLs is done outside the Frontier

jnioche · 2021-01-15T15:31:52Z

jnioche
Jan 15, 2021
Maintainer Author

Vespa as an alternative to ES?

0 replies

kkrugler · 2021-01-15T16:50:00Z

kkrugler
Jan 15, 2021
Maintainer

We've been using Pinot for a recent project, and found it to be scalable and easy to use. It supports real-time ingestion and bulk loads, though real time is currently only via Kafka. It now supports querying JSON, but if you wanted to order by metadata it might require a UDF for suitable performance at scale, otherwise you could just use Grovvy scripts.

0 replies

jnioche · 2021-02-10T11:28:29Z

jnioche
Feb 10, 2021
Maintainer Author

Bubing has a Frontier implementation that could be reused and exposed through the API
The authors (@vigna) are aware of URL Frontier

0 replies

kkrugler · 2021-02-14T23:49:10Z

kkrugler
Feb 14, 2021
Maintainer

Hi @jnioche - I had a free hour or so, and decided to take a stab at porting my old DRUM implementation to url-frontier. Just to give it a home, I created a "url-crawldb" repo under this (crawler-commons) organization in GitHub. But I didn't see an interface in url-frontier that a backend service would implement, if we just wanted to plug in something to a layer that provided a common gRPC or HTTP layer - did I miss something?

0 replies

jnioche · 2021-02-15T12:09:33Z

jnioche
Feb 15, 2021
Maintainer Author

Hi @kkrugler

That's great! The reference implementation will be part of the URLFrontier project under the service module. Maybe don't describe the subproject you created as the reference implementation but "an implementation"? I'm worried that people would get confused otherwise. Or maybe rename the project to reflect that it is a DRUM implementation? Maybe separate the DRUM part from its use as a Frontier backend so that ppl can use it directly if they want to?

The interface is in the API module, it will be available as a Maven dependency soon but for now, you need to compile that module first. Have a look at the client and service modules, they use the API; also the documentation in _API_should help.

Let me know if I can help

3 replies

kkrugler Feb 17, 2021
Maintainer

I've reviewed my implementation (it's been a few years). It's really optimized for focused crawling, as it assumes every URL has an estimated (or actual) score that can be used for ordering, and partitioning. I partition into fetching, active, and archived. We put URLs in to an archived state because we found our continuous crawl getting clogged with many, many low-scoring URLs from a (relatively) small number of domains. So this process of determining what should be fetched, versus pending (active), versus archived depends on every URL having a score. That doesn't really fit with the current API, as there's no estimated/actual score.

jnioche Feb 17, 2021
Maintainer Author

hi @kkrugler
out of curiosity, how is that score computed? Could this logic be put within the service?

you could of course pass the score through the metadata but that implies that a crawler in order to use your implementation would need to provide such a score. That's probably OK though. The API would still provide a unified way of getting and putting URLs etc...

It would be valuable to have your DRUM code in the repo you created, even if we don't use it with URL Frontier

jnioche Feb 17, 2021
Maintainer Author

@kkrugler see discussion on /discussions/12 for the use of metadata and prioritisation

jnioche · 2021-04-12T15:46:52Z

jnioche
Apr 12, 2021
Maintainer Author

Have just merged an implementation of the service based on RocksDB. It seems to be doing the job very nicely so we'll be using that for the reference implementation.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggestions for tools behind our reference implementation of the service #2

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 7 comments 6 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Suggestions for tools behind our reference implementation of the service #2

jnioche Dec 11, 2020 Maintainer

Replies: 7 comments · 6 replies

anjackson Dec 14, 2020

jnioche Dec 15, 2020 Maintainer Author

anjackson Feb 9, 2021

jnioche Feb 10, 2021 Maintainer Author

jnioche Jan 15, 2021 Maintainer Author

kkrugler Jan 15, 2021 Maintainer

jnioche Feb 10, 2021 Maintainer Author

kkrugler Feb 14, 2021 Maintainer

jnioche Feb 15, 2021 Maintainer Author

kkrugler Feb 17, 2021 Maintainer

jnioche Feb 17, 2021 Maintainer Author

jnioche Feb 17, 2021 Maintainer Author

jnioche Apr 12, 2021 Maintainer Author

jnioche
Dec 11, 2020
Maintainer

Replies: 7 comments 6 replies

anjackson
Dec 14, 2020

jnioche Dec 15, 2020
Maintainer Author

jnioche Feb 10, 2021
Maintainer Author

jnioche
Jan 15, 2021
Maintainer Author

kkrugler
Jan 15, 2021
Maintainer

jnioche
Feb 10, 2021
Maintainer Author

kkrugler
Feb 14, 2021
Maintainer

jnioche
Feb 15, 2021
Maintainer Author

kkrugler Feb 17, 2021
Maintainer

jnioche Feb 17, 2021
Maintainer Author

jnioche Feb 17, 2021
Maintainer Author

jnioche
Apr 12, 2021
Maintainer Author