Some feedback on the URL Frontier API #11

anjackson · 2021-02-09T10:21:25Z

anjackson
Feb 9, 2021

Hi - I'm at the UK Web Archive and we're big users of Heritrix. We're interested in shifting to a standalone Frontier because of scaling and managment issues with Heritrix's BDB-JE frontier implementation. We're also interested in things like using a mixture of crawlers, e.g. browser-based crawling behind a WARC-writing proxy (warcprox) for some pages, and traditional crawler behaviour for others.

My main concern with any external frontier is minimising the possiblity that URLs get 'dropped' when a fetcher fails. The GetParams delay_requestable feature should ensure that URLs sent out for crawl get re-tried if the corresponding PutURLs(KnownURLItem) doesn't show up. Given the importance of the delay_requestable parameter, it's worth considering whether it should be made a required parameter (rather than an optional parameter with no default value, as it appears to be at present).

As far at the URLInfo itself, both Heritrix3 and Scrapy allow arbitrary metadata to be attached to a URI. The URLInfo.MetadataEntry proposed in this API is a simple String->String[] map. This is much simpler than the Key-to-KryoBlob used by Hertirix or the PickleBlob used by Scrapy, and so I'd expect a degree of implementation friction to turn up at this point.

However, for a tool-independent API, this is likely unavoidable and possibly desirable. Being able to stick arbitrary odds and ends inside a critical object can encourage brittle coding conventions and unnecessary bloat, and multi-level heirarchies of string properties can be flattened easily enough. It may just be a case of making this limitation clear in the documentation and covering off any edge cases (e.g. can the String[] hold a binary blob?).

In terms of being able to switch between crawl strategies (browser-based or traditional), the Frontier spec. probably shouldn't be involved in routing to different crawlers as it is in charge of ensuring no one site is hammered too hard. That said, perhaps GetParams.key should be a key queue prefix rather than a queue key? This would offer quite a lot of potential flexibility for things like queue-to-client management.

I'd like to check I understand how overall crawl-delay management is intended to be handled. In our current arrangement, the fetch chain updates the queue crawl delay based on various factors. Presumably this would be implemented via the BlockQueueUntil API?

In terms of gaps, there's a couple of things that I wanted to consider.

The first was whether frontier operation should be part of the API, or whether that's out of scope. Specifically, I'd want to be able to automate the process of pausing the whole frontier. I think you're only interested in Fetcher/Frontier interactions though, in which case frontier control is out of scope.

The other aspect is URL prioritisation. It's a critical part of the role of the Frontier, but it's currently absent from the API entirely. Of course, there are lots of ways of doing prioritisation, and this makes it hard to standardise. But in the absense of standardisation, this will be dependent on out-of-band conventions that will affect compatibility between different implementations.

For example, among other things, we use hop-path depth to influence prioritisation, which works fine because the underlying frontier implementation just stores an prioritisation number that's used as a key prefix. The downside of this model is URL prioritisation is fixed at insertion time, and re-prioritising URLs is hard. The proposed Frontier API does not support prioritisation explicitly, so this would have to be a metadata key convention shared by fetcher and frontier implementation.

A more flexible approach is to prioritise when URLs are released from the frontier, e.g. by storing metadata fields like hop-path in the frontier and changing the prioritisation query to use the metadata. But this still relies on the fetchers knowing that the necessary metadata fields must be supplied.

Perhaps there's not much to be done about that at the API level. But it would be crucial for cross-crawler compatability that any metadata conventions are declared somehow.

jnioche · 2021-02-10T08:12:04Z

jnioche
Feb 10, 2021
Maintainer

Thanks a lot @anjackson, much appreciated. I''ll start answering some of your points

delay_requestable

You described very well the intention behind it.
Annoyingly you can't make a field required nor give it an explicit default value in protobuf v3. The best thing we could do would be to explicitly mention that if the default value of 0 is set, the implementation should use a value set by convention (300?).

metadata

We could mention that a binary blob should be serialized to a String and suggest e.g. base64 for this but now that you mention it, we could also have String->byte[][].

queue prefix and multiple crawlers

I hadn't really considered the use case of having different crawlers hitting the Frontier to be honest. Are you suggesting to use prefixes to distinguish between them e.g. 'browser-bl.uk' vs 'heritrix-bl.uk'?

I would have had 2 different instances of the Frontier for that purpose and kept the keys to be purely hostnames or domains.

Another approach would be to use the metadata and specify a filter based on a key/value but that implies the backend having some sort of indexing facility. Alternatively, it would be simpler to attach metadata to the queues themselves (instead of their content) and expose the filtering mechanism via GetParams.

crawl delay management

This had been discussed in a separate thread - which annoyingly I have closed and is not available anymore. This is not really what BlockQueueUntil is about. I will expand this in a separate thread as it is quite important.

pause the frontier

I had in the API but removed it, probably because I wasn't sure it was useful

75177aa#diff-1e84f1e5a185788e292dedd096140829ef7358c364ed23bed9a1f47c26136959L43

is that what you had in mind?

URL prioritisation

That's a crucial topic, I'll put that as a separate thread.

3 replies

anjackson Aug 24, 2021
Author

Thanks for this. Happy with your suggestions above. I think the multiple crawlers thing is probably better handled elsewhere (either above multiple frontiers or within the frontier client), for the time being at least.

anjackson Aug 25, 2021
Author

Although I'm still not understanding the difference between BlockQueueUntil and SetDelay 😂

jnioche Sep 3, 2021
Maintainer

Although I'm still not understanding the difference between BlockQueueUntil and SetDelay

@anjackson

the latter set the frequency at which URLs from a given queue should be emitted (typically if the robots.txt says e.g. 30secs) whereas the former is a set point in time until which the queue won't emit anything. you can think of it as similar to pause but for a specific queue.

jnioche · 2021-03-26T11:07:44Z

jnioche
Mar 26, 2021
Maintainer

@anjackson

crawl delay and pausing of the frontier have been added in 51baee3

0 replies

jnioche · 2022-02-16T09:51:15Z

jnioche
Feb 16, 2022
Maintainer

I think the multiple crawlers thing is probably better handled elsewhere (either above multiple frontiers or within the frontier client), for the time being at least.

@anjackson I have added support for handling multiple crawls within a Frontier instance

see #47

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some feedback on the URL Frontier API #11

{{title}}

Replies: 3 comments 3 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Some feedback on the URL Frontier API #11

anjackson Feb 9, 2021

Replies: 3 comments · 3 replies

jnioche Feb 10, 2021 Maintainer

anjackson Aug 24, 2021 Author

anjackson Aug 25, 2021 Author

jnioche Sep 3, 2021 Maintainer

jnioche Mar 26, 2021 Maintainer

jnioche Feb 16, 2022 Maintainer

anjackson
Feb 9, 2021

Replies: 3 comments 3 replies

jnioche
Feb 10, 2021
Maintainer

anjackson Aug 24, 2021
Author

anjackson Aug 25, 2021
Author

jnioche Sep 3, 2021
Maintainer

jnioche
Mar 26, 2021
Maintainer

jnioche
Feb 16, 2022
Maintainer