Replies: 3 comments 3 replies
-
Thanks a lot @anjackson, much appreciated. I'll start answering some of your points.

**delay_requestable** — You described the intention behind it very well.

**metadata** — We could mention that a binary blob should be serialized to a String and suggest e.g. base64 for this, but now that you mention it, we could also have

**queue prefix and multiple crawlers** — I hadn't really considered the use case of having different crawlers hitting the Frontier, to be honest. Are you suggesting using prefixes to distinguish between them, e.g. 'browser-bl.uk' vs 'heritrix-bl.uk'? I would have used 2 different instances of the Frontier for that purpose and kept the keys purely hostnames or domains. Another approach would be to use the metadata and specify a filter based on a key/value, but that implies the backend having some sort of indexing facility. Alternatively, it would be simpler to attach metadata to the queues themselves (instead of their content) and expose the filtering mechanism via GetParams.

**crawl delay management** — This had been discussed in a separate thread, which annoyingly I have closed and is no longer available. This is not really what BlockQueueUntil is about. I will expand on this in a separate thread as it is quite important.

**pause the frontier** — I had it in the API but removed it, probably because I wasn't sure it was useful: 75177aa#diff-1e84f1e5a185788e292dedd096140829ef7358c364ed23bed9a1f47c26136959L43. Is that what you had in mind?

**URL prioritisation** — That's a crucial topic; I'll put that in a separate thread.
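The base64 suggestion above could look roughly like this sketch; the helper names and metadata shape are assumptions for illustration, not part of the API:

```python
import base64

# Hypothetical shape of a URLInfo metadata map: String -> list of Strings.
# Binary values must be serialised to text; base64 is one portable option.
def put_blob(metadata: dict, key: str, blob: bytes) -> None:
    """Store a binary blob under `key` as a base64 string."""
    metadata[key] = [base64.b64encode(blob).decode("ascii")]

def get_blob(metadata: dict, key: str) -> bytes:
    """Recover the original bytes from the base64 string."""
    return base64.b64decode(metadata[key][0])

meta = {}
put_blob(meta, "robots.cache", b"\x00\x01binary payload")
assert get_blob(meta, "robots.cache") == b"\x00\x01binary payload"
```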
-
Crawl delay and pausing of the frontier have been added in 51baee3.
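As a rough illustration of the crawl-delay side, here is a toy in-memory sketch of BlockQueueUntil-style behaviour; the class and method names are hypothetical, not the actual API:

```python
import time

# Toy in-memory frontier illustrating BlockQueueUntil-style crawl-delay
# handling; names here are illustrative only.
class Frontier:
    def __init__(self):
        self.blocked_until = {}   # queue key -> epoch seconds

    def block_queue_until(self, key: str, when: float) -> None:
        """Mark a queue as non-requestable until the given time."""
        self.blocked_until[key] = when

    def is_requestable(self, key: str, now: float) -> bool:
        return now >= self.blocked_until.get(key, 0.0)

frontier = Frontier()
now = time.time()
# After fetching from 'example.org', honour a 5-second crawl delay:
frontier.block_queue_until("example.org", now + 5.0)
assert not frontier.is_requestable("example.org", now)
assert frontier.is_requestable("example.org", now + 5.0)
```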
-
@anjackson I have added support for handling multiple crawls within a Frontier instance, see #47.
-
Hi - I'm at the UK Web Archive and we're big users of Heritrix. We're interested in shifting to a standalone Frontier because of scaling and management issues with Heritrix's BDB-JE frontier implementation. We're also interested in things like using a mixture of crawlers, e.g. browser-based crawling behind a WARC-writing proxy (warcprox) for some pages, and traditional crawler behaviour for others.
My main concern with any external frontier is minimising the possibility that URLs get 'dropped' when a fetcher fails. The GetParams `delay_requestable` feature should ensure that URLs sent out for crawl get re-tried if the corresponding `PutURLs(KnownURLItem)` doesn't show up. Given the importance of the `delay_requestable` parameter, it's worth considering whether it should be made a required parameter (rather than an optional parameter with no default value, as it appears to be at present).

As for the `URLInfo` itself, both Heritrix3 and Scrapy allow arbitrary metadata to be attached to a URI. The `URLInfo.MetadataEntry` proposed in this API is a simple `String -> String[]` map. This is much simpler than the Key-to-KryoBlob used by Heritrix or the PickleBlob used by Scrapy, and so I'd expect a degree of implementation friction to turn up at this point. However, for a tool-independent API, this is likely unavoidable and possibly desirable. Being able to stick arbitrary odds and ends inside a critical object can encourage brittle coding conventions and unnecessary bloat, and multi-level hierarchies of string properties can be flattened easily enough. It may just be a case of making this limitation clear in the documentation and covering off any edge cases (e.g. can the `String[]` hold a binary blob?).
In terms of being able to switch between crawl strategies (browser-based or traditional), the Frontier spec probably shouldn't be involved in routing to different crawlers, as its job is ensuring no one site is hammered too hard. That said, perhaps GetParams.key should be a queue-key prefix rather than a full queue key? This would offer quite a lot of potential flexibility for things like queue-to-client management.
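A minimal sketch of the prefix idea, assuming GetParams.key were treated as a prefix (hypothetical semantics, not the current spec):

```python
# Queue keys carrying a crawler-type prefix, so different fetchers can
# select only the queues meant for them (illustrative data only).
queues = {
    "browser-bl.uk":   ["https://www.bl.uk/page-needing-js"],
    "heritrix-bl.uk":  ["https://www.bl.uk/robots.txt"],
    "heritrix-gov.uk": ["https://www.gov.uk/"],
}

def get_queues_by_prefix(prefix: str):
    """Return queue keys matching a crawler-type prefix."""
    return sorted(k for k in queues if k.startswith(prefix))

assert get_queues_by_prefix("heritrix-") == ["heritrix-bl.uk", "heritrix-gov.uk"]
assert get_queues_by_prefix("browser-") == ["browser-bl.uk"]
```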
I'd like to check I understand how overall crawl-delay management is intended to be handled. In our current arrangement, the fetch chain updates the queue crawl delay based on various factors. Presumably this would be implemented via the `BlockQueueUntil` API?

In terms of gaps, there are a couple of things that I wanted to consider.
The first was whether frontier operation should be part of the API, or whether that's out of scope. Specifically, I'd want to be able to automate the process of pausing the whole frontier. I think you're only interested in Fetcher/Frontier interactions though, in which case frontier control is out of scope.
The other aspect is URL prioritisation. It's a critical part of the role of the Frontier, but it's currently absent from the API entirely. Of course, there are lots of ways of doing prioritisation, and this makes it hard to standardise. But in the absence of standardisation, this will be dependent on out-of-band conventions that will affect compatibility between different implementations.
For example, among other things, we use hop-path depth to influence prioritisation, which works fine because the underlying frontier implementation just stores a prioritisation number that's used as a key prefix. The downside of this model is that URL prioritisation is fixed at insertion time, and re-prioritising URLs is hard. The proposed Frontier API does not support prioritisation explicitly, so this would have to be a metadata key convention shared by fetcher and frontier implementation.
A more flexible approach is to prioritise when URLs are released from the frontier, e.g. by storing metadata fields like hop-path in the frontier and changing the prioritisation query to use the metadata. But this still relies on the fetchers knowing that the necessary metadata fields must be supplied.
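The two prioritisation models above can be contrasted in a small sketch (all names and the hop-count metadata key are illustrative, not a proposed convention):

```python
import heapq

# (1) Insertion-time: priority is baked into the stored key, so ordering
# is fixed once the URL is written (cheap, but re-prioritising is hard).
store = []
for hops, url in [(2, "http://a/deep"), (0, "http://a/seed"), (1, "http://a/link")]:
    heapq.heappush(store, (hops, url))  # key prefix = hop-path depth
assert heapq.heappop(store)[1] == "http://a/seed"

# (2) Release-time: store metadata and rank when URLs are handed out,
# so the ranking function can change without rewriting the frontier.
records = [
    {"url": "http://a/deep", "meta": {"hops": 2}},
    {"url": "http://a/seed", "meta": {"hops": 0}},
]
released = sorted(records, key=lambda r: r["meta"]["hops"])
assert released[0]["url"] == "http://a/seed"
```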
Perhaps there's not much to be done about that at the API level. But it would be crucial for cross-crawler compatibility that any metadata conventions are declared somehow.