extract rules for using LLM, and use it for non-ai #448

AzizNadirov · 2025-01-13T19:35:01Z

Hi, can I extract information from a source (by crawling) by providing a model (product_name, price, etc.), get the result, and also extract the CSS schema it used, so that I can reuse that schema on other pages in a non-AI, CSS-only manner?

aravindkarnam · 2025-01-15T13:39:14Z

Collaborator

@AzizNadirov Let me try to unpack your question. You want to provide the fields (product_name, price, etc.) and have the model/algorithm map the HTML id/class names of the page elements to those specific data fields.
Then, on the next run, you expect the crawler to extract the data based on those HTML/styling attributes and return it? Am I understanding this correctly?

4 replies

AzizNadirov Jan 15, 2025
Author

Yes, exactly.

aravindkarnam Jan 19, 2025
Collaborator

@AzizNadirov To the best of my knowledge, we currently don't have a feature like this. We will certainly keep this use case in mind while planning our future roadmap.

However, you can get the raw HTML from the crawler result via result.html, then have a model (such as ChatGPT or Claude) work out the mapping between the class/id/name attributes of the elements and the desired data fields. You can then extract the data with JsonCssExtractionStrategy or JsonXPathExtractionStrategy.
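A minimal sketch of the workaround described above, under stated assumptions: `ask_llm` is a hypothetical callable standing in for whichever model client you use (it takes a prompt string and returns the model's text reply), and the schema shape (`name`, `baseSelector`, and `fields` entries with `name`/`selector`/`type`) is my understanding of what JsonCssExtractionStrategy expects, not something stated in this thread. The code only builds the prompt and validates the reply; it does not call crawl4ai itself.

```python
import json
from typing import Callable

# Keys each field entry is assumed to need in a CSS extraction schema.
REQUIRED_FIELD_KEYS = {"name", "selector", "type"}


def build_prompt(html: str, fields: list[str]) -> str:
    """Ask the model to map the requested data fields to CSS selectors."""
    return (
        "Given the HTML below, reply with only a JSON object that has the keys "
        "'name', 'baseSelector' and 'fields'. Each entry in 'fields' needs "
        "'name', 'selector' and 'type'. "
        f"Fields wanted: {', '.join(fields)}.\n\nHTML:\n{html}"
    )


def schema_from_llm(
    html: str, fields: list[str], ask_llm: Callable[[str], str]
) -> dict:
    """Get a schema from the model and reject replies that are unusable."""
    schema = json.loads(ask_llm(build_prompt(html, fields)))
    if not isinstance(schema, dict) or not isinstance(schema.get("baseSelector"), str):
        raise ValueError("schema must be an object with a string 'baseSelector'")
    entries = schema.get("fields")
    if not isinstance(entries, list):
        raise ValueError("schema must contain a 'fields' list")
    for entry in entries:
        if not isinstance(entry, dict) or not REQUIRED_FIELD_KEYS <= entry.keys():
            raise ValueError(f"incomplete field entry: {entry!r}")
    missing = set(fields) - {entry["name"] for entry in entries}
    if missing:
        raise ValueError(f"model did not map these fields: {sorted(missing)}")
    return schema
```

The validated dict could then be handed to JsonCssExtractionStrategy for the actual, LLM-free extraction on other pages. Validating before reuse matters here because a wrong selector would otherwise fail silently on every later crawl.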

You can find some useful examples here

Cc: @unclecode Interesting use case ☝🏼

Answer selected by AzizNadirov

AzizNadirov Jan 19, 2025
Author

Thanks, now I am working on it.

unclecode Jan 19, 2025
Maintainer

@AzizNadirov Actually, there’s a static helper function in the JsonCss and JsonXPath extraction strategies. You pass it the raw HTML and it returns the schema, so you only need to call the LLM once. Voilà! I’ve been testing it for a while and might merge it into the main branch in the next version.
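The "call the LLM once" idea can be sketched as a small cache around schema generation. The thread does not name the helper or give its signature, so `generate` below is a hypothetical zero-argument callable standing in for it (or for any LLM-backed schema builder), and the file path is illustrative.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_create_schema(path: Path, generate: Callable[[], dict]) -> dict:
    """Return the schema saved at `path`, generating and saving it if absent.

    `generate` is only invoked when no saved schema exists, so the
    LLM-backed step runs at most once per schema file.
    """
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    schema = generate()
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema
```

Later crawls of pages with the same layout can then load the saved schema and run a CSS or XPath extraction strategy without any model call. If the site's markup changes, deleting the file forces a regeneration.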

@aravindkarnam
