fix: explicit fingerprinter param in custom dupefilter #73

Merged
jasonbosco merged 4 commits into typesense:master on Dec 9, 2024

SheezZarR
Contributor

Change Summary

The scraper declares Scrapy = ">=2.2.1" as its dependency. The recent update from 2.11.* to 2.12.1 introduced a new requirement (see here):

RFPDupeFilter subclasses now require supporting the fingerprinter parameter in their __init__ method, introduced in Scrapy 2.7.0. (issue 6102, issue 6113)

Hence I updated the signature of the class CustomDupeFilter. The explicit return statement is due to pyright complaints.
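
For readers unfamiliar with the breakage, here is a minimal sketch of the pattern, not the actual diff. It assumes the framework builds the dupefilter by passing `fingerprinter=` as a keyword argument, as the release note implies. `BaseDupeFilter` and `FakeCrawler` are hypothetical stand-ins for Scrapy's `RFPDupeFilter` and crawler so the snippet runs without Scrapy installed, and `OldStyleDupeFilter` only illustrates the previous signature.

```python
from typing import Any, Optional


class BaseDupeFilter:
    """Hypothetical stand-in for scrapy.dupefilters.RFPDupeFilter (not the real class)."""

    def __init__(
        self,
        path: Optional[str] = None,
        debug: bool = False,
        *,
        fingerprinter: Optional[Any] = None,
    ) -> None:
        self.path = path
        self.debug = debug
        self.fingerprinter = fingerprinter

    @classmethod
    def from_crawler(cls, crawler: Any) -> "BaseDupeFilter":
        # Assumption: the framework always forwards the fingerprinter by keyword.
        return cls(None, False, fingerprinter=crawler.request_fingerprinter)


class OldStyleDupeFilter(BaseDupeFilter):
    """Subclass with the old signature: it does not accept `fingerprinter`."""

    def __init__(self, path: Optional[str] = None, debug: bool = False) -> None:
        super().__init__(path, debug)


class CustomDupeFilter(BaseDupeFilter):
    """Subclass with the updated signature: accepts and forwards `fingerprinter`."""

    def __init__(
        self,
        path: Optional[str] = None,
        debug: bool = False,
        *,
        fingerprinter: Optional[Any] = None,
    ) -> None:
        super().__init__(path, debug, fingerprinter=fingerprinter)


class FakeCrawler:
    """Hypothetical crawler exposing only the attribute used above."""

    request_fingerprinter = object()
```

With these stand-ins, `CustomDupeFilter.from_crawler(FakeCrawler())` succeeds, while `OldStyleDupeFilter.from_crawler(FakeCrawler())` raises a `TypeError` about an unexpected keyword argument.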

@SheezZarR
Contributor Author

Perhaps it's best to pin the versions in the Pipfile.lock file?

@jasonbosco
Member

@SheezZarR Yeah good idea - mind pinning the versions as part of this PR?

@SheezZarR
Contributor Author

SheezZarR commented Nov 29, 2024

@jasonbosco I have updated the file.
However, two tests are failing, but I believe that's unrelated..?

FAILED scraper/src/tests/typesense_helper/commit_tmp_test.py::test_create_tmp_collection - AssertionError: assert {'default_sor...lection', ...} == {'default_sor...lection', ...}
FAILED scraper/src/tests/typesense_helper/commit_tmp_test.py::test_create_tmp_collection_already_exists - typesense.exceptions.ObjectAlreadyExists: [Errno 409] A collection with name `collection` already exists.

I am using local typesense server of version 0.25.2-1. (here)
Unable to test with other versions atm

@tharropoulos
Contributor

Hey @SheezZarR, thanks for submitting this PR. The two failing tests are most likely caused by Typesense version differences. I'll check it out myself shortly to verify.

@tharropoulos
Contributor

I am using local typesense server of version 0.25.2-1. (here)

Since I'm also using an Arch-based distribution myself, I'd suggest using our Docker image for local development, as the AUR package doesn't appear to be actively maintained.

@tharropoulos
Contributor

Can verify that this is a v0.25 Typesense server issue. The problem itself was that the server response includes a store: true value on each of the fields. Will create a PR for v27.1, but you can safely ignore these errors.
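
To illustrate why the assertion fails, here is a rough sketch assuming the tests compare the expected schema with the server response using strict equality. The field dictionaries and the `without_keys` helper are hypothetical and are not taken from the test suite or from the eventual fix.

```python
from typing import Any, Dict, Iterable, List

Field = Dict[str, Any]


def without_keys(fields: Iterable[Field], ignored: Iterable[str]) -> List[Field]:
    """Return copies of the field dicts with the ignored keys dropped."""
    ignored_set = set(ignored)
    return [
        {key: value for key, value in field.items() if key not in ignored_set}
        for field in fields
    ]


# Hypothetical schema the test expects.
expected_fields = [{"name": "content", "type": "string"}]

# Hypothetical response from a server that adds `store: true` to each field.
server_fields = [{"name": "content", "type": "string", "store": True}]

# Strict equality fails because of the extra key on every field.
strict_match = expected_fields == server_fields

# Ignoring the server-added key makes the comparison pass.
relaxed_match = without_keys(server_fields, ["store"]) == expected_fields
```

Here `strict_match` is `False` and `relaxed_match` is `True`, which matches the shape of the `AssertionError` in the output above: both sides look alike except for the extra key.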

@SheezZarR
Contributor Author

@tharropoulos thanks a lot for helping out!
I have seen that the AUR package is a bit obsolete. Colleagues have strong opinions on Docker, so we are not using it atm :) I will try to fix the lint issues and update the PR.

@jasonbosco
Member

jasonbosco commented Dec 2, 2024

@SheezZarR I wonder if the linting rules got updated because the linter was not pinned. If the failing lint checks are from code that wasn't modified in this PR, feel free to pin the linter to the previous version as well.

I've also merged #74 that @tharropoulos put together. So if you merge master into your branch, those tests should also pass now.

@SheezZarR
Contributor Author

@jasonbosco thanks, I was a bit clumsy on this one. Hope the update will fix the problem.

@SheezZarR
Contributor Author

SheezZarR commented Dec 7, 2024

@jasonbosco CI/CD keeps failing in my fork on GitHub Actions. I'll report back when the Action is ready to run here.

I find it a bit strange because tests and linting are green locally.

> pipenv run ./docsearch test no_browser
Loading .env environment variables...
['pytest', './scraper/src', '-k', 'not _browser']
============================= test session starts ==============================
platform linux -- Python 3.11.10, pytest-8.3.3, pluggy-1.5.0
rootdir: /home/SheezZarR/Documents/dev/python/typesense-docsearch-scraper
collected 104 items / 7 deselected / 97 selected

scraper/src/tests/config_loader/anchors_test.py ... [  3%]
scraper/src/tests/config_loader/basic_test.py .... [  7%]
scraper/src/tests/config_loader/domains_test.py .... [ 11%]
scraper/src/tests/config_loader/get_extra_facets_test.py . [ 12%]
scraper/src/tests/config_loader/selectors_exclude_test.py ... [ 15%]
scraper/src/tests/config_loader/sitemap_test.py ... [ 18%]
scraper/src/tests/config_loader/start_urls_test.py ..... [ 23%]
scraper/src/tests/config_loader/stop_urls_test.py .. [ 25%]
scraper/src/tests/default_strategy/custom_attributes_test.py . [ 26%]
scraper/src/tests/default_strategy/default_value_test.py ...... [ 32%]
scraper/src/tests/default_strategy/get_anchor_test.py ....... [ 40%]
scraper/src/tests/default_strategy/get_hierarchy_radio_test.py ... [ 43%]
scraper/src/tests/default_strategy/get_level_weight_test.py . [ 44%]
scraper/src/tests/default_strategy/get_records_from_dom_test.py ................. [ 61%]
scraper/src/tests/default_strategy/get_settings_test.py . [ 62%]
scraper/src/tests/default_strategy/globals_test.py ...... [ 69%]
scraper/src/tests/default_strategy/meta_test.py ......... [ 78%]
scraper/src/tests/default_strategy/min_indexed_level_test.py . [ 79%]
scraper/src/tests/default_strategy/page_rank_test.py .... [ 83%]
scraper/src/tests/default_strategy/searchable_level_test.py .. [ 85%]
scraper/src/tests/default_strategy/strip_chars_test.py .. [ 87%]
scraper/src/tests/default_strategy/tags_test.py ... [ 90%]
scraper/src/tests/default_strategy/xpath_test.py ... [ 93%]
scraper/src/tests/typesense_helper/commit_tmp_test.py ...... [100%]

====================== 97 passed, 7 deselected in 2.73s =======================
> pipenv run pylint scraper cli deployer
Loading .env environment variables...

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

@SheezZarR
Contributor Author

@jasonbosco looks green to me

Some weird shenanigans with the lock-file.

jasonbosco merged commit f07889e into typesense:master on Dec 9, 2024
1 check passed
@jasonbosco
Member

Thank you again for the PR @SheezZarR!

@jasonbosco
Member

The changes are now in typesense/docsearch-scraper:0.12.0.rc2. Could you give it a shot now?

@SheezZarR
Contributor Author

@jasonbosco yeah, the scraper works now! 👍
