Commit

Fix hierarchy_radio error when scraping
jasonbosco committed Dec 27, 2024
1 parent f07889e commit 679c127
Showing 4 changed files with 23 additions and 9 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -20,3 +20,4 @@ geckodriver.log
 
 configs/private
 /typesense-server-data/
+typesense-data/
7 changes: 7 additions & 0 deletions README.md
@@ -37,6 +37,13 @@ Remember to change the version numbers in the URL as needed.
 
 This section only applies if you're making changes to this scraper itself. If you only need to run the scraper, see Usage instructions above.
 
+#### Running the code locally
+
+```shellsession
+$ pipenv shell
+$ ./docsearch run configs/public/typesense_docs.json
+```
+
 #### Releasing a new version
 
 Basic/abbreviated instructions:
14 changes: 9 additions & 5 deletions configs/public/typesense_docs.json
@@ -5,7 +5,7 @@
       "url": "https://typesense.org/docs/(?P<version>.*?)/",
       "variables": {
         "version": [
-          "0.21.0"
+          "27.1"
         ]
       }
     }
@@ -22,9 +22,13 @@
   },
   "scrape_start_urls": false,
   "strip_chars": " .,;:#",
-  "nb_hits": 505,
   "custom_settings": {
-    "token_separators": ["_"],
-    "symbols_to_index": ["*"]
-  }
+    "token_separators": [
+      "_"
+    ],
+    "symbols_to_index": [
+      "*"
+    ]
+  },
+  "nb_hits": 16502
 }
10 changes: 6 additions & 4 deletions scraper/src/typesense_helper.py
@@ -209,10 +209,12 @@ def transform_record(record):
 
         # Flatten nested hierarchy fields
         for x in range(0, 7):
-            if record['hierarchy'][f'lvl{x}'] is not None:
-                transformed_record[f'hierarchy.lvl{x}'] = record['hierarchy'][f'lvl{x}']
-            if record['hierarchy_radio'][f'lvl{x}'] is not None:
-                transformed_record[f'hierarchy_radio.lvl{x}'] = record['hierarchy_radio'][f'lvl{x}']
+            if 'hierarchy' in record and f'lvl{x}' in record['hierarchy']:
+                if record['hierarchy'][f'lvl{x}'] is not None:
+                    transformed_record[f'hierarchy.lvl{x}'] = record['hierarchy'][f'lvl{x}']
+            if 'hierarchy_radio' in record and f'lvl{x}' in record['hierarchy_radio']:
+                if record['hierarchy_radio'][f'lvl{x}'] is not None:
+                    transformed_record[f'hierarchy_radio.lvl{x}'] = record['hierarchy_radio'][f'lvl{x}']
 
         # Convert version to array
         if 'version' in record and type(record['version']) == str:
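
Note on the change above: the old loop indexed `record['hierarchy_radio']` unconditionally, so a record without that key raised `KeyError`. The sketch below isolates the guarded flattening in a standalone function to show the behavior on such a record. `flatten_hierarchy` and the sample record are hypothetical and not part of this commit. The sketch uses `dict.get` rather than the `in` checks from the diff, which additionally tolerates a field that is present but set to `None`.

```python
def flatten_hierarchy(record, levels=7):
    """Hypothetical helper: flatten nested hierarchy fields into dotted keys.

    Missing fields, missing levels, and None values are skipped
    instead of raising KeyError.
    """
    flat = {}
    for field in ('hierarchy', 'hierarchy_radio'):
        # Fall back to an empty dict when the field is absent or None.
        nested = record.get(field) or {}
        for x in range(levels):
            value = nested.get(f'lvl{x}')
            if value is not None:
                flat[f'{field}.lvl{x}'] = value
    return flat


# Sample record with no 'hierarchy_radio' key (assumed shape, for illustration).
record = {'hierarchy': {'lvl0': 'Guide', 'lvl1': 'Install', 'lvl2': None}}
print(flatten_hierarchy(record))
# {'hierarchy.lvl0': 'Guide', 'hierarchy.lvl1': 'Install'}
```

Keys are grouped per field here rather than interleaved per level as in the diff; the resulting dictionary contents are the same.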
