
Remove the monty #16

Merged
merged 21 commits
Aug 15, 2020
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
housekeeping/create_dataset/*.json
tmdb_solr*.json
solr_home/tmdb/conf/*
venv
*.pyc
tmdb.json
@@ -7,3 +10,4 @@ bin/
.idea/
*.cfg
.DS*
__pycache__
28 changes: 0 additions & 28 deletions .testing/README.md

This file was deleted.

12 changes: 0 additions & 12 deletions Pipfile

This file was deleted.

60 changes: 27 additions & 33 deletions README.md
@@ -2,25 +2,32 @@ Solr Index for [The Movie Database](http://themoviedb.com).

This repository is part of the _Think Like a Relevancy Engineer_ training provided by [OpenSource Connections](https://opensourceconnections.com/events/training/).

The code in this repo requires [Python 3](https://www.python.org/downloads/). So if you have both Python 2 and Python 3 installed, you may need to append the version number to your `python` commands or set up an appropriate virtual environment.
## Steps to get up and running:
- Download this repo
- Install the software (using either Docker or installing manually)
- Index the TMDB movie data
- Confirm Solr has the data
- Install Postman (optional)

```
python3 indexTmdb.py
```
# Download this repo

# Clone this repo
Download the zip from https://github.com/o19s/solr-tmdb/archive/master.zip

or clone it:

```
git clone https://github.com/o19s/solr-tmdb.git
```

After you clone this repo, change into the newly created directory.
After you have this repo, change into the newly created directory.

# Run Solr index
# Install Solr

Two options exist to run Solr.
Two options exist to run Solr locally. However, if neither of them works for you, we also have a public version of this dataset deployed at http://quepid-solr.dev.o19s.com:8985/solr/ that you can use during the class.

### Docker option (recomended)
### Docker option (recommended)

If you have [Docker](https://www.docker.com/products/docker-desktop) installed and running.

@@ -51,54 +58,41 @@ Regardless of the option you choose, navigate to [http://localhost:8983/solr/](http://localhost:8983/solr/).

# Index TMDB movies

1. Download [tmdb.json](https://o19s-public-datasets.s3.amazonaws.com/tmdb.json)

```
curl -o tmdb.json https://o19s-public-datasets.s3.amazonaws.com/tmdb.json
```

2. Install the [pysolr](https://github.com/django-haystack/pysolr) library

Recomended: set up a virtual environment.
Unzip the `tmdb_solr.json.zip` file first.

```
python3 -m venv venv
unzip tmdb_solr.json.zip
```

then
Then send the unzipped `tmdb_solr.json` into Solr.

```
source venv/bin/activate
./index.sh
```

Required: install dependencies
or

```
pip3 install -r requirements.txt
curl 'http://localhost:8983/solr/tmdb/update?commit=true' --data-binary @tmdb_solr.json -H 'Content-type:application/json'
```


3. Index movies

```
python3 indexTmdb.py
```
You are indexing a *big 100 MB file*, so this can take up to five minutes!

# Confirm Solr has TMDB movies

Navigate [here](http://localhost:8983/solr/tmdb/select?q=title:lego) and confirm you get results.

If you don't see any results, trigger a [manual commit](http://localhost:8983/solr/tmdb/update?commit=true).
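If you'd rather script this check, here is a minimal sketch — not part of this repo, and the helper names are ours — that inspects the `numFound` field of a standard Solr JSON select response:

```python
import json
from urllib.request import urlopen

def num_found(solr_response):
    """Extract the hit count from a standard Solr JSON select response."""
    return solr_response["response"]["numFound"]

def has_results(url="http://localhost:8983/solr/tmdb/select?q=title:lego"):
    # Hypothetical helper: queries the running Solr instance and returns
    # True if at least one document matched. (Network call -- only works
    # while Solr is up.)
    with urlopen(url) as resp:
        return num_found(json.load(resp)) > 0

# The response shape num_found expects:
sample = {"response": {"numFound": 12, "start": 0, "docs": []}}
print(num_found(sample))  # 12
```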


# Postman

[Postman](https://www.postman.com/) is an API development tool, that helps build, run and manage API requests. The examples from the TLRE slides exist here too as a Postman Collection (`solr-TLRE-postman_collection.json`). We like using Postman becasue it makes tinkering with query parameters nicer and we think it is a useful way to follow along as you learn about tuning search relevance.
[Postman](https://www.postman.com/) is an API development tool that helps build, run, and manage API requests. The examples from the TLRE slides exist here too as a Postman Collection (`solr-postman_collection.json`). We like using Postman because it makes tinkering with query parameters nicer, and we think it is a useful way to follow along as you learn about tuning search relevance.

If you want to use Postman during the TLRE class:

1. Download [Postman](https://www.postman.com/downloads/) for your OS
2. Open Postman and Import (top-menu >> File) `solr-TLRE-postman_collection.json`
3. Define a global variable (grey eye icon in the upper-right) `solr-host` to point to your running Elasticsearch instance (default is `localhost:8983`)
2. Open Postman and Import (top-menu >> File) `solr-postman-collection.json`
3. Define a global variable (grey eye icon in the upper-right) `solr_host` to point to your running Solr instance (default is `localhost:8983`)
4. Tinker with the base URL, Params or JSON Body (optional)
5. Press 'Send' (blue rectangle button right of URL bar)

8 changes: 0 additions & 8 deletions cleanMovies.py

This file was deleted.

6 changes: 0 additions & 6 deletions deleteTmdb.py

This file was deleted.

1 change: 0 additions & 1 deletion editMovies.py

This file was deleted.

33 changes: 33 additions & 0 deletions housekeeping/create_dataset/README.md
@@ -0,0 +1,33 @@
# Generating the TMDB dataset

Periodically we update the TMDB dataset as new movies come out, or new data sources are added.

1. Get the latest TMDB dump using the https://github.com/o19s/tmdb_dump project.

2. Create the Solr schema formatted JSON file:

Pass in the TMDB extract file and the name of the resulting Solr JSON file.

```
python3 createSolrTmdbDataset.py tmdb_2020-08-10.json tmdb_solr.json
```

3. Zip and store the file in the root directory

```
zip tmdb_solr.json.zip tmdb_solr.json
cp tmdb_solr.json.zip ../../
```


![TMDB data flows](https://raw.githubusercontent.com/o19s/tmdb_dump/master/tmdb_dataflows.png)

# Understanding Data Structure

You can use `jq` to parse the JSON. Just unzip a chunk and then do:

```
cat tmdb_solr_2020-08-11.json | jq .
```

Or, to look at a specific movie, look it up by id:

```
jq '.[] | select(.id=="87381")' tmdb_solr_2020-08-11.json
```
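The same id lookup can be sketched in Python, assuming the Solr-format file is a JSON array of movie objects with string `id` fields. The `find_movie` helper is hypothetical, shown here with inline sample data standing in for the file contents:

```python
def find_movie(movies, movie_id):
    """Return the first movie whose id matches, mirroring
    jq '.[] | select(.id=="...")' over a list of movie dicts."""
    return next((m for m in movies if m.get("id") == movie_id), None)

# Inline sample standing in for the loaded tmdb_solr JSON array.
movies = [
    {"id": "87381", "title": "Example Movie"},
    {"id": "11", "title": "Star Wars"},
]
print(find_movie(movies, "87381"))
```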
27 changes: 14 additions & 13 deletions indexTmdb.py → ...g/create_dataset/createSolrTmdbDataset.py
@@ -1,25 +1,23 @@
import pysolr
from tmdbMovies import tmdbMovies
from tmdbMovies import writeTmdbMovies

def indexableMovies():
""" Generates TMDB movies, similar to how ES Bulk indexing
uses a generator to generate bulk index/update actions """
from tmdbMovies import tmdbMovies
for movieId, tmdbMovie in tmdbMovies():
print("Indexing %s" % movieId)
def indexableMovies(tmdb_source_file):
""" Generates TMDB movies in Solr JSON format """

for movieId, tmdbMovie in tmdbMovies(tmdb_source_file):
print("Formatting %s" % movieId)
try:
releaseDate = None
if 'release_date' in tmdbMovie and len(tmdbMovie['release_date']) > 0:
releaseDate = tmdbMovie['release_date'] + 'T00:00:00Z'

yield {'id': movieId,
'title': tmdbMovie['title'],
'overview': tmdbMovie['overview'],
'tagline': tmdbMovie['tagline'],
'poster_path': 'https://image.tmdb.org/t/p/w185' + tmdbMovie['poster_path'],
'cast_nomv': " ".join([castMember['name'] for castMember in tmdbMovie['cast']]),
'directors': [director['name'] for director in tmdbMovie['directors']],
'cast': [castMember['name'] for castMember in tmdbMovie['cast']],
'genres': [genre['name'] for genre in tmdbMovie['genres']],
'release_date': releaseDate,
'release_date': tmdbMovie['release_date'] + 'T00:00:00Z',
'vote_average': tmdbMovie['vote_average'] if 'vote_average' in tmdbMovie else None,
'vote_count': int(tmdbMovie['vote_count']) if 'vote_count' in tmdbMovie else None,
}
@@ -29,5 +27,8 @@ def indexableMovies():


if __name__ == "__main__":
solr = pysolr.Solr('http://localhost:8983/solr/tmdb', timeout=100)
solr.add(list(indexableMovies()), commit=True)
from sys import argv

tmdb_source_file = argv[1]
tmdb_solr_file = argv[2]
writeTmdbMovies(list(indexableMovies(tmdb_source_file=tmdb_source_file)),tmdb_solr_file)
15 changes: 15 additions & 0 deletions housekeeping/create_dataset/tmdbMovies.py
@@ -0,0 +1,15 @@
import json


def rawTmdbMovies(tmdb_source_file):
return json.load(open(tmdb_source_file))


def writeTmdbMovies(rawMoviesJson, path):
with open(path, 'w') as f:
json.dump(rawMoviesJson, f)

def tmdbMovies(tmdb_source_file):
tmdbMovies = rawTmdbMovies(tmdb_source_file)
for movieId, tmdbMovie in tmdbMovies.items():
yield (movieId, tmdbMovie)
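A quick round-trip check of these helpers — the same three functions, inlined so the sketch runs standalone against a temp file:

```python
import json
import os
import tempfile

def rawTmdbMovies(tmdb_source_file):
    with open(tmdb_source_file) as f:
        return json.load(f)

def writeTmdbMovies(rawMoviesJson, path):
    with open(path, 'w') as f:
        json.dump(rawMoviesJson, f)

def tmdbMovies(tmdb_source_file):
    # The dump is a dict keyed by movie id; stream it as (id, movie) pairs.
    for movieId, tmdbMovie in rawTmdbMovies(tmdb_source_file).items():
        yield (movieId, tmdbMovie)

# Write a tiny sample dump, then stream it back.
path = os.path.join(tempfile.mkdtemp(), "tmdb_sample.json")
writeTmdbMovies({"87381": {"title": "Example Movie"}}, path)
titles = [movie["title"] for _, movie in tmdbMovies(path)]
print(titles)  # ['Example Movie']
```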
35 changes: 35 additions & 0 deletions housekeeping/testing/README.md
@@ -0,0 +1,35 @@
# Testing TLRE examples

TLRE examples are vulnerable to changes in external tooling (Splainer, Quepid) and in Solr itself, so to make sure things are ready to go for training we've scripted these "tests" to check all of the examples.

## Splainer

These tests check that changes to Splainer don't damage TLRE examples.

Splainer links from the slides are stored in `splainer_links_solr.csv`. The script `splainer_puppet_solr.py` will visit each one of the links and report the HTTP status code back.

These tests assume you are running the local Solr TMDB setup.

Set up your virtual environment:
```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Run the regression tests:
```
python3 splainer_puppet_solr.py
```

This will record the status code in the CSV file and print the number of failed queries to console.
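The check loop can be sketched like this — a simplified stand-in for what `splainer_puppet_solr.py` is described as doing, not the actual script. The HTTP fetch is injected as a callable so the loop can be demonstrated with a stub (the real script would issue GETs, e.g. `requests.get(url).status_code`):

```python
import csv
import io

def check_links(csv_text, fetch):
    """Read links from CSV text (one URL in the first column per row),
    fetch each one, and return (rows_with_status, failure_count)."""
    rows, failures = [], 0
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        url = row[0]
        status = fetch(url)  # fetch returns an HTTP status code
        rows.append([url, str(status)])
        if status != 200:
            failures += 1
    return rows, failures

# Stubbed fetch for demonstration, so no network is needed.
stub = {"http://ok.example": 200, "http://bad.example": 500}
rows, failed = check_links("http://ok.example\nhttp://bad.example\n", stub.get)
print(f"{failed} failed queries")  # 1 failed queries
```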

## Newman

These tests check that version changes in Solr don't damage TLRE examples.

[Newman](https://github.com/postmanlabs/newman) is the command line tool for managing Postman collections. All examples from the class, beyond just the links to Splainer, are included in the collection `../solr-postman-collection.json`.

```
newman run --global-var "solr_host=localhost:8983" ../../solr-postman-collection.json
```
File renamed without changes.
3 changes: 3 additions & 0 deletions index.sh
@@ -0,0 +1,3 @@
#!/bin/bash

curl 'http://localhost:8983/solr/tmdb/update?commit=true' --data-binary @tmdb_solr.json -H 'Content-type:application/json'
25 changes: 0 additions & 25 deletions indexEx1.py

This file was deleted.

51 changes: 0 additions & 51 deletions ltr/README.md

This file was deleted.
