
Remove the monty #16

Merged
merged 21 commits
Aug 15, 2020
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
housekeeping/create_dataset/*.json
tmdb_solr*.json
solr_home/tmdb/conf/*
venv
*.pyc
tmdb.json
@@ -7,3 +10,4 @@ bin/
.idea/
*.cfg
.DS*
__pycache__
28 changes: 0 additions & 28 deletions .testing/README.md

This file was deleted.

12 changes: 0 additions & 12 deletions Pipfile

This file was deleted.

60 changes: 27 additions & 33 deletions README.md
@@ -2,25 +2,32 @@ Solr Index for [The Movie Database](http://themoviedb.com).

This repository is part of the _Think Like a Relevancy Engineer_ training provided by [OpenSource Connections](https://opensourceconnections.com/events/training/).

The code in this repo requires [Python 3](https://www.python.org/downloads/). So if you have both Python 2 and Python 3 installed, you may need to append the version number to your `python` commands or set up an appropriate virtual environment.
## Steps to get up and running:
- Download this repo
- Install the software (using either Docker or installing manually)
- Index the TMDB movie data
- Confirm Solr has the data
- Install Postman (optional)

```
python3 indexTmdb.py
```
# Download this repo

# Clone this repo
Download the zip from https://github.com/o19s/solr-tmdb/archive/master.zip

or clone it:

```
git clone https://github.com/o19s/solr-tmdb.git
```

After you clone this repo, change into the newly created directory.
After you have this repo, change into the newly created directory.

# Run Solr index
# Install Solr

Two options exist to run Solr.
Two options exist to run Solr locally. However, if neither of them works for you, we also have a public version of this dataset deployed at http://quepid-solr.dev.o19s.com:8985/solr/ that you can use during the class.

### Docker option (recomended)
### Docker option (recommended)

If you have [Docker](https://www.docker.com/products/docker-desktop) installed and running.

@@ -51,54 +58,41 @@ Regardless of the option you choose, navigate to [http://localhost:8983/solr/](http://localhost:8983/solr/).

# Index TMDB movies

1. Download [tmdb.json](https://o19s-public-datasets.s3.amazonaws.com/tmdb.json)

```
curl -o tmdb.json https://o19s-public-datasets.s3.amazonaws.com/tmdb.json
```

2. Install the [pysolr](https://github.com/django-haystack/pysolr) library

Recomended: set up a virtual environment.
Unzip the `tmdb_solr.json.zip` file first.

```
python3 -m venv venv
unzip tmdb_solr.json.zip
```

then
Then send the unzipped `tmdb_solr.json` into Solr.

```
source venv/bin/activate
./index.sh
```

Required: install dependencies
or

```
pip3 install -r requirements.txt
curl 'http://localhost:8983/solr/tmdb/update?commit=true' --data-binary @tmdb_solr.json -H 'Content-type:application/json'
```


3. Index movies

```
python3 indexTmdb.py
```
You are indexing a *big 100 MB file*, so this can take up to five minutes!

# Confirm Solr has TMDB movies

Navigate [here](http://localhost:8983/solr/tmdb/select?q=title:lego) and confirm you get results.

If you don't see any results, trigger a [manual commit](http://localhost:8983/solr/tmdb/update?commit=true).
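If you'd rather script this check, here is a minimal sketch — not part of this repo, and the helper names are ours — that inspects the `numFound` field of a standard Solr JSON select response:

```python
import json
from urllib.request import urlopen

def num_found(solr_response):
    """Extract the hit count from a standard Solr JSON select response."""
    return solr_response["response"]["numFound"]

def has_results(url="http://localhost:8983/solr/tmdb/select?q=title:lego"):
    # Hypothetical helper: queries the running Solr instance and returns
    # True if at least one document matched. (Network call -- only works
    # while Solr is up.)
    with urlopen(url) as resp:
        return num_found(json.load(resp)) > 0

# The response shape num_found expects:
sample = {"response": {"numFound": 12, "start": 0, "docs": []}}
print(num_found(sample))  # 12
```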


# Postman

[Postman](https://www.postman.com/) is an API development tool, that helps build, run and manage API requests. The examples from the TLRE slides exist here too as a Postman Collection (`solr-TLRE-postman_collection.json`). We like using Postman becasue it makes tinkering with query parameters nicer and we think it is a useful way to follow along as you learn about tuning search relevance.
[Postman](https://www.postman.com/) is an API development tool that helps build, run, and manage API requests. The examples from the TLRE slides exist here too as a Postman Collection (`solr-postman_collection.json`). We like using Postman because it makes tinkering with query parameters nicer, and we think it is a useful way to follow along as you learn about tuning search relevance.

If you want to use Postman during the TLRE class:

1. Download [Postman](https://www.postman.com/downloads/) for your OS
2. Open Postman and Import (top-menu >> File) `solr-TLRE-postman_collection.json`
3. Define a global variable (grey eye icon in the upper-right) `solr-host` to point to your running Elasticsearch instance (default is `localhost:8983`)
2. Open Postman and Import (top-menu >> File) `solr-postman-collection.json`
3. Define a global variable (grey eye icon in the upper-right) `solr_host` to point to your running Solr instance (default is `localhost:8983`)
4. Tinker with the base URL, Params or JSON Body (optional)
5. Press 'Send' (blue rectangle button right of URL bar)

8 changes: 0 additions & 8 deletions cleanMovies.py

This file was deleted.

6 changes: 0 additions & 6 deletions deleteTmdb.py

This file was deleted.

1 change: 0 additions & 1 deletion editMovies.py

This file was deleted.

33 changes: 33 additions & 0 deletions housekeeping/create_dataset/README.md
@@ -0,0 +1,33 @@
# Generating the TMDB dataset

Periodically we update the TMDB dataset as new movies come out, or new data sources are added.

1. Get the latest TMDB dump using the https://github.com/o19s/tmdb_dump project.

2. Create the Solr schema formatted JSON file:

Pass in the TMDB extract file and the name of the resulting Solr JSON file.

```
python3 createSolrTmdbDataset.py tmdb_2020-08-10.json tmdb_solr.json
```

3. Zip and store the file in the root directory

```
zip tmdb_solr.json.zip tmdb_solr.json
cp tmdb_solr.json.zip ../../
```


![TMDB data flows](https://raw.githubusercontent.com/o19s/tmdb_dump/master/tmdb_dataflows.png)

# Understanding Data Structure

You can use `jq` to parse the JSON. Just unzip a chunk and then do:

```
cat tmdb_solr_2020-08-11.json | jq .
```

Or, to look at a specific movie, look it up by id:

```
jq '.[] | select(.id=="87381")' tmdb_solr_2020-08-11.json
```
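The same id lookup can be sketched in Python, assuming the Solr-format file is a JSON array of movie objects with string `id` fields. The `find_movie` helper is hypothetical, shown here with inline sample data standing in for the file contents:

```python
def find_movie(movies, movie_id):
    """Return the first movie whose id matches, mirroring
    jq '.[] | select(.id=="...")' over a list of movie dicts."""
    return next((m for m in movies if m.get("id") == movie_id), None)

# Inline sample standing in for the loaded tmdb_solr JSON array.
movies = [
    {"id": "87381", "title": "Example Movie"},
    {"id": "11", "title": "Star Wars"},
]
print(find_movie(movies, "87381"))
```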
27 changes: 14 additions & 13 deletions indexTmdb.py → ...g/create_dataset/createSolrTmdbDataset.py
@@ -1,25 +1,23 @@
import pysolr
from tmdbMovies import tmdbMovies
from tmdbMovies import writeTmdbMovies

def indexableMovies():
""" Generates TMDB movies, similar to how ES Bulk indexing
uses a generator to generate bulk index/update actions """
from tmdbMovies import tmdbMovies
for movieId, tmdbMovie in tmdbMovies():
print("Indexing %s" % movieId)
def indexableMovies(tmdb_source_file):
""" Generates TMDB movies in Solr JSON format """

for movieId, tmdbMovie in tmdbMovies(tmdb_source_file):
print("Formatting %s" % movieId)
try:
releaseDate = None
if 'release_date' in tmdbMovie and len(tmdbMovie['release_date']) > 0:
releaseDate = tmdbMovie['release_date'] + 'T00:00:00Z'

yield {'id': movieId,
'title': tmdbMovie['title'],
'overview': tmdbMovie['overview'],
'tagline': tmdbMovie['tagline'],
'poster_path': 'https://image.tmdb.org/t/p/w185' + tmdbMovie['poster_path'],
'cast_nomv': " ".join([castMember['name'] for castMember in tmdbMovie['cast']]),
'directors': [director['name'] for director in tmdbMovie['directors']],
'cast': [castMember['name'] for castMember in tmdbMovie['cast']],
'genres': [genre['name'] for genre in tmdbMovie['genres']],
'release_date': releaseDate,
'release_date': tmdbMovie['release_date'] + 'T00:00:00Z',
'vote_average': tmdbMovie['vote_average'] if 'vote_average' in tmdbMovie else None,
'vote_count': int(tmdbMovie['vote_count']) if 'vote_count' in tmdbMovie else None,
}
@@ -29,5 +27,8 @@ def indexableMovies():


if __name__ == "__main__":
solr = pysolr.Solr('http://localhost:8983/solr/tmdb', timeout=100)
solr.add(list(indexableMovies()), commit=True)
from sys import argv

tmdb_source_file = argv[1]
tmdb_solr_file = argv[2]
writeTmdbMovies(list(indexableMovies(tmdb_source_file=tmdb_source_file)),tmdb_solr_file)
15 changes: 15 additions & 0 deletions housekeeping/create_dataset/tmdbMovies.py
@@ -0,0 +1,15 @@
import json


def rawTmdbMovies(tmdb_source_file):
return json.load(open(tmdb_source_file))


def writeTmdbMovies(rawMoviesJson, path):
with open(path, 'w') as f:
json.dump(rawMoviesJson, f)

def tmdbMovies(tmdb_source_file):
tmdbMovies = rawTmdbMovies(tmdb_source_file)
for movieId, tmdbMovie in tmdbMovies.items():
yield (movieId, tmdbMovie)
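A quick round-trip check of these helpers — the same three functions, inlined so the sketch runs standalone against a temp file:

```python
import json
import os
import tempfile

def rawTmdbMovies(tmdb_source_file):
    with open(tmdb_source_file) as f:
        return json.load(f)

def writeTmdbMovies(rawMoviesJson, path):
    with open(path, 'w') as f:
        json.dump(rawMoviesJson, f)

def tmdbMovies(tmdb_source_file):
    # The dump is a dict keyed by movie id; stream it as (id, movie) pairs.
    for movieId, tmdbMovie in rawTmdbMovies(tmdb_source_file).items():
        yield (movieId, tmdbMovie)

# Write a tiny sample dump, then stream it back.
path = os.path.join(tempfile.mkdtemp(), "tmdb_sample.json")
writeTmdbMovies({"87381": {"title": "Example Movie"}}, path)
titles = [movie["title"] for _, movie in tmdbMovies(path)]
print(titles)  # ['Example Movie']
```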
35 changes: 35 additions & 0 deletions housekeeping/testing/README.md
@@ -0,0 +1,35 @@
# Testing TLRE examples

TLRE examples are vulnerable to changes in external tooling (Splainer, Quepid) and in Solr itself, so to make sure things are ready to go for training we've scripted these "tests" to check all of the examples.

## Splainer

These tests check that changes to Splainer don't damage TLRE examples.

Splainer links from the slides are stored in `splainer_links_solr.csv`. The script `splainer_puppet_solr.py` will visit each one of the links and report the HTTP status code back.

These tests assume you are running the local Solr TMDB setup.

Set up your virtual environment:
```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Run the regression tests:
```
python3 splainer_puppet_solr.py
```

This will record the status code in the CSV file and print the number of failed queries to console.
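The check loop can be sketched like this — a simplified stand-in for what `splainer_puppet_solr.py` is described as doing, not the actual script. The HTTP fetch is injected as a callable so the loop can be demonstrated with a stub (the real script would issue GETs, e.g. `requests.get(url).status_code`):

```python
import csv
import io

def check_links(csv_text, fetch):
    """Read links from CSV text (one URL in the first column per row),
    fetch each one, and return (rows_with_status, failure_count)."""
    rows, failures = [], 0
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        url = row[0]
        status = fetch(url)  # fetch returns an HTTP status code
        rows.append([url, str(status)])
        if status != 200:
            failures += 1
    return rows, failures

# Stubbed fetch for demonstration, so no network is needed.
stub = {"http://ok.example": 200, "http://bad.example": 500}
rows, failed = check_links("http://ok.example\nhttp://bad.example\n", stub.get)
print(f"{failed} failed queries")  # 1 failed queries
```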

## Newman

These tests check that version changes in Solr don't damage TLRE examples.

[Newman](https://github.com/postmanlabs/newman) is the command line tool for managing Postman collections. All examples from the class, beyond just the links to Splainer, are included in the collection `../solr-postman-collection.json`.

```
newman run --global-var "solr_host=localhost:8983" ../../solr-postman-collection.json
```
File renamed without changes.
3 changes: 3 additions & 0 deletions index.sh
@@ -0,0 +1,3 @@
#!/bin/bash

curl 'http://localhost:8983/solr/tmdb/update?commit=true' --data-binary @tmdb_solr.json -H 'Content-type:application/json'
25 changes: 0 additions & 25 deletions indexEx1.py

This file was deleted.

51 changes: 0 additions & 51 deletions ltr/README.md

This file was deleted.
