This project demonstrates how to create an Information Retrieval System (IRS) using Xapian, a search engine library. The project indexes and queries the IMDB Top 250 movies dataset. Here's a step-by-step explanation of how it works:
The script `fetch_data.py` fetches movie data from the OMDB API and saves it locally:
- `fetch_movies()`: Checks whether the data is available locally in `data.json`. If not, it fetches the data from the remote API using `fetch_remote()` and saves it using `save_to_cache()`.
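The cache-or-fetch behaviour described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names come from the text, but the signatures are assumptions. In particular, `fetch_remote` is passed in as a callable here so the sketch runs without an API key.

```python
import json
from pathlib import Path
from typing import Callable

CACHE_PATH = Path("data.json")  # cache file named in the text


def save_to_cache(movies: list, path: Path = CACHE_PATH) -> None:
    """Write the movie list to the local JSON cache."""
    path.write_text(json.dumps(movies, indent=2), encoding="utf-8")


def fetch_movies(fetch_remote: Callable[[], list], path: Path = CACHE_PATH) -> list:
    """Return cached movies if present; otherwise fetch remotely and cache."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    movies = fetch_remote()  # only reached on a cache miss
    save_to_cache(movies, path)
    return movies
```

With this shape, the remote API is contacted at most once per cache file; deleting `data.json` forces a fresh fetch.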
The script `index.py` indexes the fetched movie data into a Xapian database:
- `main()` in `index.py`:
  - Fetches movies using `fetch_movies()`.
  - Creates a Xapian database in the `./xdb/` directory.
  - For each movie, creates a Xapian document and indexes various fields (e.g., title, plot, actors, directors, year, and rated).
  - Adds values for `year` and `rated` to support range queries and faceting.
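The indexing steps above might look roughly like the sketch below. It is an assumption-laden illustration rather than the contents of `index.py`: the dictionary keys (`Title`, `Plot`, `Year`, `Rated`, `imdbID`), the value slot numbers, and the helper names are all hypothetical. Only the `xapian` calls themselves are part of the library's Python bindings.

```python
import json


def parse_year(raw: str) -> int:
    """Extract the leading four-digit year from a string such as '1994'."""
    return int(raw.strip()[:4])


def index_movies(movies: list, db_path: str = "./xdb/") -> None:
    """Index movie dicts into a Xapian database (requires the xapian bindings)."""
    import xapian  # imported here so parse_year works without xapian installed

    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))

    for movie in movies:
        doc = xapian.Document()
        termgen.set_document(doc)

        # Prefixed terms allow title-only searches ("S" is the usual title prefix).
        termgen.index_text(movie["Title"], 1, "S")
        # Unprefixed terms feed general keyword search.
        termgen.index_text(movie["Title"])
        termgen.increase_termpos()
        termgen.index_text(movie.get("Plot", ""))

        # Value slots (numbers are assumptions) back range queries and facets.
        doc.add_value(0, xapian.sortable_serialise(parse_year(movie["Year"])))
        doc.add_value(1, movie.get("Rated", ""))

        doc.set_data(json.dumps(movie))  # stored for display at query time

        # A unique ID term lets re-indexing replace documents instead of duplicating them.
        id_term = "Q" + movie["imdbID"]
        doc.add_boolean_term(id_term)
        db.replace_document(id_term, doc)
```

Storing the year with `sortable_serialise` keeps numeric ordering intact when values are compared as strings, which is what range queries rely on.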
The script `query.py` allows querying the indexed data:
- `main(args)` in `query.py`:
  - Parses command-line arguments to get query parameters (e.g., `keyword`, `title`, `rated`, `year_range`, `show_facets`).
  - Sets up a Xapian query parser and constructs a query based on the provided parameters.
  - Performs the query on the Xapian database and prints the results.
  - If `--show_facets` is enabled, it also prints facet counts for the `rated` field.
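As an illustration of the argument-parsing step, here is a small `argparse` sketch using the parameter names listed above. The help strings and the `parse_year_range` helper are assumptions made for this example, and the actual `query.py` may define its options differently.

```python
import argparse


def parse_year_range(text: str) -> tuple:
    """Split a 'START..END' string such as '2000..2010' into two ints."""
    start, end = text.split("..")
    return int(start), int(end)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the query options named in the text."""
    parser = argparse.ArgumentParser(description="Query the movie index")
    parser.add_argument("--keyword", help="free-text search terms")
    parser.add_argument("--title", help="terms to match in the title only")
    parser.add_argument("--rated", help="exact rating, e.g. PG-13")
    parser.add_argument("--year_range", type=parse_year_range,
                        help="inclusive range, e.g. 2000..2010")
    parser.add_argument("--show_facets", action="store_true",
                        help="print facet counts for the rated field")
    return parser
```

Xapian's `QueryParser` can also interpret the `START..END` syntax directly once a range processor is registered for the year slot, so the real script may not need a helper like `parse_year_range` at all.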
Run the following command to fetch and index the movie data:
python index.py
Use the `query.py` script with various options to query the indexed data. Examples:
python query.py --keyword 'love'
python query.py --title 'king'
python query.py --keyword 'love' --show_facets
python index.py
python query.py --keyword 'love' --show_facets
This will search for movies containing the keyword "love" and display facet counts for the `rated` field.
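To make "facet counts" concrete: a facet count is the number of matching documents per distinct value of a field. The plain-Python sketch below shows the idea on a list of result dictionaries. It is conceptual only. The `Rated` key and the `facet_counts` name are assumptions, and inside Xapian the same tallying is typically done by attaching a `xapian.ValueCountMatchSpy` to the enquiry.

```python
from collections import Counter


def facet_counts(results: list, field: str = "Rated") -> dict:
    """Count how many matching movies carry each value of `field`."""
    return dict(Counter(movie.get(field, "N/A") for movie in results))


# Hypothetical matches for a query, used only to show the shape of the output.
matches = [
    {"Title": "Movie A", "Rated": "PG"},
    {"Title": "Movie B", "Rated": "R"},
    {"Title": "Movie C", "Rated": "PG"},
]
for rating, count in facet_counts(matches).items():
    print(f"{rating}: {count}")
```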
Clone the project repository to your local machine:
git clone <repository-url>
cd <repository-directory>
Ensure you have Python and the required libraries installed. Use `pip` to install the necessary packages:
pip install xapian
pip install requests
Note: the Xapian Python bindings are often distributed through system packages rather than PyPI. If the first command fails, try your package manager instead (for example, the `python3-xapian` package on Debian or Ubuntu).
Run the `fetch_data.py` script to fetch movie data from the OMDB API:
python fetch_data.py
Run the `index.py` script to index the fetched movie data into a Xapian database:
python index.py
Use the `query.py` script to query the indexed data. You can test various query parameters:
python query.py --keyword 'love'
python query.py --title 'king'
python query.py --keyword 'love' --show_facets
- Search for movies by keyword in the plot or description:
  `python query.py --keyword 'adventure'`
- Search for movies with a specific title:
  `python query.py --title 'inception'`
- Search for movies within a specific year range:
  `python query.py --year_range '2000..2010'`
- Search for movies with a specific rating:
  `python query.py --rated 'PG-13'`
- Combine multiple parameters:
  `python query.py --keyword 'hero' --rated 'PG' --year_range '1990..2000'`
- Show facets (e.g., count of movies per rating):
  `python query.py --keyword 'space' --show_facets`
- Use Boolean operators in keyword searches:
  `python query.py --keyword 'action AND comedy'`
  `python query.py --keyword 'drama NOT romance'`
- Search for movies directed by a specific director:
  `python query.py --director 'Christopher Nolan'`
- Search for movies featuring a specific actor:
  `python query.py --actor 'Leonardo DiCaprio'`
- Combine title and keyword searches:
  `python query.py --title 'star' --keyword 'war'`
- Verify Results:
  - Ensure the movies returned match the search criteria.
- Check Facets:
  - If you used `--show_facets`, ensure the facet counts are displayed correctly.
- Experiment:
  - Try different combinations of parameters to understand how they affect the search results.
- Location:
  - The Xapian database is stored in the `./xdb/` directory by default.
  - You can customize the location by modifying the `index.py` and `query.py` scripts.
- Rebuilding the Index:
  - If you make changes to the dataset or indexing logic, you need to rebuild the index:
    `python index.py`
- Clearing the Database:
  - To clear the existing database, delete the `./xdb/` directory:
    `rm -rf ./xdb/`
- Adding New Fields:
  - Modify the `index.py` script to index additional fields (e.g., `genre`, `language`).
  - Use `add_term` to include new searchable fields.
- Extending Query Functionality:
  - Update the `query.py` script to support new parameters for advanced queries.
  - Example: Add support for searching by `genre` or `language`.
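A possible shape for the "Adding New Fields" step is sketched below. It assumes that a field such as `genre` arrives as a comma-separated string (for example `"Crime, Drama"`) and that a user-defined prefix such as `XG` is chosen for it. Both are assumptions for illustration, as is the helper name `add_list_field`. The only library call used is `add_term`, which is a method of `xapian.Document`.

```python
def add_list_field(doc, raw: str, prefix: str) -> list:
    """Add one prefixed term per comma-separated value and return the terms.

    `doc` is expected to be a xapian.Document (anything with add_term works).
    """
    terms = []
    for value in raw.split(","):
        value = value.strip().lower()
        if value:
            term = prefix + value  # e.g. "XG" + "drama" -> "XGdrama"
            doc.add_term(term)
            terms.append(term)
    return terms
```

Inside the indexing loop this could be called as `add_list_field(doc, movie.get("Genre", ""), "XG")`. On the query side, the matching step would be to map a user-facing field name to the same prefix, for instance with `QueryParser.add_boolean_prefix("genre", "XG")`, so that the new terms can be used as filters.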
- Common Issues:
  - No Results Returned: Ensure the query matches the indexed data.
  - Database Not Found: Verify the `./xdb/` directory exists and contains indexed data.
  - Facets Not Displayed: Ensure you indexed the field used for faceting (e.g., `rated`).
- Logging:
  - Add logging to the scripts for better debugging:
    `import logging`
    `logging.basicConfig(level=logging.DEBUG)`
- The OMDB API may have rate limits depending on the plan you're using. To avoid issues:
  - Cache data locally in `data.json`.
  - Avoid unnecessary re-fetching of data.
- Distributed Indexing:
  - Implement distributed indexing to handle larger datasets.
  - Use tools like Apache Kafka for data ingestion.
- Sharding and Replication:
  - Partition the Xapian database for scalability.
- Synonym Support:
  - Add query expansion to include synonyms for keywords.
- Autocomplete:
  - Implement an autocomplete feature for user-friendly searches.
- Spell Correction:
  - Use libraries like `SymSpell` or integrate with third-party APIs to suggest corrections for misspelled queries.
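The three features above can be prototyped with the standard library alone before committing to a dedicated tool. The sketch below is one hypothetical approach, not part of the project: all function names are invented for illustration, and `difflib` is swapped in for `SymSpell` purely to keep the example dependency-free. Xapian also ships built-in synonym and spelling support, which may be a better fit in practice.

```python
import bisect
import difflib


def expand_with_synonyms(query: str, synonyms: dict) -> str:
    """Rewrite each word that has known synonyms as an OR group."""
    parts = []
    for word in query.split():
        alternatives = [word] + synonyms.get(word.lower(), [])
        if len(alternatives) > 1:
            parts.append("(" + " OR ".join(alternatives) + ")")
        else:
            parts.append(word)
    return " ".join(parts)


def autocomplete(prefix: str, sorted_titles: list, limit: int = 5) -> list:
    """Return up to `limit` titles starting with `prefix`.

    `sorted_titles` must be lowercase and sorted so binary search applies.
    """
    prefix = prefix.lower()
    start = bisect.bisect_left(sorted_titles, prefix)
    matches = []
    for title in sorted_titles[start:start + limit]:
        if not title.startswith(prefix):
            break
        matches.append(title)
    return matches


def suggest_spelling(word: str, vocabulary: list) -> str:
    """Return the closest vocabulary word, or the input if nothing is close."""
    close = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
    return close[0] if close else word
```

The expanded string from `expand_with_synonyms` uses the same `OR` operator shown in the Boolean query examples, so it could be passed as a `--keyword` value.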
- Build a web-based or GUI interface to make the system accessible to non-technical users.
- Use frameworks like Flask or Django to create a front-end for querying and displaying results.
- Search Logs:
  - Capture search logs to analyze query patterns and improve relevance.
- Relevance Feedback:
  - Implement mechanisms to learn from user interactions and refine result rankings.
This project demonstrates the fundamentals of building an Information Retrieval System using Xapian. By following the provided steps, you can:
- Fetch, index, and query data from an external API.
- Implement a scalable and customizable search system.
- Extend the system with advanced features like faceted search, range queries, and Boolean operators.
Feel free to enhance the system further to suit your specific needs or datasets. Happy coding!