This software project serves as supplementary material to the publication
Wójcik, F., & Górnik, M. (2020). Improvement of e-commerce recommendation systems with deep hybrid collaborative filtering with content: A case study. Econometrics. Ekonometria. Advances in Applied Data Analytics, 24(3), 37-50.
Structured as a full software project rather than just a series of notebooks, it highlights the importance of properly organizing machine learning work. The use of PyTorch and PyTorch Lightning allows for easy experimentation and scaling, making them an ideal choice for building complex deep-learning models. Automating the data processing pipelines simplifies the entire process and ensures that the model can be easily updated with new data.
By providing a working implementation of the models discussed in the paper, along with reproducible experiment pipelines, this project offers a valuable resource for researchers and practitioners looking to connect theoretical approaches with practical implementation.
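To give a flavour of how PyTorch Lightning structures such models, here is a minimal, generic sketch of an embedding-based recommender. It is not the paper's actual architecture; the class name, layer sizes, and hyperparameters are illustrative assumptions only.

```python
import torch
from torch import nn
import pytorch_lightning as pl


class SimpleRecommender(pl.LightningModule):
    """Generic embedding-based rating model - an illustrative sketch, not the paper's architecture."""

    def __init__(self, n_users: int, n_items: int, emb_dim: int = 32):
        super().__init__()
        # Learnable embeddings for users and items, indexed by consecutive integer IDs.
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)
        # A small MLP head mapping the concatenated embeddings to a rating prediction.
        self.head = nn.Sequential(nn.Linear(2 * emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, user_idx: torch.Tensor, item_idx: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.user_emb(user_idx), self.item_emb(item_idx)], dim=-1)
        return self.head(x).squeeze(-1)

    def training_step(self, batch, batch_idx):
        user_idx, item_idx, rating = batch
        loss = nn.functional.mse_loss(self(user_idx, item_idx), rating)
        self.log("train_mse", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

Keeping the model, training step, and optimizer configuration in one `LightningModule` is what makes swapping architectures and rerunning experiments straightforward.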
The codebase is organized as a Kedro project, generated using Kedro 0.18.4. Take a look at the Kedro documentation to get started.
The author's official profiles can be found below:
The performance of the aforementioned algorithms was evaluated on a real-world dataset of customer reviews: the “Amazon 2018 Reviews Dataset”, an updated version of the earlier “Amazon Reviews Dataset” (He & McAuley, 2016; McAuley, Targett, Shi, & Van Den Hengel, 2015).
The official website of the dataset can be found here.
In order to properly process the data using pipelines:
- Open the official dataset website.
- Go to the category of research/small datasets and choose "All Beauty" data (direct link)
- Accept any terms and conditions
- Download the data
- Extract the `.json` file and save it to the folder `data/01_raw/`
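The downloaded file is typically gzipped; a minimal extraction sketch in Python is shown below. The archive name `All_Beauty.json.gz` is an assumption - adjust it to the file you actually downloaded.

```python
import gzip
import shutil
from pathlib import Path

# Assumed archive name - change it to match the file you downloaded.
archive_path = Path("All_Beauty.json.gz")
target_path = Path("data/01_raw/All_Beauty.json")

# Make sure the raw data folder exists.
target_path.parent.mkdir(parents=True, exist_ok=True)

# Decompress the gzipped JSON-lines file into the raw data folder.
with gzip.open(archive_path, "rb") as src, open(target_path, "wb") as dst:
    shutil.copyfileobj(src, dst)
```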
To cite the original paper and research related to the Amazon Dataset:
Ni, J., Li, J., & McAuley, J. (2019). Justifying recommendations using distantly-labeled reviews and fine-grained aspects. Empirical Methods in Natural Language Processing (EMNLP).
The experimental results are summarized in the notebook `notebooks/experiment_summary.ipynb`. The notebook is also available as a static HTML file: `notebooks/experiment_summary.html`.
The experimental procedure was the same for all models:
- The data was split into train, validation and test sets.
- Each model type was cross-validated 10 times, and validation metrics were saved for further analysis.
- Each model was then trained on the full train dataset and evaluated on the test set.
- Validation results of the models were compared using:
  - the non-parametric Kruskal-Wallis test, to assess the overall difference in model performance;
  - pairwise post-hoc tests between all models, with the Bonferroni correction for multiple comparisons.
The post-hoc test results show that the Hybrid Recommender model is significantly better than the other two models in all cases with large effect sizes on all metrics (MAPE/MSE/MAE).
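For illustration, below is a minimal sketch of such a comparison with SciPy. The per-fold scores and model names are purely hypothetical placeholders, and the pairwise Mann-Whitney U test is used here as just one possible post-hoc choice; the actual analysis lives in the summary notebook.

```python
from itertools import combinations

from scipy import stats

# Hypothetical per-fold validation MAE scores for each model (10 CV folds each) -
# placeholder numbers for illustration only.
scores = {
    "hybrid_recommender": [0.81, 0.79, 0.80, 0.82, 0.78, 0.80, 0.81, 0.79, 0.80, 0.82],
    "baseline_mf":        [0.95, 0.97, 0.96, 0.94, 0.98, 0.96, 0.95, 0.97, 0.96, 0.94],
    "baseline_mlp":       [0.90, 0.92, 0.91, 0.93, 0.89, 0.91, 0.90, 0.92, 0.91, 0.93],
}

# Overall difference between models: non-parametric Kruskal-Wallis test.
h_stat, p_overall = stats.kruskal(*scores.values())
print(f"Kruskal-Wallis H={h_stat:.3f}, p={p_overall:.4f}")

# Pairwise post-hoc tests (Mann-Whitney U here) with the Bonferroni correction.
pairs = list(combinations(scores, 2))
for model_a, model_b in pairs:
    _, p = stats.mannwhitneyu(scores[model_a], scores[model_b], alternative="two-sided")
    p_adjusted = min(p * len(pairs), 1.0)  # Bonferroni: scale by the number of comparisons
    print(f"{model_a} vs {model_b}: adjusted p={p_adjusted:.4f}")
```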
The pipeline is divided into 3 main parts:
- Data preparation:
  - extraction of the data from the `.json` file,
  - processing of categorical data - extraction from the text and encoding,
  - encoding of user and item IDs as consecutive integers starting from 0 (see the sketch after this list).
- Experiment preparation - repeatable splitting of the data into train, validation, and test sets.
- Experiment execution - training and evaluation of the models, and saving the results.
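As an illustration of the ID-encoding and repeatable-split steps, here is a minimal sketch with pandas and scikit-learn. The column names, values, and split ratios are hypothetical and do not reflect the pipeline's actual code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative interactions frame; column names and values are hypothetical.
interactions = pd.DataFrame({
    "user_id": ["A1", "B7", "A1", "C3", "B7", "C3"],
    "item_id": ["X100", "X100", "X200", "X300", "X200", "X100"],
    "rating":  [5, 3, 4, 2, 5, 4],
})

# Encode raw user and item identifiers as consecutive integers starting from 0,
# which is the index format expected by embedding layers.
interactions["user_idx"], user_index = pd.factorize(interactions["user_id"])
interactions["item_idx"], item_index = pd.factorize(interactions["item_id"])

# Repeatable split: fixing random_state makes the train/validation/test split reproducible.
train_val, test = train_test_split(interactions, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

print(f"train={len(train)}, val={len(val)}, test={len(test)}, "
      f"users={len(user_index)}, items={len(item_index)}")
```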
The Kedro-Viz diagram below presents the pipeline step by step:
The project is still in development and new experiments will be a part of future research publications.
Future work on the Deep Hybrid Recommender will focus primarily on including graph neural networks (GNNs) as recommendation models, as their usability and applicability to the e-commerce domain have already been demonstrated.
GNNs for heterogeneous domains (graphs with varied node types) are a relatively new and extensively researched topic with promising results.
Gao, C., Zheng, Y., Li, N., Li, Y., Qin, Y., Piao, J., ... & Li, Y. (2021). Graph neural networks for recommender systems: Challenges, methods, and directions. arXiv preprint arXiv:2109.12843.
Wu, S., Sun, F., Zhang, W., Xie, X., & Cui, B. (2022). Graph neural networks in recommender systems: A survey. ACM Computing Surveys, 55(5), 1-37.
Declare any dependencies in `src/requirements.txt` for `pip` installation and in `src/environment.yml` for `conda` installation.
To install them, run:
pip install -r src/requirements.txt
You can run your Kedro project with:
kedro run
To run a specific pipeline node, execute:
kedro run --nodes NODE
You can visualize your Kedro pipeline with:
kedro viz
from the main project directory. This will open a browser window with the interactive visualization of your pipeline.
You can interact with the pipeline itself or open the experiment tracking dashboard to see details of the runs.
Note: Using `kedro jupyter` or `kedro ipython` to run your notebook provides these variables in scope: `context`, `catalog`, and `startup_error`.

Jupyter, JupyterLab, and IPython are already included in the project requirements by default, so once you have run `pip install -r src/requirements.txt` you will not need to take any extra steps before you use them.
If Jupyter is not already available in your environment, install it with:
pip install jupyter
After installing Jupyter, you can start a local notebook server:
kedro jupyter notebook
Similarly, if JupyterLab is not already available, install it with:
pip install jupyterlab
You can also start JupyterLab:
kedro jupyter lab
McAuley, J., Targett, C., Shi, Q., & Van Den Hengel, A. (2015). Image-based recommendations on styles and substitutes. SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval.
He, R., & McAuley, J. (2016). Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In 25th International World Wide Web Conference, WWW 2016.