Alix Zhou, Paramveer Singh, Susannah Sun, Zoe Ren
This analysis investigates the relationship between physicochemical properties and wine quality using the Wine Quality dataset from the UCI Machine Learning Repository, which contains data for both red and white wine. Through exploratory data analysis, we examined 11 physicochemical features and their correlations with wine quality scores. Our analysis revealed that higher quality wines typically have higher alcohol content and lower volatile acidity, with white wines generally receiving higher quality scores than red wines. Most features showed right-skewed distributions with notable outliers, particularly in the sulfur dioxide and residual sugar measurements. The quality scores themselves followed an approximately normal distribution centered on scores of 5 and 6.
We implemented a logistic regression model with standardized features and one-hot encoded categorical variables, using randomized search cross-validation to optimize the regularization parameter. The final model achieved an accuracy of 54% on the test set. While this performance suggests room for improvement, the analysis provides valuable insights for future research directions.
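The modelling approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, SciPy, NumPy, and pandas are available, and it uses a small synthetic data frame in place of the real Wine Quality data. The column names, the search range for `C`, the number of search iterations, and the random seeds are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the real Wine Quality dataset).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame(
    {
        "alcohol": rng.normal(10.5, 1.2, n),
        "volatile_acidity": rng.normal(0.35, 0.15, n),
        "color": rng.choice(["red", "white"], n),
    }
)
# Hypothetical target: three quality classes (5, 6, 7) loosely tied to alcohol.
noisy_alcohol = X["alcohol"].to_numpy() + rng.normal(0, 1, n)
y = np.digitize(noisy_alcohol, [10, 11]) + 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Standardize the numeric features and one-hot encode the categorical one.
preprocessor = make_column_transformer(
    (StandardScaler(), ["alcohol", "volatile_acidity"]),
    (OneHotEncoder(drop="if_binary"), ["color"]),
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# Randomized search over the regularization strength C (assumed range).
search = RandomizedSearchCV(
    pipe,
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e3)},
    n_iter=10,
    cv=5,
    random_state=123,
)
search.fit(X_train, y_train)

# Accuracy of the best model found, evaluated on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Placing the preprocessing inside the pipeline means the scaler and encoder are refit on each cross-validation training fold, so no information from the validation folds leaks into the search.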
The dataset used in this project is the Wine Quality dataset from the UCI Machine Learning Repository (Cortez et al. 2009) and can be found here. These datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. They contain physicochemical properties (e.g., acidity, sugar content, and alcohol) of different wine samples, alongside a sensory score representing the quality of the wine, rated by experts, with scores in the dataset ranging from 3 to 9. Each row in the dataset represents a wine sample, with the columns detailing 11 physicochemical attributes and the quality score. The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones).
Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, or wine selling price).
The final report can be found here
- conda (version 24.9.1 or higher)
- conda-lock (version 2.5.7 or higher)
- Python package ucimlrepo (version 0.0.7)
- jupyterlab (version 4.2.0 or higher)
- nb_conda_kernels (version 2.5.1 or higher)
- Python and packages listed in environment.yml
If you are using Windows or Mac, please ensure that Docker Desktop is running. You can check whether Docker is installed by running the following command in a bash terminal:
docker --version
- Clone this GitHub repository.
- Make sure docker-compose.yml is using the image with the tag you wish to run. No changes are necessary if you do not need a specific image tag.
- Run the following command in a terminal in the root of the local repository to use the Docker image to run the analysis:
  docker compose up
  This command will automatically start a Jupyter Lab session using the image listed in the docker-compose.yml file and mount the current project in the Docker container.
- In the terminal, look for the Jupyter Lab link that starts with http://127.0.0.1:8888/. Copy and paste the URL into your browser to open Jupyter Lab.
- Navigate to the root of this project using the command line and enter the following command to reset the project to a clean state (i.e., remove all files generated by previous runs of the analysis):
  make clean
- To run the analysis in its entirety, enter the following command in the terminal in the project root:
  make all

Press Ctrl + C in the terminal to end the Jupyter Lab session. After the session ends, run the following command to free up the resources used by Docker:
docker compose rm
Feedback and contribution instructions can be found here.
The license can be found here.