Inder Khera, Jenny Zhang, Jessica Kuo, Javier Martinez (alphabetically ordered)
In this study, we aim to develop a classification model using the logistic regression (LR) algorithm to predict whether a patient has diabetes. Our final model performed reasonably well on an unseen test dataset, achieving an overall accuracy of 0.75. Out of 216 test cases, the model correctly classified 162. However, it made 54 incorrect predictions: 13 were false positives (non-diabetic subjects incorrectly classified as diabetic) and 41 were false negatives (diabetic patients the model failed to identify). Such errors could lead to either unnecessary treatment or delayed treatment, with the latter having more serious consequences, so we recommend further refinement of the model before it is deployed for clinical use.
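The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch that re-derives the accuracy from the counts quoted above; the variable names are illustrative and are not taken from the project code. The split of the 162 correct predictions into true positives and true negatives is not stated here, so recall and precision are not computed.

```python
# Counts quoted in the summary above.
n_test = 216
n_correct = 162
false_positives = 13
false_negatives = 41

# Total errors and overall accuracy.
n_errors = false_positives + false_negatives
accuracy = n_correct / n_test

# Share of all errors that are false negatives (missed diabetes cases).
fn_share_of_errors = false_negatives / n_errors

print(f"errors: {n_errors}")
print(f"accuracy: {accuracy:.2f}")
print(f"false negatives as a share of errors: {fn_share_of_errors:.2f}")
```

Roughly three out of four errors are false negatives, which is the error type the summary flags as more serious.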
The dataset used in this project was created by Jack W Smith, JE Everhart, WC Dickson, WC Knowler, and RS Johannes. It was sourced from the National Library of Medicine database of the National Institutes of Health. Their analysis can be found here, and the dataset can be accessed via Kaggle (Dua & Graff, 2017). Each row (observation) in the dataset is an individual who identifies as part of the Pima (also known as the Akimel O'odham) Indigenous group, located mainly in the Central and Southern regions of the United States. Each observation records features that include Age, BMI, Blood Pressure, Number of Pregnancies, and the Diabetes Pedigree Function (a score that indicates how strongly a person's diabetes is associated with their family history).
The final report can be found here.
To replicate this analysis, follow the steps below. You can run the analysis using Docker.
Prerequisites: Please note that the instructions in this section require executing them in a Unix-based shell.
First, clone this GitHub repository and navigate to its root directory:
git clone https://github.com/UBC-MDS/diabetes_predictor_py.git
cd diabetes_predictor_py
Prerequisites: Install Docker and ensure it is running on your system.
- Build and run the Docker container using the provided script:
chmod +x ./builders/docker_magic_builder.sh
./builders/docker_magic_builder.sh
This will set up the Conda environment inside a Docker container and build the Docker image.
- Once the container is running, access the server by copying and pasting the link into your browser. The link appears in the terminal output and starts with http://127... (e.g., http://127.0.0.1:8888/lab?token={your_token}).
- Navigate to the root of this project on your computer using the command line and enter the following command to reset the project to a clean state (i.e., remove all files generated by previous runs of the analysis):
make clean
- To run the analysis in its entirety, enter the following command in the terminal in the project root:
make all
- Docker: Press Ctrl+C in the terminal where you launched the container, and then type docker compose rm to shut down the container and clean up the resources.
- conda (version 23.9.0 or higher)
- conda-lock (version 2.5.7 or higher)
- mamba (version 1.5.8 or higher)
- Python and packages listed in environment.yml
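The minimum versions listed above can be checked programmatically. The sketch below is a hypothetical helper, not part of this repository: it compares dotted version strings as tuples of integers, assuming plain numeric versions such as 23.9.0 (suffixes like rc1 are not handled). The installed versions in the example are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '23.9.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Minimum versions taken from the list above.
MINIMUM_VERSIONS = {"conda": "23.9.0", "conda-lock": "2.5.7", "mamba": "1.5.8"}


def check_versions(installed):
    """Return the names of tools that are missing or older than the minimum."""
    return [
        name
        for name, minimum in MINIMUM_VERSIONS.items()
        if name not in installed or not meets_minimum(installed[name], minimum)
    ]


# Made-up installed versions, for illustration only.
example = {"conda": "23.10.0", "conda-lock": "2.5.7", "mamba": "1.5.6"}
print(check_versions(example))  # ['mamba']
```

Comparing integer tuples rather than raw strings matters here: as strings, "23.10.0" would sort before "23.9.0".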
- Add the dependency to the environment.yml file on a new branch. If the package is installed with pip, it should also be added to the Dockerfile with the command RUN pip install <package_name>==<version>
- Run conda-lock -k explicit --file environment.yml -p linux-64 to update the conda-linux-64.lock file.
- Re-build the Docker image locally to ensure it builds and runs properly.
- Push the changes to GitHub. A new Docker image will be built and pushed to Docker Hub automatically. It will be tagged with the SHA of the commit that changed the file, and GitHub Actions should automatically update the tag in the docker-compose.yml file to use the new container image.
- Send a pull request to merge the changes into the main branch.
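When adding a pip-installed package, note that pip's exact-version pin uses a double equals sign with no surrounding spaces (for example, pandas==2.2.0). The sketch below is a hypothetical pre-commit style check, not part of this repository, and its pattern is deliberately simplified rather than a full implementation of Python's version specifier rules.

```python
import re

# Simplified pattern: a package name, '==', then a version, with no spaces.
PIN_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*==[A-Za-z0-9][A-Za-z0-9.*+!_-]*$")


def is_pinned(spec):
    """Return True if spec looks like an exact pin such as 'pandas==2.2.0'."""
    return PIN_PATTERN.match(spec) is not None


for spec in ["pandas==2.2.0", "pandas = 2.2.0", "pandas"]:
    print(spec, "->", is_pinned(spec))
```

Only the first spec passes; the other two are either spaced incorrectly or unpinned.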
The Diabetes Predictor report contained herein is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. See the license file for more information. If re-using or re-mixing, please provide attribution and link to this webpage. The software code contained within this repository is licensed under the MIT license. See the license file for more information.
Dua, D., & Graff, C. (2017). Pima Indians Diabetes Database. UCI Machine Learning Repository. Retrieved from https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database/data.