
DSCI 522 Group 15 Project: Diabetes Classification Analysis. This repository showcases work on building a classification model to predict diabetes using the Pima Indians Diabetes dataset from Kaggle. The project is designed to follow best practices for data analysis and reproducibility.


UBC-MDS/diabetes_predictor_py


Diabetes Predictor

Authors

Inder Khera, Jenny Zhang, Jessica Kuo, Javier Martinez (alphabetically ordered)

About

In this study, we develop a classification model using the logistic regression (LR) algorithm to predict whether a patient has diabetes. Our final model performed reasonably well on an unseen test dataset, achieving an overall accuracy of 0.75. Out of 216 test cases, the model correctly classified 162. It made 54 incorrect predictions: 13 false positives (non-diabetic subjects classified as diabetic) and 41 false negatives (diabetic patients the model failed to flag). False positives can lead to unnecessary treatment and false negatives to delayed treatment, with the latter having more serious consequences, so we recommend further refinement of the model before it is deployed for clinical use.
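As a quick sanity check, the figures reported above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not code from the repository: the variable names are illustrative, and only the counts quoted in the paragraph are used (the split of the 162 correct predictions into true positives and true negatives is not stated here, so it is not assumed).

```python
# Counts quoted in the About section; variable names are illustrative only.
total_cases = 216
correct = 162
false_positives = 13  # non-diabetic subjects classified as diabetic
false_negatives = 41  # diabetic patients the model missed

errors = false_positives + false_negatives
# The reported counts should add up to the size of the test set.
assert correct + errors == total_cases

accuracy = correct / total_cases
fn_share = false_negatives / errors  # fraction of all errors that are misses
fp_share = false_positives / errors  # fraction of all errors that are false alarms

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.75
print(f"false negatives as a share of errors: {fn_share:.2f}")
print(f"false positives as a share of errors: {fp_share:.2f}")
```

Roughly three out of four errors are false negatives, which is the more costly error type for this task and the main reason further refinement is recommended.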

The dataset used in this analysis was created by Jack W Smith, JE Everhart, WC Dickson, WC Knowler, and RS Johannes. It was sourced from the National Library of Medicine database of the National Institutes of Health. Their original analysis can be found here, and the dataset is available via Kaggle (Dua & Graff, 2017). Each row (observation) in the dataset is an individual who identifies as a member of the Pima (also known as the Akimel O'odham) Indigenous group, located mainly in central and southern Arizona in the United States. Each observation records features that include Age, BMI, Blood Pressure, Number of Pregnancies, and the Diabetes Pedigree Function (a score that indicates how strongly a person's likelihood of diabetes is associated with their family history).

Report

The final report can be found here.

Software Dependencies

Usage

To replicate this analysis, follow the steps below. You can run the analysis using Docker.

Prerequisites: The instructions in this section must be run in a Unix-based shell.


Setup

First, clone this GitHub repository and navigate to its root directory:

git clone https://github.com/UBC-MDS/diabetes_predictor_py.git
cd diabetes_predictor_py

Run Analysis

Prerequisites: Install Docker and ensure it is running on your system.

  1. Build and run the Docker container using the provided script:

    chmod +x ./builders/docker_magic_builder.sh
    ./builders/docker_magic_builder.sh

    This will set up the Conda environment inside a Docker container and build the Docker image.

  2. Once the container is running, access the server by copying the link into your browser. The link appears in the terminal output and starts with http://127... (e.g., http://127.0.0.1:8888/lab?token={your_token}).

  3. Navigate to the root of this project on your computer using the command line and enter the following command to reset the project to a clean state (i.e., remove all files generated by previous runs of the analysis):

    make clean

  4. To run the analysis in its entirety, enter the following command in the terminal in the project root:

    make all

Clean up

  1. Docker: Press Ctrl + C in the terminal where you launched the container to shut it down, then run docker compose rm to remove the stopped container and clean up its resources.

Developer Dependencies

  • conda (version 23.9.0 or higher)
  • conda-lock (version 2.5.7 or higher)
  • mamba (version 1.5.8 or higher)
  • Python and packages listed in environment.yml

Adding a new dependency

  1. Add the dependency to the environment.yml file on a new branch. If the package is installed with pip, also add it to the Dockerfile with the command RUN pip install <package_name>==<version>

  2. Run conda-lock -k explicit --file environment.yml -p linux-64 to update the conda-linux-64.lock file.

  3. Re-build the Docker image locally to ensure it builds and runs properly.

  4. Push the changes to GitHub. A new Docker image will be built and pushed to Docker Hub automatically. It will be tagged with the SHA of the commit that changed the file, and GitHub Actions should automatically update the tag in the docker-compose.yml file to use the new container image.

  5. Send a pull request to merge the changes into the main branch.

License

The Diabetes Predictor report contained herein is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. See the license file for more information. If re-using or re-mixing, please provide attribution and a link to this webpage. The software code contained within this repository is licensed under the MIT license. See the license file for more information.

References

Dua, D., & Graff, C. (2017). Pima Indians Diabetes Database. UCI Machine Learning Repository. Retrieved from https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database/data.
