The goal of this project is to predict whether a patient is at risk of diabetes, using different health parameters.
The Diabetes prediction dataset is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative). The data includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. This dataset can be used to build machine learning models to predict diabetes in patients based on their medical history and demographic information. This can be useful for healthcare professionals in identifying patients who may be at risk of developing diabetes and in developing personalized treatment plans. Additionally, the dataset can be used by researchers to explore the relationships between various medical and demographic factors and the likelihood of developing diabetes.
This dataset provides a comprehensive array of features relevant to heart health and lifestyle choices, encompassing patient-specific details such as age, gender, cholesterol levels, blood pressure, heart rate, and indicators like diabetes, family history, smoking habits, obesity, and alcohol consumption. Additionally, lifestyle factors like exercise hours, dietary habits, stress levels, and sedentary hours are included. Medical aspects comprising previous heart problems, medication usage, and triglyceride levels are considered. Socioeconomic aspects such as income and geographical attributes like country, continent, and hemisphere are incorporated. The dataset, consisting of around 7000 records from patients around the globe, culminates in a crucial binary classification feature denoting the presence or absence of a heart attack risk, providing a comprehensive resource for predictive analysis and research in cardiovascular health.
Python 3.10
First, clone the repo to your local machine. This can be done using:
git clone
The data used for training this model is stored in /data/diabetes_prediction_dataset.csv in the repo.
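Before training, it can help to check how balanced the target classes are. The sketch below is only an illustration: it assumes the CSV has a header row and a binary target column named `diabetes`, which is a guess based on the dataset description and not confirmed by the repo. It uses a small in-memory sample in place of the real file.

```python
import csv
import io
from collections import Counter

# Assumption: the target column is called "diabetes" and holds 0/1 values.
# Change the name if the real CSV uses a different header.
TARGET_COLUMN = "diabetes"


def class_counts(csv_file, target=TARGET_COLUMN):
    """Count how many rows fall into each target class."""
    reader = csv.DictReader(csv_file)
    return Counter(row[target] for row in reader)


def positive_rate(counts, positive_label="1"):
    """Return the share of rows labelled positive."""
    total = sum(counts.values())
    return counts[positive_label] / total if total else 0.0


# Tiny made-up sample standing in for the real dataset file.
sample = io.StringIO(
    "bmi,hypertension,diabetes\n"
    "25.31,0,0\n"
    "31.20,1,1\n"
    "22.10,0,0\n"
    "27.80,0,0\n"
)
counts = class_counts(sample)
print(counts, positive_rate(counts))
```

To run it on the real data, open the CSV from the repo's data folder with `open(..., newline="")` and pass the file object to `class_counts`.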
Build Docker Image
docker build -t {build-tag} .
Run the docker image
docker run -it --rm -p 9696:9696 {build-tag}
{build-tag}: Any user-defined tag for the Docker image, e.g. diabetes-risk-score:latest
By default, the test service uses the following patient parameters:
patient = {
"hypertension": 0,
"heart_disease": 0,
"smoking_history": "current",
"bmi": 25.31,
Test the model with specific input and check the predicted probability value.
Test the model through the API endpoint, either after starting gunicorn locally or after Docker deployment.
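As a rough sketch of what a call to the endpoint could look like, the snippet below builds a JSON POST request with the standard library. The host and port follow the `docker run` command (9696), but the `/predict` route is a hypothetical name and the patient record contains only the fields visible in the default patient; the real service may expect a different route and more features.

```python
import json
import urllib.request

# Assumptions: the service listens on localhost:9696 and exposes a POST
# route at /predict. The route name is hypothetical; use the one defined
# in the repo's service code.
DEFAULT_URL = "http://localhost:9696/predict"


def build_request(patient, url=DEFAULT_URL):
    """Build a JSON POST request for one patient record."""
    body = json.dumps(patient).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(patient, url=DEFAULT_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_request(patient, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Only the fields shown in the default patient; the real service
# probably expects the full feature set.
patient = {
    "hypertension": 0,
    "heart_disease": 0,
    "smoking_history": "current",
    "bmi": 25.31,
}
request = build_request(patient)
print(request.get_method(), request.full_url)
```

Once the service is running, `predict(patient)` sends the request and returns whatever JSON the service responds with.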
Locally, you should get output similar to the one shown below after running all steps successfully.
CPU: 2 or more
Container or virtual machine manager, such as Docker, VirtualBox, etc.
### Installation
Install instructions for various platforms are located here :
The steps listed below are for Mac.
brew install minikube
brew install kubectl
minikube start
eval $(minikube docker-env)
minikube cache add python:3.10-slim
docker build -t diabetest-risk-score .
Create the deployment and expose it on port 9696.
kubectl create -f deployment.yaml
kubectl expose deployment flaskapi-deployment --type=NodePort --port=9696
I took a sample deployment.yaml for FlaskAPI and updated it with my image. Alternatively, you can create the deployment directly from the image. The details are below.
kubectl create deployment flask-api --image=diabetest-risk-score:latest
kubectl expose deployment flask-api --type=NodePort --port=9696
The easiest way to access this service is to let minikube launch a web browser for you:
minikube service flask-api
Alternatively, use kubectl to forward the port:
kubectl port-forward service/flask-api 7080:9696
Pause Kubernetes without impacting deployed applications:
minikube pause
Unpause a paused instance:
minikube unpause
Halt the cluster:
minikube stop
The project has been created as part of ML ZOOMCAMP with the help of the collaborative Slack community of DataTalks, and especially Alexey.
I trained models using Logistic Regression, Decision Tree, Random Forest, and XGBoost. XGBoost scores very slightly higher than Random Forest, but I am not getting the right predictions when I test locally with different test data. The data I chose has a lot of class imbalance. I added class_weight to correct the balance, but I still need to explore why XGBoost does not always predict correctly even with a high AUC score.
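One possible explanation worth checking, offered as a hypothesis and not a diagnosis, is that AUC only measures how well the model ranks positives above negatives, while a prediction also depends on the probability threshold. On imbalanced data a model can rank perfectly and still score every positive below 0.5. The sketch below illustrates this with made-up labels and scores; none of the numbers come from this project.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def recall(labels, scores, threshold):
    """Share of true positives whose score reaches the threshold."""
    pos_scores = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)


# Made-up scores for an imbalanced sample: 8 negatives, 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.02, 0.03, 0.05, 0.05, 0.08, 0.10, 0.12, 0.15, 0.30, 0.40]

print(roc_auc(labels, scores))       # every positive outranks every negative
print(recall(labels, scores, 0.5))   # yet no positive reaches 0.5
print(recall(labels, scores, 0.25))  # a lower threshold recovers both
```

If this turns out to be the cause, tuning the decision threshold on a validation set is one option. XGBoost also exposes a `scale_pos_weight` parameter for weighting the positive class, which may be worth trying alongside the class weights used for the other models.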