Merge pull request #125 from UBC-MDS/develop
Final Repo
kegao1995 authored Dec 17, 2024
2 parents e5938f2 + 9b4ede4 commit 03905a2
Showing 44 changed files with 906 additions and 1,996 deletions.
Empty file removed .bash_history
Empty file.
1 change: 1 addition & 0 deletions .github/workflows/docker-publish.yml
Original file line number Diff line number Diff line change
@@ -11,6 +11,7 @@ on:
paths:
- 'Dockerfile'
- 'conda-linux-64.lock'
- 'requirements.txt'

jobs:
push_to_registry:
17 changes: 0 additions & 17 deletions .local/share/jupyter/runtime/jpserver-7-open.html

This file was deleted.

13 changes: 0 additions & 13 deletions .local/share/jupyter/runtime/jpserver-7.json

This file was deleted.

1 change: 0 additions & 1 deletion .local/share/jupyter/runtime/jupyter_cookie_secret

This file was deleted.

71 changes: 71 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,71 @@
---
editor_options:
markdown:
wrap: 72
---

Revisions:

Who: Merari Santana

What was addressed:

- Scripts on README file were not running. Description of Revision: I
  revised the instructions for running the Makefile, which now runs
  all the scripts correctly. Evidence:
  <https://github.com/UBC-MDS/heart-failure-analysis/commit/3f23b4e431508388575169556cc8aa3a8e0a0646>

- Improved accessibility to our report. Description of Revision: I
  deployed GitHub Pages so that our README file has a direct link to
  our HTML report. Evidence:
  <https://github.com/UBC-MDS/heart-failure-analysis/commit/7e22dd6dc250c11948aa87be384a8f9c15fec87a>

- Changed acronyms in final report and deleted bullet points.
  Description of Revision: I changed the acronyms in our qmd file and
  deleted bullet points. These changes were rendered to our PDF and
  HTML files. Evidence:
  <https://github.com/UBC-MDS/heart-failure-analysis/commit/b91ca5a3874067d447d9646090028011784b85ba>
  <https://github.com/UBC-MDS/heart-failure-analysis/commit/7a12b5c145fc4dc222c043461186f4d0b4b43c99>

Who: Gurmehak Kaur

What was addressed:

- Improved the project folder structure. Description of Revision: I cleaned up and improved the project’s folder structure by organizing files into dedicated folders that were previously missing from our repo: `reports/` for generated summaries, `results/` with subfolders for tables and figures, `scripts/` for executable workflows, and `src/` for abstract functions. This streamlined structure improves clarity and project maintainability.
Evidence:
<https://github.com/UBC-MDS/heart-failure-analysis/commit/87eadd9b89b44e0c49dea8433a9b300577dab760>
<https://github.com/UBC-MDS/heart-failure-analysis/commit/5517cf4a60afb6bf6afef3c43c2f820a9909862c>
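The reorganized layout described above can be sketched as follows (directory names are taken from this entry and from the Makefile targets elsewhere in this commit; the exact file listing is an assumption):

```
heart-failure-analysis/
├── data/
│   ├── raw/            # downloaded source data
│   └── processed/      # train/test splits
├── reports/            # rendered heart-failure-analysis.html / .pdf
├── results/
│   ├── figures/        # heatmap, training plots
│   ├── models/         # pipeline.pickle
│   └── tables/         # confusion matrix, test scores
├── scripts/            # download_and_convert.py, modelling.py, ...
└── src/                # abstract functions
```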

Who: Ke Gao

What was addressed:

- Improve Automatic Numbering of Figures in the Report. Description of
  Revision: I improved automatic numbering of figures in the report.
  Evidence:
  <https://github.com/UBC-MDS/heart-failure-analysis/pull/106>

- Improve Automatic Numbering of Tables in the Report. Description of
  Revision: I improved automatic numbering of tables in the report.
  Evidence:
  <https://github.com/UBC-MDS/heart-failure-analysis/pull/106>

Who: Yuhan Fan

What was addressed:

- Updated README.md with the following:

  - the 'About' section of README.md with the most recent results
    metrics from our final report, and fixed any grammar errors.

- Deleted bullet point and capitalized "contributors" in
README.md.

- Added GitHub repository link under 'Usage' - 'Setup' section.

- Added example screenshot image to 'Running the analysis'
section.

- Evidence:
<https://github.com/UBC-MDS/heart-failure-analysis/pull/120>
5 changes: 3 additions & 2 deletions Dockerfile
@@ -9,6 +9,8 @@ USER root
RUN sudo apt update \
&& sudo apt install -y lmodern

RUN apt-get update && apt-get install -y build-essential make

USER $NB_UID

RUN mamba update --quiet --file /tmp/conda-linux-64.lock
@@ -17,8 +19,7 @@ RUN mamba clean --all -y -f
RUN pip install --no-cache-dir -r /tmp/requirements.txt
RUN pip cache purge


RUN fix-permissions "${CONDA_DIR}"
RUN fix-permissions "/home/${NB_USER}"

RUN pip install deepchecks==0.18.1

80 changes: 48 additions & 32 deletions Makefile
@@ -1,57 +1,73 @@
.PHONY: all clean

all: report/heart_failure_analysis.html report/heart_failure_analysis.pdf
all: data/raw/heart_failure_clinical_records.data \
data/processed/heart_failure_train.csv \
results/figures/correlation_heatmap.png \
results/models/pipeline.pickle results/figures/training_plots \
results/tables/confusion_matrix.csv \
results/tables/test_scores.csv \
reports/heart-failure-analysis.html \
reports/heart-failure-analysis.pdf

# Download and convert data
data/raw/heart_failure_clinical_records.data : scripts/download_and_convert.py
data/raw/heart_failure_clinical_records.data: scripts/download_and_convert.py
python scripts/download_and_convert.py \
--url="https://archive.ics.uci.edu/static/public/519/heart+failure+clinical+records.zip" \
--write_to=data/raw

# Process and analyze data
data/processed/heart_failure_train.csv data/processed/heart_failure_test.csv : scripts/process_and_analyze.py data/raw/heart_failure_clinical_records.data
python scripts/process_and_analyze.py \
--file_path=data/raw/heart_failure_clinical_records.data \
--data-to=data/processed
--file_path="data/raw/heart_failure_clinical_records_dataset_converted.csv" \
--output_dir=data/processed

# Perform correlation analysis
results/figures/correlation_heatmap.png : scripts/correlation_analysis.py data/processed/heart_failure_train.csv data/processed/heart_failure_test.csv
python scripts/correlation_analysis.py \
--train_file=data/processed/heart_failure_train.csv \
--test_file=data/processed/heart_failure_test.csv \
--output_file=results/figures/correlation_heatmap.png
--output_file="./results/figures/heatmap.png"

# Train and evaluate the model
results/models/pipeline.pickle results/figures/training_plots : scripts/modelling.py data/processed/heart_failure_train.csv
python scripts/modelling.py \
--training-data=data/processed/heart_failure_train.csv \
--pipeline-to=results/models \
--plot-to=results/figures \
--seed=123

results/tables/test_evaluation.csv : scripts/model_evaluation.py data/processed/heart_failure_test.csv results/models/pipeline.pickle
results/models/pipeline.pickle results/figures/training_plots: data/processed/heart_failure_train.csv
python scripts/modelling.py \
--training-data "./data/processed/heart_failure_train.csv" \
--pipeline-to "results/models" \
--plot-to "results/figures" \
--table-to "results/tables" \
--seed 123

results/tables/confusion_matrix.csv results/tables/test_scores.csv: scripts/model_evaluation.py data/processed/heart_failure_test.csv results/models/pipeline.pickle
python scripts/model_evaluation.py \
--scaled-test-data=data/processed/heart_failure_test.csv \
--pipeline-from=results/models/pipeline.pickle \
--results-to=results/tables
--scaled-test-data "data/processed/heart_failure_test.csv" \
--pipeline-from "results/models/pipeline.pickle" \
--results-to "results/tables"

# Build HTML and PDF reports
report/heart_failure_analysis.html report/heart_failure_analysis.pdf : report/heart_failure_analysis.qmd \
results/models/pipeline.pickle \
results/figures/heatmap.html \
results/figures/training_plots \
results/tables/test_evaluation.csv
quarto render report/heart_failure_analysis.qmd --to html
quarto render report/heart_failure_analysis.qmd --to pdf
# Rule to generate HTML
reports/heart-failure-analysis.html:
quarto render reports/heart-failure-analysis.qmd --to html --embed-resources --standalone

# Rule to generate PDF
reports/heart-failure-analysis.pdf:
quarto render reports/heart-failure-analysis.qmd --to pdf


# Clean up analysis
clean:
rm -rf data/raw/*
rm -f results/data/processed/heart_failure_train.csv \
results/data/processed/heart_failure_test.csv \
results/models/pipeline.pickle \
results/figures/heatmap.html \
results/figures/training_plots \
results/tables/test_evaluation.csv \
report/heart_failure_analysis.html \
report/heart_failure_analysis.pdf
rm -rf \
data/processed/* \
results/figures/* \
results/img/* \
results/models/* \
results/pipeline/* \

rm -f \
results/tables/test_scores.csv \
results/tables/confusion_matrix.csv \
results/tables/confusion_matrix.csv \
results/tables/logistic_regression_coefficients.csv \
reports/heart-failure-analysis.html \
reports/heart-failure-analysis.pdf


58 changes: 20 additions & 38 deletions README.md
@@ -1,13 +1,17 @@
# Heart Failure Analysis

- contributors: Yuhan Fan, Gurmehak Kaur, Ke Gao, Merari Santana
Contributors: Yuhan Fan, Gurmehak Kaur, Ke Gao, Merari Santana

## About

In this project, we attempt to build a classification model using logistic regression algorithm to predict patient mortality risk after surviving a heart attack using their medical records. Using patient test results, the final classifier achieves an accuracy of 81.6%. The model’s precision of 70.0% suggests it is moderately conservative in predicting the positive class (death), minimizing false alarms.More importantly, the recall of 73.68% ensures the model identifies the majority of high-risk patients, reducing the likelihood of missing true positive cases, however, there is still room for a lot of improvement, particularly in aiming to maximise recall by minimising False Negatives. The F1-score of 0.71 reflects a good balance between precision and recall, highlighting the model’s robustness in survival prediction. While promising, further refinements are essential for more reliable predictions and effectively early intervention.
In this project, we attempt to build a classification model using a logistic regression algorithm to predict patient mortality risk after surviving a heart attack, using their medical records. Using patient test results, the final classifier achieves an accuracy of 0.82. The model’s precision of 0.70 suggests it is moderately conservative in predicting the positive class (death), minimizing false alarms. More importantly, the recall of 0.74 ensures the model identifies the majority of high-risk patients, reducing the likelihood of missing true positive cases; however, there is still considerable room for improvement, particularly in aiming to maximise recall by minimising false negatives. The F1-score of 0.72 reflects a good balance between precision and recall, highlighting the model’s robustness in survival prediction. While promising, further refinements are essential for more reliable predictions and effective early intervention.
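The accuracy, precision, recall, and F1 figures quoted above are all derived from the test-set confusion matrix. The sketch below uses hypothetical counts chosen for illustration, not this project's actual test results:

```python
# Hypothetical confusion-matrix counts for the positive class ("death").
# Illustrative only -- not the counts from this project's test set.
tp, fp, fn, tn = 14, 6, 5, 50

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted deaths that are real deaths
recall = tp / (tp + fn)                     # share of real deaths the model catches
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

With these made-up counts, precision (0.70), recall (0.74), and F1 (0.72) happen to line up with the values quoted above; accuracy additionally depends on the test set's class balance, which cannot be recovered from those three metrics alone.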

The data set used in this project was created by D. Chicco and Giuseppe Jurman in 2020. It was sourced from the UCI Machine Learning Repository and can be found [here](https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records). It contains the medical records of 299 patients who had heart failure, collected during their follow-up period; each patient profile has 13 clinical features (age, anaemia, diabetes, platelets, etc.).

## Report

The final report can be found [here](https://ubc-mds.github.io/heart-failure-analysis/reports/heart-failure-analysis.html).

## Dependencies

- Docker
@@ -20,21 +24,30 @@ The data set used in this project was created by D. Chicco, Giuseppe Jurman in 2

> If you are using Windows or Mac, make sure Docker Desktop is running.
1. Clone this GitHub repository.
1. Clone this [GitHub repository](https://github.com/UBC-MDS/heart-failure-analysis/tree/main).

### Running the analysis

1. Navigate to the root of this project on your computer using the command line and enter the following command:
2. Navigate to the root of this project on your computer using your local terminal and then enter the following command:

```
docker compose up
```

2. In the terminal, look for a URL that starts with [`http://127.0.0.1:8888/lab?token=`](http://127.0.0.1:8888/lab?token=) (for an example, see the highlighted text in the terminal below). Copy and paste that URL into your browser.
3. In the terminal output, look for a URL that starts with `http://127.0.0.1:8888/lab?token=` (for an example, see the highlighted text in the terminal screenshot below). Copy and paste that URL into your browser.


4. Navigate to the root of this project on your computer using the command line and enter the following command to reset the project to a clean state (i.e., remove all files generated by previous runs of the analysis):

```
make clean
```

<img src="img/jupyter-container-web-app-launch-url.png" width="400"/>
5. To run the analysis in its entirety, enter the following command in the terminal in the project root:

3. To run the analysis, open `heart-failure-analysis.ipynb` in Jupyter Lab you just launched and under the "Kernel" menu click "Restart Kernel and Run All Cells...".
```
make all
```

### Clean up

@@ -61,37 +74,6 @@ docker compose up

6. Send a pull request to merge the changes into the `main` branch.

### Calling scripts

To run the analysis, open a terminal and run the following commands and their respective arguments:

```
python scripts/download_and_convert.py \
--url "https://archive.ics.uci.edu/static/public/519/heart+failure+clinical+records.zip"
python scripts/process_and_analyze.py \
--file_path "../data/heart_failure_clinical_records_dataset_converted.csv"
python scripts/correlation_analysis.py \
--train_file "./data/processed/heart_failure_train.csv" \
--test_file "./data/processed/heart_failure_test.csv" \
--output_file "./results/figures/heatmap.html"
python scripts/modelling.py \
--training-data "./data/processed/heart_failure_train.csv" \
--pipeline-to "results/pipeline" \
--plot-to "results/figures" \
--seed 123
python scripts/model_evaluation.py \
--scaled-test-data=data/processed/heart_failure_test.csv \
--pipeline-from=results/pipeline/heart_failure_model.pickle \
--results-to=results/figures
quarto render heart-failure-analysis.qmd --to html
quarto render heart-failure-analysis.qmd --to pdf
```

## License

This dataset is licensed under a [Creative Commons Attribution 4.0 International (CC BY 4.0) license](https://creativecommons.org/licenses/by/4.0/legalcode).
Empty file removed data/.gitkeep
Empty file.
File renamed without changes.
3 changes: 2 additions & 1 deletion docker-compose.yml
@@ -1,6 +1,7 @@
services:
jupyter-notebook:
image: gur5/heart-failure-prediction:7de3b28
image: gur5/heart-failure-prediction:fe61672

ports:
- "8888:8888"
volumes:
12 changes: 6 additions & 6 deletions environment.yml
@@ -13,9 +13,9 @@ dependencies:
- joblib=1.3.1
- pip=24.0
- pytest=8.3.4
- pip:
- altair-ally==0.1.1
- vega-datasets==0.9.0
- vegafusion==1.6.9
- deepchecks==0.18.1
- pandera==0.20.4
# - pip:
# - altair-ally==0.1.1
# - vega-datasets==0.9.0
# - vegafusion==1.6.9
# - deepchecks==0.18.1
# - pandera==0.20.4
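The pip packages commented out above appear to have moved into the `requirements.txt` that this commit's Dockerfile installs (`pip install --no-cache-dir -r /tmp/requirements.txt`) and that `docker-publish.yml` now watches as a trigger path. A plausible sketch of that file, assuming it simply mirrors the versions previously pinned here:

```
altair-ally==0.1.1
vega-datasets==0.9.0
vegafusion==1.6.9
deepchecks==0.18.1
pandera==0.20.4
```

Note that `deepchecks==0.18.1` is also installed by a separate `RUN pip install` layer in the Dockerfile, so it may or may not be duplicated in `requirements.txt` itself.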
253 changes: 134 additions & 119 deletions reports/heart-failure-analysis.html

Large diffs are not rendered by default.

