This document contains records of the improvements we have made to our project based on the feedback we have received from the DSCI 522 teaching team and peer reviews.
Each section highlights the feedback provided, the state of the project before the change, and the state of the project after improvements.
Improvement by: Sepehr
Our environment.yml file initially listed dependencies incorrectly. Feedback from the TA informed us that we had not pinned versions directly using = but had used >= instead.
Upon review, a mistake in the version specification for ucimlrepo was also found (ucimlrepo==0.0.7 instead of ucimlrepo=0.0.7).
- Updated jupyterlab>=3.5 to jupyterlab=3.5, but then removed this dependency as it is redundant.
- Updated pip>=24.2 to pip=24.2
- Updated ucimlrepo==0.0.7 to ucimlrepo=0.0.7
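The pinning convention described above can be checked mechanically. The following is a minimal sketch, assuming the dependency specs have already been read from environment.yml into a list of strings; the helper name unpinned_specs and the PINNED pattern are hypothetical and are not part of the project.

```python
import re

# Hypothetical pattern: a package name, exactly one "=", then a version.
# Specs written with ">=" or "==" do not match and are reported.
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+=[0-9][A-Za-z0-9_.*]*$")


def unpinned_specs(specs):
    """Return the specs that are not written in the name=version form."""
    return [spec for spec in specs if not PINNED.match(spec.strip())]


# The specs named in this improvement, before and after the fix.
before = ["jupyterlab>=3.5", "pip>=24.2", "ucimlrepo==0.0.7"]
after = ["pip=24.2", "ucimlrepo=0.0.7"]

print(unpinned_specs(before))  # every original spec is flagged
print(unpinned_specs(after))   # nothing is flagged after the fix
```

A check like this could be run before committing changes to the environment file, although the project itself may rely on manual review instead.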
Feedback received
Initial state (before improvement)
After improvement
Link to commit changes:
Commit 8210652
Commit 82cd5b7
Commit 99889d7
Improvement by: Sepehr
The following feedback was provided by the TA for Milestone 1 regarding our Summary and Introduction section:
Initial state (before improvement)
After improvement
Summary Section:
- Limitations are clearly stated - "However, large portion of the dataset used in our analysis was synthetically created, while ensuring a balance dataset, this may introduce potential biases. Additionally, the data was collected from only three countries and would benefit to have data from more a diverse global population for a broader application".
- Deeper discussion of impact of work is stated - "our results show promising potential for application of machine learning in obesity diagnosis to aid healthcare professionals".
- Analysis question mentioned in the summary.
Introduction Section:
- A short description of the dataset and a brief discussion are now included in the introduction. In the Data section under Methods, a more detailed discussion was added to elaborate on the strengths and limitations of the dataset. This discussion addresses the TA's feedback regarding why this dataset was chosen.
- Research question is clearly stated and relevance of input variables is discussed.
These changes were made through a series of commits, some of which are listed below. Because reviewing every commit could be tedious, screenshots of the Summary and Introduction sections are available at the following paths from the root directory (alternatively, the report can be viewed instead):
path for Summary screenshot: img/Improvement_2_Summary.png
path for Introduction screenshot: img/Improvement_2_Intro.png
Commit 5288ac1
Commit 57a6658
Improvement by: Yun Zhou
The following feedback was provided by peers in other groups for Milestone 3:
Initial state (before improvement)
We only included accuracy in our results table.
After improvement
We added precision and recall as supporting metrics to evaluate the final model. Because the focus is not on any specific level of the target variable, we calculated the average precision and average recall across all target levels. This gives a more comprehensive evaluation of the model's performance.
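Averaging precision and recall across all target levels corresponds to what is commonly called macro averaging: each class gets its own precision and recall, and the unweighted mean is reported. The sketch below illustrates the calculation in plain Python; the function name and the toy labels are hypothetical, and the project's own code may compute these values differently (for example with a library routine).

```python
def macro_precision_recall(y_true, y_pred):
    """Return (macro precision, macro recall) over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        # True positives: predicted this label and it was correct.
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # tp + false positives
        actual = sum(t == label for t in y_true)     # tp + false negatives
        # Assumption: a class with no predictions (or no samples) scores 0.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy labels for illustration only; these are not the project's data.
y_true = ["normal", "normal", "obese", "obese", "overweight", "overweight"]
y_pred = ["normal", "obese", "obese", "obese", "overweight", "normal"]

precision, recall = macro_precision_recall(y_true, y_pred)
print(f"macro precision = {precision:.3f}, macro recall = {recall:.3f}")
```

Because every class contributes equally to the mean, this form of averaging suits a setting where no single target level is of primary interest.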
Commit link for this improvement:
Commit 49908af
Improvement by: Yun Zhou
The following feedback was provided by a peer in another group for Milestone 3:
Feedback received
Initial state (before improvement)
After improvement
We removed all figure titles that duplicated information already included in the figure descriptions.
Commit link for this improvement:
Commit 4ba64ac
Improvement by: Yun Zhou
The following feedback was provided by the TA for Milestone 2 regarding our Data Validation section and breaking the golden rule:
Feedback received
Initial state (before improvement)
After improvement
We fixed the data leakage issue in the correlation check steps of data validation. The correlation checks are now performed only on the training set, after the dataset has been split. This ensures that the test set remains untouched and the golden rule is followed.
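The ordering that matters here is split first, then validate on the training portion only. The sketch below shows that ordering with the standard library; the column names, the synthetic rows, the split fraction, and both helper functions are hypothetical stand-ins, not the project's actual validation code.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def split_rows(rows, test_fraction=0.2, seed=522):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic rows for illustration only (hypothetical columns).
rng = random.Random(0)
heights = [rng.uniform(150, 190) for _ in range(100)]
rows = [{"height": h, "weight": 0.9 * h + rng.gauss(0, 5)} for h in heights]

# 1. Split first, so the test rows are set aside before any inspection.
train, test = split_rows(rows)

# 2. Run the correlation check on the training rows only.
train_corr = pearson(
    [row["height"] for row in train],
    [row["weight"] for row in train],
)
print(f"train-only correlation: {train_corr:.2f}")
```

The test rows are never passed to the correlation function, so nothing learned from them can influence later modelling decisions.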
Commit link for this improvement:
Commit 2e759d8
Improvement by: Zanan
The following feedback was provided through peer reviews after Milestone 3. In the README.md file, the instructions for developers did not list conda-lock and conda as dependencies, which are needed when additional dependencies have to be added.
After improvement
We added the conda and conda-lock versions under the developer notes section.
Commit link for this improvement:
Commit ccc753a
Improvement by: Zanan
The following feedback was provided through peer reviews after Milestone 3. It refers to the lack of visual aids, apart from Figure 1, in our report to communicate more effectively.
Initial state (before improvement)
Initially, we only had Figure 1, which showed the class distribution of the target. This lack of visual representation made the report less comprehensible.
After improvement
We added two more distribution plots, one for categorical features and one for numerical features, along with a brief summary. The plots were generated through our EDA.
Commit link for this improvement:
Commit 3b33f68
Improvement by: Sepehr
The following feedback was provided by the TA for Milestone 1 regarding the Method section:
Feedback received
Initial state (before improvement)
As mentioned in the feedback from the TA, our Method section lacked clarity regarding the metric used for hyperparameter optimization, why the dataset was balanced, which features were used in the model, why synthetic data was generated, and the limitations of the SMOTE filter.
After improvement
The following were included in the Method section:
- Clearly specified accuracy as the metric for hyperparameter optimization.
- Mentioned that the dataset is balanced and added a plot.
- Explained that all features were used in model training, and why.
- Added an explanation of why synthetic data was used, emphasizing the need for class balance.
- Clarified the SMOTE filter as a limitation.
These changes were made through the series of commits listed below. Because going through all of the commits could be tedious, screenshots of the Method section are available at the following paths from the root directory (alternatively, the report HTML can be viewed instead):
path for Method - Data screenshot: img/improvement_8_method_data.png
path for Method - analysis screenshot: img/improvement_8_method_analysis.png
Commit link for this improvement:
Commit 06fab17
Commit 844ac5a
Commit 5288ac1
Improvement by: Sepehr
The TA provided feedback in office hours after Milestone 3 that some values in the report, specifically the number of rows and features of the dataset, were hardcoded rather than generated with inline code.
Initial state (before improvement)
After improvement
We used inline code to refer to the shape of the data frame. This reduces the risk of stale or mistyped values, improves transparency, and promotes reproducibility.
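The idea is that the reported numbers are computed from the data at render time instead of being typed by hand. The sketch below illustrates this with the standard library; the CSV content, the column names, and the assumption that exactly one column is the target are all hypothetical. In the report itself, the same expressions would be placed in whatever inline-code syntax the report format provides.

```python
import csv
import io

# Hypothetical stand-in for the real data file.
raw = io.StringIO(
    "age,height,weight,obesity_level\n"
    "21,1.62,64.0,normal\n"
    "23,1.80,77.0,normal\n"
    "27,1.78,87.0,overweight\n"
)
rows = list(csv.DictReader(raw))

# Derive the values from the data rather than hardcoding them.
n_rows = len(rows)
n_features = len(rows[0]) - 1  # assumption: one column is the target

sentence = f"The dataset contains {n_rows} rows and {n_features} features."
print(sentence)
```

If the underlying data changes, the sentence updates on the next render without any manual edit.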
Commit link for this improvement:
Commit 1796b22