Feedback #1

github-classroom · 2024-09-30T04:33:33Z

👋! GitHub Classroom created this pull request as a place for your teacher to leave feedback on your work. It will update automatically. Don’t close or merge this pull request, unless you’re instructed to do so by your teacher.
In this pull request, your teacher can leave comments and feedback on your code. Click the Subscribe button to be notified if that happens.
Click the Files changed or Commits tab to see all of the changes pushed to the default branch since the assignment started. Your teacher can see this too.

Notes for teachers

Use this PR to leave feedback. Here are some tips:

Click the Files changed tab to see all of the changes pushed to the default branch since the assignment started. To leave comments on specific lines of code, put your cursor over a line of code and click the blue + (plus sign). To learn more about comments, read “Commenting on a pull request”.
Click the Commits tab to see the commits pushed to the default branch. Click a commit to see specific changes.
If you turned on autograding, then click the Checks tab to see the results.
This page is an overview. It shows commits, line comments, and general comments. You can leave a general comment below.
For more information about this pull request, read “Leaving assignment feedback in GitHub”.

Subscribed: @park-jaeuk @choitaesoon @jinnk0 @Cyberger @JaeEunSeo

- submission_to_csv: creates the final submission CSV
- mae_to_csv: creates a CSV for checking the MAE score of each experiment

ilovemyminutes · 2024-11-03T22:48:07Z

code/features/clustering_features.py

+from sklearn.neighbors import KDTree
+from sklearn.cluster import KMeans
+
+#from geopy.distance import great_circle


Unused code should be deleted.

ilovemyminutes · 2024-11-03T22:50:23Z

code/features/clustering_features.py

+### Clustering
+
+# clustering function
+def clustering(total_df, info_df, feat_name, n_clusters=20):


Function names are best written as verbs.

A function's description belongs inside the function (as a docstring), not above it.

def make_clustering(total_df, info_df, feat_name, n_clusters=20):
    """Clustering function."""
    ...

ilovemyminutes · 2024-11-03T22:53:55Z

code/features/count_features.py

+# Function: transaction count for the same apartment over the last n months
+def transaction_count_function(train_data: pd.DataFrame, valid_data: pd.DataFrame, test_data: pd.DataFrame, months: int = 3) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+    # Set the file path
+    transaction_folder = os.path.join(Directory.root_path, 'data/transaction_data')


Using pathlib would make this cleaner.

Reference: "035 How to handle file paths as objects: pathlib" (wikidocs, in Korean) - https://wikidocs.net/110182
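A minimal sketch of what the pathlib version of the reviewed line could look like. `ROOT_PATH` is a hypothetical stand-in for `Directory.root_path`, whose real value is not shown in the diff.

```python
from pathlib import Path

# Hypothetical stand-in for Directory.root_path; the real value is not shown in the diff.
ROOT_PATH = Path("/workspace/project")

# The `/` operator joins path segments, replacing os.path.join.
transaction_folder = ROOT_PATH / "data" / "transaction_data"

# Path objects expose the common pieces directly as attributes.
folder_name = transaction_folder.name      # last path component
parent_folder = transaction_folder.parent  # everything before the last component
```

Because `transaction_folder` is a `Path`, it can be passed to most file-handling functions that accept a path, so the rest of the function would not need to change much.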

ilovemyminutes · 2024-11-03T22:54:26Z

code/features/count_features.py

+    total_data[f'transaction_count_last_{months}_months'] = 0
+
+    # Group by latitude, longitude, and built year
+    grouped = total_data.groupby(['latitude', 'longitude', 'built_year', 'area_m2'])


The variable name is not very descriptive. Please make it more explicit.

ilovemyminutes · 2024-11-03T22:57:00Z

code/features/count_features.py

+def create_school_counts_within_radius_by_school_level(train_data: pd.DataFrame, valid_data: pd.DataFrame, test_data: pd.DataFrame, radius : float = 0.02) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+    school_info = Directory.school_info
+    seoul_area_school = school_info[(school_info['latitude'] >= 37.0) & (school_info['latitude'] <= 38.0) &
+                                     (school_info['longitude'] >= 126.0) & (school_info['longitude'] <= 128.0)]


Threshold constants appear inline here. It would be better to pull them out into named variables and use those instead.

SCHOOL_LATITUDE_BOUNDARY = (37.0, 38.0)
SCHOOL_LONGITUDE_BOUNDARY = (126.0, 128.0)

ilovemyminutes · 2024-11-03T23:01:44Z

code/utils/constant_utils.py

+        "enable_categorical": True
+    }
+
+class Directory:


Assigning pandas DataFrames to class variables does not seem like a good idea. Assigning them to instance variables would likely be more efficient.

The class name Directory also does not seem appropriate.
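One way to read this suggestion, sketched under assumptions: data is held per instance and loaded on first use, so nothing is read when the module is imported. The class name `DataRepository` and the injected `loader` are hypothetical; in the real project the loader would presumably be something like `pandas.read_csv`.

```python
from pathlib import Path
from typing import Any, Callable, Dict


class DataRepository:
    """Hypothetical replacement for `Directory`: data lives on the instance, not the class."""

    def __init__(self, root_path: Path, loader: Callable[[Path], Any]):
        self.root_path = Path(root_path)
        self._loader = loader              # assumed to be e.g. pandas.read_csv in the real project
        self._cache: Dict[str, Any] = {}   # per-instance cache, released with the instance

    def load(self, relative_path: str) -> Any:
        # Each file is read at most once per instance, and only when first requested.
        if relative_path not in self._cache:
            self._cache[relative_path] = self._loader(self.root_path / relative_path)
        return self._cache[relative_path]
```

With this shape, a script that only needs one table never pays for loading the others, and tests can pass a fake `loader` instead of touching the disk.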

ilovemyminutes · 2024-11-03T23:02:22Z

code/features/count_features.py

+# Function: number of subway stations within a radius
+def create_subway_within_radius(train_data: pd.DataFrame, valid_data: pd.DataFrame, test_data: pd.DataFrame, radius : float = 0.01) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+    # Assume subwayInfo contains the latitude and longitude of each subway station
+    subwayInfo = Directory.subway_info


Python variable names do not use camelCase.

ilovemyminutes · 2024-11-03T23:04:15Z

code/features/deposit_features.py

@@ -0,0 +1,169 @@
+from sklearn.cluster import KMeans
+from utils.constant_utils import Config, Directory


Please remove unnecessary imports.

ilovemyminutes · 2024-11-03T23:10:25Z

code/handler/cnn_mlp_datasets.py

+        if self.mode=="train" or self.mode=="valid":
+            return (self.X[idx], self.y[idx])
+        else:
+            return (self.X[idx])


As a rule, a function should return a single form of output. Rather than branching on train/valid mode inside GridDataset, I would have implemented TrainGridDataset and ValidGridDataset separately.
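A rough sketch of the idea, splitting along the return shape so that each class always returns the same thing. The class names here are hypothetical (the review suggests TrainGridDataset and ValidGridDataset), and plain Python sequences stand in for tensors; in the real project these classes would presumably subclass `torch.utils.data.Dataset`.

```python
from typing import Any, Sequence, Tuple


class LabeledGridDataset:
    """Always returns an (x, y) pair; suitable for train and valid splits."""

    def __init__(self, X: Sequence[Any], y: Sequence[Any]):
        if len(X) != len(y):
            raise ValueError("X and y must have the same length")
        self.X = X
        self.y = y

    def __len__(self) -> int:
        return len(self.X)

    def __getitem__(self, idx: int) -> Tuple[Any, Any]:
        return self.X[idx], self.y[idx]


class UnlabeledGridDataset:
    """Always returns x only; there is no target at inference time."""

    def __init__(self, X: Sequence[Any]):
        self.X = X

    def __len__(self) -> int:
        return len(self.X)

    def __getitem__(self, idx: int) -> Any:
        return self.X[idx]
```

Neither class needs a `mode` argument, so calling code can rely on the return shape without checking which mode was set.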

ilovemyminutes · 2024-11-03T23:13:39Z

code/handler/cnn_mlp_datasets.py

+        self.mode = mode
+        df = common_utils.merge_data(Directory.train_data, Directory.test_data)
+
+        ### Apply cluster features


Everything from here down appears to exist only to assign the instance variables X and y. I think this preprocessing should have been handled outside MLPDataset, because a Dataset's job is to output samples in the form the model expects.

ilovemyminutes · 2024-11-03T23:14:54Z

code/handler/feature_engineering.py

+
+from features import clustering_features, count_features, deposit_features, distance_features, other_features
+
+def feature_engineering(train_data_ : pd.DataFrame , valid_data_ : pd.DataFrame , test_data_ : pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:


A verb-form function name such as apply_feature_engineering would be better.

ilovemyminutes · 2024-11-03T23:18:20Z

code/handler/preprocessing.py

+    # Apply one-hot encoding to the categorical variables
+    train_data_encoded = pd.get_dummies(train_data, columns=categorical_cols, drop_first=True)
+    valid_data_encoded = pd.get_dummies(valid_data, columns=categorical_cols, drop_first=True)
+    test_data_encoded = pd.get_dummies(test_data, columns=categorical_cols, drop_first=True)


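For this project, get_dummies is enough to do the one-hot encoding. If you take on more complex projects later, though, it is worth keeping the following questions in mind.

(1) Would this code still work if the data size (number of rows) exceeded 100 million?

(2) Would this code still work if a categorical variable had more than 100 million categories?

If either (1) or (2) is a concern, how should the code be written? How should the one-hot encoding be done?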
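One possible direction for these questions, sketched in plain Python rather than as the reviewer's intended answer: fit the category-to-column mapping once on the training split, then store only the positions of the 1s instead of a dense 0/1 matrix. The function names and category values below are made up for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def fit_vocabulary(values: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Assign a column index to each distinct category seen in the training split."""
    vocabulary: Dict[Hashable, int] = {}
    for value in values:
        if value not in vocabulary:
            vocabulary[value] = len(vocabulary)
    return vocabulary


def encode_sparse(values: Iterable[Hashable], vocabulary: Dict[Hashable, int]) -> List[Tuple[int, int]]:
    """Return the (row, column) positions of the 1s instead of a dense 0/1 matrix.

    Categories unseen at fit time get no entry, which corresponds to an all-zero row.
    """
    return [
        (row, vocabulary[value])
        for row, value in enumerate(values)
        if value in vocabulary
    ]


# Made-up category values for illustration only.
train_categories = ["apt", "villa", "apt", "officetel"]
test_categories = ["villa", "studio"]  # "studio" never appeared in training

vocabulary = fit_vocabulary(train_categories)
train_encoded = encode_sparse(train_categories, vocabulary)
test_encoded = encode_sparse(test_categories, vocabulary)
```

Storage here grows with the number of rows rather than rows times categories, and every split shares one column layout. Calling get_dummies separately on each split, by contrast, can produce different columns when the splits contain different categories.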

ilovemyminutes · 2024-11-03T23:18:48Z

code/main.py

+
+    # Log transform
+    # train_data_ = pre.log_transform(train_data_, 'deposit')
+    # valid_data_ = pre.log_transform(valid_data_, 'deposit')


All unnecessary code should be removed.

ilovemyminutes · 2024-11-03T23:19:25Z

code/main.py

+
+
+    ### Build the final dataset (top_20_features)
+    selected_columns = Config.TOP_20_FEATURES


There was no real need to assign this to a new variable.

ilovemyminutes · 2024-11-03T23:21:17Z

code/models/CombinedModel.py

+        x = self.pool(self.relu(self.conv2(x)))  # (N, 32, 21, 14) -> (N, 64, 10, 7)
+
+        # Flatten the output of the conv layers
+        x = x.view(-1, 64 * 10 * 7)  # Flatten: (N, 64, 10, 7) -> (N, 64 * 10 * 7)


It is better to write the precomputed result directly. There is no need to perform two unnecessary multiplications on every forward pass.

ilovemyminutes · 2024-11-03T23:23:07Z

code/models/DL_tabtransformer/trainer.py

+
+class TabTransformerTrainer:
+    def __init__(self, model, optimizer, loss_fn, device):
+        """Initialize TabTransformerTrainer."""


This docstring is not needed. The magic method name __init__ is self-explanatory.

ilovemyminutes · 2024-11-03T23:25:00Z

code/models/SeedEnsemble.py

+
+    def train(self, train_data, dataset_type):
+        for seed in self.seeds:
+            model_ = self.model_class(self.spatial_weight_matrix, seed=seed)


Please do not add unnecessary underscores (_).

ilovemyminutes · 2024-11-03T23:27:28Z

code/models/SpatialWeightMatrix.py

+        dataset_type : train, valid, test, train_total # train_total : used when training on train + valid combined
+        '''
+        dir_path = os.path.join(self.base_save_directory, dataset_type)
+        os.makedirs(dir_path, exist_ok=True)


Given the function's purpose, the directory creation should not be here. If you do want to keep the directory creation, the function name should be changed to reflect it.
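A small sketch of the two options, assuming the function under review is the `get_save_directory` method that is called in a later snippet. The module-level functions and the name `ensure_save_directory` are hypothetical.

```python
from pathlib import Path


def get_save_directory(base_save_directory: Path, dataset_type: str) -> Path:
    """Only computes the path; nothing on disk is touched."""
    return Path(base_save_directory) / dataset_type


def ensure_save_directory(base_save_directory: Path, dataset_type: str) -> Path:
    """Computes the path and creates the directory if it is missing."""
    dir_path = get_save_directory(base_save_directory, dataset_type)
    dir_path.mkdir(parents=True, exist_ok=True)
    return dir_path
```

Code that saves a chunk would call `ensure_save_directory`, while code that only reads would call `get_save_directory`, so a read never creates an empty directory as a side effect.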

ilovemyminutes · 2024-11-03T23:28:02Z

code/models/SpatialWeightMatrix.py

+        os.makedirs(dir_path, exist_ok=True)
+        return dir_path
+
+    def create_weight_matrix(self, data_chunk, chunk_id, dataset_type, tree):


The form of each argument is hard to guess. Type hints are needed.

ilovemyminutes · 2024-11-03T23:29:34Z

code/models/SpatialWeightMatrix.py

+            for j in range(self.k):
+                weight_matrix[i, indices[i, j]] = weights[j]
+
+        sparse_matrix = csr_matrix(weight_matrix) # store the generated spatial weight matrix as a sparse matrix


(I believe I mentioned this during mentoring as well.) If this spatial weight matrix is not sparse, there is no reason to use csr_matrix.
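Whether the matrix counts as sparse can be estimated before choosing a storage format. The sketch below assumes each row holds at most `k` non-zero weights, as the loop over `self.k` in the snippet suggests; the function name and the example sizes are illustrative and not taken from the project.

```python
def knn_weight_matrix_density(n_cols: int, k: int) -> float:
    """Fraction of non-zero cells when every row holds at most k weights."""
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    return min(k, n_cols) / n_cols


# Illustrative sizes only; the project's real k and matrix width are not shown in the diff.
density_wide = knn_weight_matrix_density(n_cols=100_000, k=10)  # very few non-zeros per row
density_narrow = knn_weight_matrix_density(n_cols=20, k=10)     # half of each row is non-zero
```

When the density is low, a compressed format saves memory. When a large share of the cells is non-zero, the index arrays that a compressed format stores alongside the values can cost more than the dense array would.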

ilovemyminutes · 2024-11-03T23:32:48Z

code/models/SpatialWeightMatrix.py

+        '''
+        try:
+            return joblib.load(os.path.join(self.get_save_directory(dataset_type), f'weight_matrix_chunk_{chunk_id}.pkl'))
+        except FileNotFoundError:


Not raising a runtime error when the file cannot be found does not seem like a good choice.

joblib.load will raise an error on its own if the file is missing, so handling FileNotFoundError separately does not seem worthwhile either.
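A minimal sketch of a loader that simply lets the error propagate. `pickle` is used here as a standard-library stand-in for `joblib.load` so the example runs without third-party packages, and the function name and signature are hypothetical.

```python
import pickle
from pathlib import Path
from typing import Any


def load_weight_matrix_chunk(save_directory: Path, chunk_id: int) -> Any:
    """Load one chunk; a missing file raises FileNotFoundError to the caller."""
    chunk_path = Path(save_directory) / f"weight_matrix_chunk_{chunk_id}.pkl"
    # No try/except: opening a missing file already raises a clear error.
    with chunk_path.open("rb") as f:
        return pickle.load(f)
```

The caller then sees immediately which chunk file is missing, instead of receiving a silent fallback value that fails somewhere later.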

ilovemyminutes · 2024-11-03T23:33:08Z

code/models/XGBoostWithSpatialWeight.py

+class XGBoostWithSpatialWeight:
+
+    def __init__(self, spatial_weight_matrix, seed):
+        hyperparams = Config.XGBOOST_BEST_PARAMS


This variable assignment is unnecessary.

ilovemyminutes · 2024-11-03T23:36:32Z

code/models/inference.py

+        if np.any(np.isinf(prediction)):
+            raise ValueError("Prediction contains Inf values. This may be due to numerical instability in the model.")
+
+        if np.any(prediction <= 0):


prediction cannot be 0 or below. This error handling is unnecessary.

ilovemyminutes · 2024-11-03T23:37:28Z

code/models/model.py

+from sklearn.model_selection import KFold
+from sklearn.linear_model import LinearRegression
+
+def lightgbm(X, y, fitting : bool = True):


This is not a good function name. It should have been named something like fit_lightgbm.

ilovemyminutes · 2024-11-03T23:39:33Z

code/models/model.py

+        X_train, y_train, X_valid, y_valid, X_test = split_feature_target(train_data_n, valid_data_n, test_data_n)
+
+        if self.origin_model == 'xgboost':
+            new_model = model.xgboost(X_train, y_train)


This does not look like a good way to reference the function. Just reference the local function directly.

new_model = xgboost(X_train, y_train)

ilovemyminutes · 2024-11-03T23:39:57Z

code/models/model.py

+def xgboost(X_train, y_train, X_valid, y_valid, optimize=False):
+
+    if optimize:
+        def objective(trial):


The function name is ambiguous.

ilovemyminutes · 2024-11-03T23:40:33Z

code/models/model.py

+
+
+# Hyperparameter optimization function
+def objective(trial, model_name, X_selected, y):


The function name is ambiguous.

ilovemyminutes · 2024-11-03T23:41:31Z

code/models/model.py

+    return np.mean(fold_mae)
+
+# Per-model hyperparameter optimization and OOF prediction
+def optimize_and_predict(X_selected, y, test_data_selected, models, saved_best_params):


As a rule, a function should do only one job. I would have written the optimize function and the predict function separately.

ilovemyminutes · 2024-11-03T23:42:38Z

code/tabtransformer_main.py

+from tqdm import tqdm
+
+import model
+from inference import *


As already mentioned above, please avoid asterisk imports. They make it hard to tell what is being referenced and where it comes from.
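A short illustration of the difference. The contents of the project's `inference` module are not shown in the diff, so the standard-library `math` module is used as a stand-in.

```python
# Wildcard form: nothing on this line says which names arrive or where they come from.
# from math import *

# Explicit form: each name used below can be traced back to its module.
from math import floor, sqrt

# floor(sqrt(17)) uses only the two names imported above.
result = floor(sqrt(17))
```

The explicit form also lets linters flag unused or missing names, which a wildcard import hides.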

ilovemyminutes · 2024-11-03T23:43:41Z

code/utils/common_utils.py

+
+
+# Function to merge train and valid (builds the total dataset)
+def train_valid_concat(X_train, X_valid, y_train, y_valid):


There seems to be no need to make this a function.

github-classroom bot and others added 30 commits September 30, 2024 04:33

Setting up GitHub Classroom Feedback

fe2ed5c

feat: add submission and MAE score CSV conversion modules

2f5e7bf

- submission_to_csv: creates the final submission CSV
- mae_to_csv: creates a CSV for checking the MAE score of each experiment

feat: add requirements.txt

9757123

feat: add .gitignore file

1adad9b

feat: Define the data path

22e9a6f

feat: Adding seaborn

8155d2e

feat: Adding basic preprocessing

cc2a7c6

feat: Deleting unused library

e4b513c

remove: delete baseline code

0fc5f69

feat: update .gitignore

9ec8f0c

feat: add EDA result and library

0b09933

feat: add interest_rate to EDA

31eccc8

feat: map data visualization

5dd3b3a

feat: Adding time feature preprocessing function

8a57fc0

remove: Deleting an unuse file

96254fc

feat: creating a base inference file

72b8109

feat: creating base main file

b0f8cae

feat: creating basic lightgbm model

87a5b43

feat: collecting various preprocessing functions

8d3c8c5

feat: creating basic util files

a3dd3f9

remove: deleting an unused file

8f2661e

remove: deleting an unused file

3c30ded

feat: creating basic inference

e9e0231

feat: collecting various preprocessing functions

14f47e4

feat: creating base util files

8c94438

remove: deleting an unused file

bc061a6

remove: deleting an unused file

4618e3a

feat: add clustering density and centroid distance feature

3f5d9fc

feat: cluster by target location and add density and distance-to-centroid features

f2e2451

feat: creating mae, submission folder in result folder

fb00398

ilovemyminutes reviewed Nov 3, 2024
