Feedback #1
base: feedback
Conversation
- submission_to_csv: generates the final submission CSV
- mae_to_csv: generates a per-experiment CSV for checking MAE scores
from sklearn.neighbors import KDTree
from sklearn.cluster import KMeans

#from geopy.distance import great_circle
Unused code should be deleted.
### clustering

# clustering function
def clustering(total_df, info_df, feat_name, n_clusters=20):
- Function names are best written as verbs.
- A function's description belongs inside the function, not outside it.

def make_clustering(total_df, info_df, feat_name, n_clusters=20):
    """clustering function"""
    ...
# function counting transactions of the same apartment over the last n months
def transaction_count_function(train_data: pd.DataFrame, valid_data: pd.DataFrame, test_data: pd.DataFrame, months: int = 3) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # set the file path
    transaction_folder = os.path.join(Directory.root_path, 'data/transaction_data')
pathlib lets you write this more cleanly.
- 035 Handling file paths as objects? ― pathlib - https://wikidocs.net/110182
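As a minimal sketch of the pathlib suggestion (the root path string here is a stand-in for Directory.root_path):

```python
from pathlib import Path

root_path = "/project"  # stand-in for Directory.root_path

# Path objects compose with "/" and replace os.path.join chains
transaction_folder = Path(root_path) / "data" / "transaction_data"

print(transaction_folder.as_posix())  # /project/data/transaction_data
```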
total_data[f'transaction_count_last_{months}_months'] = 0

# group by latitude, longitude, built year, and area
grouped = total_data.groupby(['latitude', 'longitude', 'built_year', 'area_m2'])
This variable name doesn't look great. Please make it more explicit.
def create_school_counts_within_radius_by_school_level(train_data: pd.DataFrame, valid_data: pd.DataFrame, test_data: pd.DataFrame, radius : float = 0.02) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    school_info = Directory.school_info
    seoul_area_school = school_info[(school_info['latitude'] >= 37.0) & (school_info['latitude'] <= 38.0) &
                                    (school_info['longitude'] >= 126.0) & (school_info['longitude'] <= 128.0)]
Constants used for thresholding appear here; it would be better to pull them out into named variables and use those.
SCHOOL_LATITUDE_BOUNDARY = (37.0, 38.0)
SCHOOL_LONGITUDE_BOUNDARY = (126.0, 128.0)
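Applied to the filter above, the named boundaries might be used like this (a sketch with toy data; the column names follow the original):

```python
import pandas as pd

SCHOOL_LATITUDE_BOUNDARY = (37.0, 38.0)
SCHOOL_LONGITUDE_BOUNDARY = (126.0, 128.0)

school_info = pd.DataFrame({
    "latitude": [37.5, 36.5, 37.9],
    "longitude": [127.0, 127.0, 129.0],
})

# between() is inclusive on both ends, matching the original >= / <= comparisons
seoul_area_school = school_info[
    school_info["latitude"].between(*SCHOOL_LATITUDE_BOUNDARY)
    & school_info["longitude"].between(*SCHOOL_LONGITUDE_BOUNDARY)
]

print(len(seoul_area_school))  # 1
```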
    "enable_categorical": True
}

class Directory:
- Assigning pandas DataFrames as class variables doesn't seem like a good idea; holding them as instance variables would be more efficient.
- The class name Directory doesn't seem very appropriate.
# function counting subway stations within a radius
def create_subway_within_radius(train_data: pd.DataFrame, valid_data: pd.DataFrame, test_data: pd.DataFrame, radius : float = 0.01) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # subwayInfo is assumed to contain the latitude and longitude of each subway station
    subwayInfo = Directory.subway_info
In Python, camelCase is not used for variable names.
@@ -0,0 +1,169 @@
from sklearn.cluster import KMeans
from utils.constant_utils import Config, Directory
Please remove unnecessary imports.
if self.mode=="train" or self.mode=="valid":
    return (self.X[idx], self.y[idx])
else:
    return (self.X[idx])
Basically, a function's output should take a single shape. Rather than branching GridDataset on train/valid mode, I would have implemented TrainGridDataset and ValidGridDataset as separate classes.
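The split the reviewer describes could look like this (a sketch without the torch.utils.data.Dataset base class so it stays self-contained; the real classes would subclass it):

```python
class TrainGridDataset:
    """Serves (features, label) pairs; one fixed output shape."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


class TestGridDataset:
    """Serves features only; no labels exist at inference time."""
    def __init__(self, X):
        self.X = X

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx]
```

Each class now has a single, predictable `__getitem__` output, so callers never need to know which mode a dataset was built in.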
self.mode = mode
df = common_utils.merge_data(Directory.train_data, Directory.test_data)

### apply cluster features
Everything from here down seems to exist only to assign the X and y instance variables. This preprocessing should have been handled outside MLPDataset: a Dataset's job is to serve samples shaped appropriately for the model's input.
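One way to follow this advice is a hypothetical prepare step that runs before the Dataset is built, so the Dataset only stores and serves ready-made arrays (a sketch; the names and toy data are illustrative):

```python
import numpy as np

def prepare_arrays(rows, feature_keys, target_key):
    """Hypothetical preprocessing done outside the Dataset."""
    X = np.array([[row[k] for k in feature_keys] for row in rows], dtype=np.float32)
    y = np.array([row[target_key] for row in rows], dtype=np.float32)
    return X, y

class MLPDataset:
    """Only stores prepared arrays and serves samples."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

rows = [{"area": 84.0, "deposit": 3.0}, {"area": 59.0, "deposit": 2.0}]
X, y = prepare_arrays(rows, ["area"], "deposit")
dataset = MLPDataset(X, y)
```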
from features import clustering_features, count_features, deposit_features, distance_features, other_features

def feature_engineering(train_data_ : pd.DataFrame , valid_data_ : pd.DataFrame , test_data_ : pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
A verb-style name such as def apply_feature_engineering would be better.
# apply one-hot encoding to categorical variables
train_data_encoded = pd.get_dummies(train_data, columns=categorical_cols, drop_first=True)
valid_data_encoded = pd.get_dummies(valid_data, columns=categorical_cols, drop_first=True)
test_data_encoded = pd.get_dummies(test_data, columns=categorical_cols, drop_first=True)
get_dummies does the one-hot encoding fine for this project, but for more complex projects later on it's worth keeping these questions in mind:
- (1) Would this code still work if the data had more than 100 million rows?
- (2) Would it still work if a categorical variable had more than 100 million categories?
- If either (1) or (2) is a concern, how should the code be written, and how should the one-hot encoding be done?
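One common answer to both questions is a sparse encoder fitted on the training split only (a sketch using scikit-learn's OneHotEncoder with toy frames; fitting once and transforming each split also avoids the mismatched-columns problem that per-split get_dummies calls can cause):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"city": ["seoul", "busan", "seoul"]})
test = pd.DataFrame({"city": ["busan", "daegu"]})

# Sparse output keeps memory proportional to the nonzero count, and
# handle_unknown="ignore" zero-encodes categories never seen in train.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train[["city"]])
test_encoded = encoder.transform(test[["city"]])

print(train_encoded.shape)  # (3, 2)
print(test_encoded.shape)   # (2, 2)
```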
# log transform
# train_data_ = pre.log_transform(train_data_, 'deposit')
# valid_data_ = pre.log_transform(valid_data_, 'deposit')
All unnecessary code must be removed.
### build the final dataset (top_20_features)
selected_columns = Config.TOP_20_FEATURES
There was no real need to assign this to a new variable.
x = self.pool(self.relu(self.conv2(x)))  # (N, 32, 21, 14) -> (N, 64, 10, 7)

# Flatten the output of the conv layers
x = x.view(-1, 64 * 10 * 7)  # Flatten: (N, 64, 10, 7) -> (N, 64 * 10 * 7)
Just store the precomputed result. There's no need to repeat two multiplications on every forward pass.
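Concretely, the product could be computed once at class scope (a sketch; the constant names are illustrative):

```python
# Computed once, instead of evaluating 64 * 10 * 7 on every forward pass
CONV_OUT_SHAPE = (64, 10, 7)   # (channels, height, width) after conv2 + pool
FLATTEN_DIM = 64 * 10 * 7      # precomputed flatten dimension

# inside forward(): x = x.view(-1, FLATTEN_DIM)
print(FLATTEN_DIM)  # 4480
```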
class TabTransformerTrainer:
    def __init__(self, model, optimizer, loss_fn, device):
        """Initialize TabTransformerTrainer."""
This docstring isn't really needed; the magic method name __init__ alone is self-explanatory.
def train(self, train_data, dataset_type):
    for seed in self.seeds:
        model_ = self.model_class(self.spatial_weight_matrix, seed=seed)
Please don't append unnecessary underscores (_).
dataset_type : train, valid, test, train_total  # train_total : used when training on train + valid combined
'''
dir_path = os.path.join(self.base_save_directory, dataset_type)
os.makedirs(dir_path, exist_ok=True)
Given the function's purpose, the directory-creation part shouldn't be here; if you want to include it, the function name needs to change.
os.makedirs(dir_path, exist_ok=True)
return dir_path

def create_weight_matrix(self, data_chunk, chunk_id, dataset_type, tree):
It's hard to guess the shape of each argument. Type hints look necessary.
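A hinted signature might look like this (a sketch; the argument types are guesses from how the function appears to be used, and the body is elided):

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import KDTree

def create_weight_matrix(
    data_chunk: pd.DataFrame,
    chunk_id: int,
    dataset_type: str,
    tree: KDTree,
) -> csr_matrix:
    """Signature sketch only: the annotations document each argument's shape."""
    ...
```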
for j in range(self.k):
    weight_matrix[i, indices[i, j]] = weights[j]

sparse_matrix = csr_matrix(weight_matrix)  # store the generated spatial weight matrix as a sparse matrix
(As I believe I mentioned during mentoring) if this spatial weight matrix isn't actually sparse, there's no reason to use csr_matrix.
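That said, a k-nearest-neighbour weight matrix has only k nonzeros per row, so it is sparse whenever k is much smaller than n; the bigger win is building it in sparse form directly instead of filling a dense array first (a sketch with tiny hypothetical dimensions):

```python
import numpy as np
from scipy.sparse import csr_matrix

n, k = 4, 2
indices = np.array([[1, 2], [0, 3], [3, 1], [2, 0]])  # k neighbour indices per row
weights = np.array([0.7, 0.3])                        # the same k weights per row, as in the loop

rows = np.repeat(np.arange(n), k)   # each row index repeated k times
cols = indices.ravel()              # neighbour indices flattened
data = np.tile(weights, n)          # weight vector tiled once per row

# Built directly in sparse form: no (n, n) dense array is ever allocated
sparse_matrix = csr_matrix((data, (rows, cols)), shape=(n, n))
print(sparse_matrix.nnz)  # 8
```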
'''
try:
    return joblib.load(os.path.join(self.get_save_directory(dataset_type), f'weight_matrix_chunk_{chunk_id}.pkl'))
except FileNotFoundError:
- Not raising a runtime error when the file can't be found doesn't look right.
- Since joblib.load will raise its own error when the file is missing anyway, handling FileNotFoundError separately doesn't look great either.
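Dropping the handler gives the simpler version (a sketch outside the class; the save_dir argument stands in for self.get_save_directory(dataset_type)):

```python
import os
import joblib

def load_weight_matrix_chunk(save_dir: str, chunk_id: int):
    # No try/except: a missing file makes joblib.load raise
    # FileNotFoundError with the offending path, which is the clearest signal.
    return joblib.load(os.path.join(save_dir, f"weight_matrix_chunk_{chunk_id}.pkl"))
```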
class XGBoostWithSpatialWeight:

    def __init__(self, spatial_weight_matrix, seed):
        hyperparams = Config.XGBOOST_BEST_PARAMS
This variable assignment is unnecessary.
if np.any(np.isinf(prediction)):
    raise ValueError("Prediction contains Inf values. This may be due to numerical instability in the model.")

if np.any(prediction <= 0):
prediction can never be 0 or below, so this error handling is unnecessary.
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def lightgbm(X, y, fitting : bool = True):
Not a good function name. It should have been named something like fit_lightgbm.
X_train, y_train, X_valid, y_valid, X_test = split_feature_target(train_data_n, valid_data_n, test_data_n)

if self.origin_model == 'xgboost':
    new_model = model.xgboost(X_train, y_train)
This doesn't look like a good way to reference it; just reference the local function directly.
new_model = xgboost(X_train, y_train)
def xgboost(X_train, y_train, X_valid, y_valid, optimize=False):

    if optimize:
        def objective(trial):
The function name is ambiguous.
# hyperparameter optimization function
def objective(trial, model_name, X_selected, y):
The function name is ambiguous.
return np.mean(fold_mae)

# per-model hyperparameter optimization and OOF prediction
def optimize_and_predict(X_selected, y, test_data_selected, models, saved_best_params):
A function should basically perform a single task. I would have written separate optimize and predict functions.
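Split into single-purpose pieces, the shape might be as below (a sketch with hypothetical names; a real objective would wrap the Optuna study rather than this plain loop):

```python
def optimize_hyperparameters(objective, n_trials=50):
    """First task: run the search and return only the best parameter set."""
    best_params, best_score = None, float("inf")
    for trial in range(n_trials):
        params, score = objective(trial)
        if score < best_score:
            best_params, best_score = params, score
    return best_params

def predict_with_params(model_factory, params, X_train, y_train, X_test):
    """Second task: fit a model with the chosen params, then predict."""
    model = model_factory(**params)
    model.fit(X_train, y_train)
    return model.predict(X_test)
```

Each function can then be tested and reused independently, and `optimize_and_predict` becomes two obvious call sites.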
from tqdm import tqdm

import model
from inference import *
As already mentioned above, please avoid importing with an asterisk; it makes it hard to tell what is being referenced from where.
# function merging train and valid (building the total dataset)
def train_valid_concat(X_train, X_valid, y_train, y_valid):
There's no real need to make this into a function.
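Inlined at the call site, the helper reduces to two concat calls (a sketch with toy frames):

```python
import pandas as pd

X_train = pd.DataFrame({"f": [1, 2]})
X_valid = pd.DataFrame({"f": [3]})
y_train = pd.Series([10, 20])
y_valid = pd.Series([30])

# One line each, so a dedicated train_valid_concat helper adds little
X_total = pd.concat([X_train, X_valid], ignore_index=True)
y_total = pd.concat([y_train, y_valid], ignore_index=True)

print(len(X_total))  # 3
```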
👋! GitHub Classroom created this pull request as a place for your teacher to leave feedback on your work. It will update automatically. Don’t close or merge this pull request, unless you’re instructed to do so by your teacher.
In this pull request, your teacher can leave comments and feedback on your code. Click the Subscribe button to be notified if that happens.
Click the Files changed or Commits tab to see all of the changes pushed to the default branch since the assignment started. Your teacher can see this too.
Notes for teachers
Use this PR to leave feedback. Here are some tips:
For more information about this pull request, read “Leaving assignment feedback in GitHub”.
Subscribed: @park-jaeuk @choitaesoon @jinnk0 @Cyberger @JaeEunSeo