
why it’s not possible to use n_jobs = n, like in scikit-learn #830

Open · lukaspistelak opened this issue Dec 19, 2024 · 5 comments

@lukaspistelak

Hello, I would like to ask why it’s not possible to use n_jobs = n, like in scikit-learn. I have to select 3-5 features out of 700, and it takes 2 hours. :/ So experimenting is quite hard and slow. 👍


```python
tr = RecursiveFeatureAddition(
    estimator=lgb_model, cv=cv, scoring='average_precision', threshold=0.002
)
Xt = tr.fit_transform(X, y)
```

Thanks

@solegalli
Collaborator

Hi @lukaspistelak

To select from 700 features, this transformer trains one model per feature, multiplied by the number of cross-validation folds. So if you set cv to 5, it will train 700 x 5 models. That is probably why it takes so long. LightGBM models can also be slow to train, depending on the number of trees. If the LightGBM takes n_jobs, you should set it there.
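In other words, the parallelism belongs on the estimator, not on the selector. A minimal sketch of the idea, using scikit-learn's RandomForestClassifier as a stand-in (the same pattern applies to `LGBMClassifier(n_jobs=...)` if lightgbm is installed):

```python
# The selector itself runs sequentially; parallelism comes from the
# estimator passed into it. Stand-in sketch with scikit-learn's
# RandomForestClassifier; with LightGBM you would pass n_jobs to
# LGBMClassifier in exactly the same way.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, n_jobs=4)  # 4 threads per fit
```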

It's hard to say a priori whether 2 hours is long or short, because it depends on the LightGBM configuration, the size of your data, and your available computing resources. If you send more details about how you set up the entire search, I might be able to provide some tips.

Cheers

@lukaspistelak
Author

lukaspistelak commented Dec 23, 2024

Thanks for your response and help! 😊

I tried to add the n_jobs parameter, but it didn’t help. 😕

Here are the LightGBM model parameters I’m using:

```python
params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',

    'max_depth': 5,          # Smaller tree, less complexity

    'lambda_l1': 0.1,        # L1 regularization
    'lambda_l2': 0.1,        # L2 regularization

    # 'learning_rate': 0.1,  # Lower learning rate for more gradual training
    'verbose': -1,           # Suppress output
    'n_jobs': 3
}

num_round = 5
```
1. cv is not 5, but 45.
2. The data size is roughly 3k rows and 700 columns.
3. The features are generated with the same method (a transformer) under different parameters, so I need to select the features with the best parameters.
4. Features with high correlation can be selected; that doesn't mean they carry no useful information.

@solegalli
Collaborator

Why do you use 45 as cv? That makes the selector train 700 x 45 models, which is what's making it take so long. I normally use 3 or 5.

Feature-engine relies heavily on sklearn, so we leverage the n_jobs parameter implemented in most sklearn classes. We don't add parallelization on top of the parallelization already contained in sklearn, because most of our routines are not so computationally heavy.
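The cost difference is easy to quantify; assuming one model fit per candidate feature per CV split (ignoring any extra warm-up fits):

```python
# Back-of-the-envelope fit counts for RecursiveFeatureAddition:
# one fit per candidate feature per CV split.
n_features = 700
print(n_features * 45)  # 31500 fits with the 45-combination CV
print(n_features * 5)   # 3500 fits with plain 5-fold CV
```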

@lukaspistelak
Author

> Why do you use 45 as cv? That makes the selector train 700 x 45 models, which is what's making it take so long. I normally use 3 or 5.
>
> Feature-engine relies heavily on sklearn, so we leverage the n_jobs parameter implemented in most sklearn classes. We don't add parallelization on top of the parallelization already contained in sklearn, because most of our routines are not so computationally heavy.

```python
cv = CombPurgedKFoldCVLocal(n_splits=10, n_test_splits=3, X_index=X.index)
```

which has 45 combinations

@solegalli
Collaborator

Interesting! Thank you. I haven't heard of that cross-validation framework before.

RecursiveFeatureAddition will test all features, in your case 700. So if you only need to select 3, that is a lot of testing for no reason. It will also not select exactly 3, but however many satisfy the threshold condition. We could add functionality to make it stop after a given number of features has been found in the next round of Feature-engine updates.

An alternative and similar search is the SFS from MLXtend, with the search set to forward. You can make that transformer stop after it finds a certain number of features, so if you stop at 5 it should, in theory, take less time, although the search procedure is not identical to RecursiveFeatureAddition (but more or less).
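For reference, scikit-learn itself also ships a forward search of this kind, SequentialFeatureSelector, which likewise stops once a fixed number of features is reached. A minimal sketch on toy data (not the author's setup; mlxtend's SFS offers a similar interface):

```python
# Forward selection that stops at a fixed feature count, using
# scikit-learn's SequentialFeatureSelector on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,    # stop once 5 features are found
    direction="forward",
    cv=3,
)
sfs.fit(X, y)
print(sfs.get_support().sum())  # 5
```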

Another alternative is to set up a simpler LightGBM. If you check out the theory on successive halving in sklearn, you'll see that with simpler models you can already find out what works best. So you could train a LightGBM with fewer estimators and shallower depth, reduce the feature space from 700 to 20, and then increase the complexity of the model to finalize the set of features, if that makes sense.
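A hypothetical sketch of that two-stage screen (an illustration of the idea, not the author's code): a cheap model prunes the feature space by importance, and a stronger model can then refine the much smaller set.

```python
# Stage 1 of a two-stage screen: a small, shallow booster ranks features
# and SelectFromModel keeps the top 20 (threshold=-inf disables the
# importance cutoff so max_features alone decides). Stage 2 (not shown)
# would run the expensive selector on the reduced set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=50, random_state=0)
cheap = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0)
screen = SelectFromModel(cheap, max_features=20, threshold=-np.inf)
X_small = screen.fit_transform(X, y)
print(X_small.shape)  # (300, 20)
```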

I hope this helps!
