
why it’s not possible to use n_jobs = n, like in scikit-learn #830

Open · lukaspistelak opened this issue Dec 19, 2024 · 5 comments

@lukaspistelak

Hello, I would like to ask why it’s not possible to use n_jobs = n, like in scikit-learn. I have to select 3-5 features out of 700, and it takes 2 hours. :/ So experimenting is quite hard and slow. 👍


```python
tr = RecursiveFeatureAddition(
    estimator=lgb_model, cv=cv, scoring='average_precision', threshold=0.002
)
Xt = tr.fit_transform(X, y)
```

Thanks

@solegalli
Collaborator

Hi @lukaspistelak

To select from 700 features, this transformer trains one model per feature, multiplied by the number of cross-validation folds. So if you set cv to 5, it will train 700 x 5 models. That is probably why it takes so long. LightGBM models can also be slow to train, depending on the number of trees. If the LightGBM takes n_jobs, you should set it there.
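In other words, the parallelism belongs on the estimator, not on the selector. A minimal sketch of the idea, using scikit-learn's RandomForestClassifier as a stand-in (the same pattern applies to `LGBMClassifier(n_jobs=...)` if lightgbm is installed):

```python
# The selector itself runs sequentially; parallelism comes from the
# estimator passed into it. Stand-in sketch with scikit-learn's
# RandomForestClassifier; with LightGBM you would pass n_jobs to
# LGBMClassifier in exactly the same way.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, n_jobs=4)  # 4 threads per fit
```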

It's hard to say a priori whether 2 hours is long or short, because it depends on the LightGBM configuration, the size of your data, and your available computing resources. If you send more details about how you set up the entire search, I might be able to provide some tips.

Cheers

@lukaspistelak
Author

lukaspistelak commented Dec 23, 2024

Thanks for your response and help! 😊

I tried to add the n_jobs parameter, but it didn’t help. 😕

Here are the LightGBM model parameters I’m using:

```python
params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',

    'max_depth': 5,          # Smaller tree, less complexity

    'lambda_l1': 0.1,        # L1 regularization
    'lambda_l2': 0.1,        # L2 regularization

    # 'learning_rate': 0.1,  # Lower learning rate for more gradual training
    'verbose': -1,           # Suppress output
    'n_jobs': 3
}

num_round = 5
```
1. cv is not 5, but 45.
2. The data size is roughly 3k rows and 700 columns.
3. The features are generated with the same method (a transformer) under different parameters, so I need to select the features with the best parameters.
4. Features with high correlation can be selected; that doesn't mean they carry no useful information.

@solegalli
Collaborator

Why do you use 45 as cv? That makes the selector train 700 x 45 models, which is what's making it take so long. I normally use 3 or 5.

Feature-engine relies heavily on sklearn, so we leverage the n_jobs parameter implemented in most sklearn classes. We don't add parallelization on top of the parallelization already contained in sklearn, because most of our routines are not so computationally heavy.
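The cost difference is easy to quantify; assuming one model fit per candidate feature per CV split (ignoring any extra warm-up fits):

```python
# Back-of-the-envelope fit counts for RecursiveFeatureAddition:
# one fit per candidate feature per CV split.
n_features = 700
print(n_features * 45)  # 31500 fits with the 45-combination CV
print(n_features * 5)   # 3500 fits with plain 5-fold CV
```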

@lukaspistelak
Author

> Why do you use 45 as cv? That makes the selector train 700 x 45 models, which is what's making it take so long. I normally use 3 or 5.
>
> Feature-engine relies heavily on sklearn, so we leverage the n_jobs parameter implemented in most sklearn classes. We don't add parallelization on top of the parallelization already contained in sklearn, because most of our routines are not so computationally heavy.

```python
cv = CombPurgedKFoldCVLocal(n_splits=10, n_test_splits=3, X_index=X.index)
```

which has 45 combinations

@solegalli
Collaborator

Interesting! Thank you. I haven't heard of that cross-validation framework before.

RecursiveFeatureAddition will test all features, in your case 700. So if you only need to select 3, that is a lot of testing for no reason. It will also not select exactly 3, but however many satisfy the threshold condition. We could add functionality to make it stop after a given number of features has been found in the next round of Feature-engine updates.

An alternative and similar search is the SFS from MLXtend, with the search set to forward. You can make that transformer stop after it finds a certain number of features, so if you stop at 5 it should, in theory, take less time, although the search procedure is not identical to RecursiveFeatureAddition (but more or less).
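For reference, scikit-learn itself also ships a forward search of this kind, SequentialFeatureSelector, which likewise stops once a fixed number of features is reached. A minimal sketch on toy data (not the author's setup; mlxtend's SFS offers a similar interface):

```python
# Forward selection that stops at a fixed feature count, using
# scikit-learn's SequentialFeatureSelector on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,    # stop once 5 features are found
    direction="forward",
    cv=3,
)
sfs.fit(X, y)
print(sfs.get_support().sum())  # 5
```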

Another alternative is to set up a simpler LightGBM. If you check out the theory on successive halving in sklearn, you'll see that with simpler models you can already find out what works best. So you could train a LightGBM with fewer estimators and shallower depth, reduce the feature space from 700 to 20, and then increase the complexity of the model to finalize the set of features, if that makes sense.
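A hypothetical sketch of that two-stage screen (an illustration of the idea, not the author's code): a cheap model prunes the feature space by importance, and a stronger model can then refine the much smaller set.

```python
# Stage 1 of a two-stage screen: a small, shallow booster ranks features
# and SelectFromModel keeps the top 20 (threshold=-inf disables the
# importance cutoff so max_features alone decides). Stage 2 (not shown)
# would run the expensive selector on the reduced set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=50, random_state=0)
cheap = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0)
screen = SelectFromModel(cheap, max_features=20, threshold=-np.inf)
X_small = screen.fit_transform(X, y)
print(X_small.shape)  # (300, 20)
```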

I hope this helps!
