-
-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New feature: Lag or windows features grouped by #727
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #727 +/- ##
==========================================
+ Coverage 98.31% 98.33% +0.01%
==========================================
Files 103 103
Lines 3928 3965 +37
Branches 764 774 +10
==========================================
+ Hits 3862 3899 +37
Misses 23 23
Partials 43 43 ☔ View full report in Codecov by Sentry. |
feature_engine/timeseries/forecasting/base_forecast_transformers.py
Outdated
Show resolved
Hide resolved
_check_X_matches_training_df, | ||
check_X, | ||
) | ||
from feature_engine.dataframe_checks import (_check_contains_inf, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that black, isort and flack8 will automatically adjust the formatting and code style, because no problems appeared during development, and CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this is the case...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On circleCI we have automatic checks for formatting, but it does not fix it automatically. You need to isort and black your files before pushing them for the tests to pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to resolve it, and push again.
if isinstance(group_by_variables, list): | ||
if len(set(group_by_variables)) != len(group_by_variables): | ||
raise ValueError("group_by_variables contains duplicate values") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To improve the code I will do the following: after the checks on group_by_variables
, I will put this variable into a list if the variable is just a string. In this way from now on you know that you are working with a list of string and you don't need to do other checks (like for example the one at line 180).
You could do something like
if group_by_variables:
if isinstance(group_by_variables, str):
self.group_by_variables = [group_by_variables]
elif not (
isinstance(group_by_variables, list) and
all(isinstance(element, str) for element in group_by_variables)
):
raise ValueError(
"group_by_variables must be an string or a list of strings. "
f"Got {group_by_variables} instead."
)
else:
# note that if you are here, then group_by_variables is a list
if len(set(group_by_variables)) != len(group_by_variables):
raise ValueError("group_by_variables contains duplicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, for consistency with the sklearn api we cannot modify the input parameters. We need to leave them as they are.
@@ -51,6 +42,9 @@ class BaseForecastTransformer(BaseEstimator, TransformerMixin, GetFeatureNamesOu | |||
|
|||
{drop_original} | |||
|
|||
group_by_variables: str, list of str, default=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd call this method group_by
to keep it similar to pandas method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we do allow variable names to be integers, so in theory it could also take int and list of ints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sure I am going to rename this parameter, and also allow to pass it as integer or list of integers, and use check_variables methods to check if it exists in dataframe, thank you.
# check if passed list has duplicates. | ||
if isinstance(group_by_variables, list): | ||
if len(set(group_by_variables)) != len(group_by_variables): | ||
raise ValueError("group_by_variables contains duplicate values") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like tests for these 2 error catches are missing? Haven't gotten to the end of the PR so forgive me if I am wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to add some test cases for them.
@@ -165,6 +177,18 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None): | |||
if self.missing_values == "raise": | |||
self._check_na_and_inf(X) | |||
|
|||
if self.group_by_variables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add this functionality to feature-engine? Pandas will fail if the variables are not in the dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@solegalli Yes, That's True Pandas will fail because of that, I am going to remove it, and push again.
@@ -165,6 +177,18 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None): | |||
if self.missing_values == "raise": | |||
self._check_na_and_inf(X) | |||
|
|||
if self.group_by_variables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing that comes before this, in line 168, we are explicity forbidding duplicate values in the index, but if we allow groupby, we need to think what we do with this check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@solegalli I think no duplicates will be in resultant dataframe, because we are creating the features for every group like the example below:
>>> import pandas as pd
>>> from feature_engine.timeseries.forecasting import ExpandingWindowFeatures
>>> X = pd.DataFrame(dict(date = ["2022-09-18",
>>> "2022-09-19",
>>> "2022-09-20",
>>> "2022-09-21",
>>> "2022-09-22",
>>> "2022-09-18",
>>> "2022-09-19",
>>> "2022-09-20",
>>> "2022-09-21",
>>> "2022-09-22"],
>>> x1 = [1,2,3,4,5, 3,5,6,8,11],
>>> x2 = [6,7,8,9,10, 2,9,10,15,2],
>>> x3=['a','a','a','a','a', 'b','b','b','b','b']
>>> ))
>>> ewf = ExpandingWindowFeatures(group_by_variables='x3')
>>> ewf.fit_transform(X)
the result is:
date x1 x2 x3 x1_expanding_mean x2_expanding_mean
0 2022-09-18 1 6 a NaN NaN
1 2022-09-19 2 7 a 1.000000 6.0
2 2022-09-20 3 8 a 1.500000 6.5
3 2022-09-21 4 9 a 2.000000 7.0
4 2022-09-22 5 10 a 2.500000 7.5
5 2022-09-18 3 2 b NaN NaN
6 2022-09-19 5 9 b 3.000000 2.0
7 2022-09-20 6 10 b 4.000000 5.5
8 2022-09-21 8 15 b 4.666667 7.0
9 2022-09-22 11 2 b 5.500000 9.0
returned expanding window features | ||
""" | ||
tmp_data = [] | ||
for _, group in grouped_df: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do we need to loop?
Are we creating a grouped df for every variable passed to group_by_variables?
And is this the desired functionality? For time series forecasting, would we not have all ts in 1 col and then we would group by one or more variables that identify the ts, but we would not create many groups?
When would we need to create many groups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me explain what I need to do here, the reason behind adding group_by_variables
to time series transformers is because of this issue #668 , when we need to create some lags, rolling window, or expanding window features based on a set of groups.
the above code loop over the set of groups to create the features for every group then concatenate them, and sort by index to return the dataframe to its original
let me explain it in the following code
X = pd.DataFrame(dict(date = ["2022-09-18",
"2022-09-19",
"2022-09-20",
"2022-09-21",
"2022-09-22",
"2022-09-18",
"2022-09-19",
"2022-09-20",
"2022-09-21",
"2022-09-22"],
x1 = [1,2,3,4,5, 3,5,6,8,11],
x2 = [6,7,8,9,10, 2,9,10,15,2],
x3=['a','a','a','a','a', 'b','b','b','b','b'],
x4=['c','c','c','w','w','c','c','w','w','w']
))
X_grouped = X.groupby(['x3', 'x4'])
for _, group in X_grouped:
print(group)
the result is the dataframes of every group of ('x3', 'x4')
date x1 x2 x3 x4
0 2022-09-18 1 6 a c
1 2022-09-19 2 7 a c
2 2022-09-20 3 8 a c
date x1 x2 x3 x4
3 2022-09-21 4 9 a w
4 2022-09-22 5 10 a w
date x1 x2 x3 x4
5 2022-09-18 3 2 b c
6 2022-09-19 5 9 b c
date x1 x2 x3 x4
7 2022-09-20 6 10 b w
8 2022-09-21 8 15 b w
9 2022-09-22 11 2 b w
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thank you for the explanation. Pandas should apply shift and rolling and expanding to the groups out of the box, there is no need to loop, as far as I understand. See for example these resources: https://www.statology.org/pandas-lag-by-group/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Ezzaldin97
Thank you so much for the enormous contribution!
Apologies for the delayed review. There are a few things that I don't understand regarding the logic we are trying to implement.
this transformer is designed for forecasting. In forecasting we normally have all time series in one column and then we have one or more columns that identify them with an id or a combination of characteristics for example store number and product number.
So I thought we'd group one either by one or many variables. But the loop threw me off. maybe there is something that I am not understanding?
Or maybe you see a use case for multiple grouping? in which case we'd need to allow both functionalities to coexist,
Thank you for your patience and support!
hi @solegalli, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Ezzaldin97
Thank you so much for the amount of work that went into implementing this functionality. I am really sorry for the delay in taking a look at the proposed changes. I know it is frustrating when you put a lot of work towards something, and it is not acknowledged. I apologize. I had a very tight agenda in the last 2 months, which means that I couldn't really catch up with what was going on with Feature-engine.
I think we are aligned with the idea of the functionality that we need to add to these classes. My first impression is that the logic should be much simpler, because pandas can groupby, and then apply shift and rolling to the groups directly, without us having to loop over the groups.
It would be great if you could take a look and try to simplify the logic of the classes and outsource everything to pandas.
Let me know if this makes sense. Thank you so much for your effort!
@@ -475,7 +475,7 @@ def fit(self, X: pd.DataFrame, y: pd.Series = None): | |||
threshold_cat = self.threshold | |||
|
|||
# Compute the PSI by looping over the features | |||
self.psi_values_ = {} | |||
self.psi_values_: Dict = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We resolved this in a different PR. Could we remove this change from here please?
@@ -51,6 +51,9 @@ class BaseForecastTransformer(BaseEstimator, TransformerMixin, GetFeatureNamesOu | |||
|
|||
{drop_original} | |||
|
|||
group_by: str, str, int, or list of strings or integers, default=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are using pandas groupby under the hood, so the docs here should probably be identical or just a summary of what we see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html and then refer the user to pandas groupby's documentation for more details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated it using a summary of pandas groupby
@@ -81,6 +85,7 @@ def __init__( | |||
self.variables = _check_variables_input_value(variables) | |||
self.missing_values = missing_values | |||
self.drop_original = drop_original | |||
self.group_by = _check_variables_input_value(group_by) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we defer the functionality to pandas, then we don't need this check. We just assign and let pandas handle the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, pandas will handle it, Thanks for your help 🙏
@@ -93,6 +93,9 @@ class ExpandingWindowFeatures(BaseForecastTransformer): | |||
|
|||
{drop_original} | |||
|
|||
group_by: str, str, int, or list of strings or integers, default=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are repeating the same string over and over, instead of writing it multiple times, we'd create a single text in the _docstrings
module, and import it instead. Like we do with fit_transform
for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added to _docstring, Thanks for the clarification.
@@ -139,6 +142,36 @@ class ExpandingWindowFeatures(BaseForecastTransformer): | |||
2 2022-09-20 3 8 1.5 6.5 | |||
3 2022-09-21 4 9 2.0 7.0 | |||
4 2022-09-22 5 10 2.5 7.5 | |||
create expanding window features based on other variables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the example in the class' docstrings is just meant for the user to "copy and paste" a simple example, not a full blown demo. For that we have the user guide. Could we please keep the original example?
""" | ||
tmp_data = [] | ||
for _, group in grouped_df: | ||
tmp = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to loop over each group. Pandas does that under the hood if I recall correctly. So we'd just add groupby before .expanding. Check these resources:
https://www.statology.org/pandas-lag-by-group/
https://stackoverflow.com/questions/37231844/pandas-creating-a-lagged-column-with-grouped-data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a simple way to perform the group_by operation to calculate expanding window features using the .apply() method in pandas
@@ -117,6 +120,26 @@ class LagFeatures(BaseForecastTransformer): | |||
2 2022-09-20 3 8 2.0 7.0 1.0 6.0 | |||
3 2022-09-21 4 9 3.0 8.0 2.0 7.0 | |||
4 2022-09-22 5 10 4.0 9.0 3.0 8.0 | |||
create lags based on other variables. | |||
>>> import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please keep the original example? Demos go in the user-guide :)
lag feature or dataframe of lag features | ||
""" | ||
tmp_data = [] | ||
for _, group in grouped_df: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to loop over the groups to apply the lags? pandas does the lags per group automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried many approaches to simplify this approach, but it is only working when using periods
argument with shift()
method like the line in 231
, however when using freq
argument with shift()
method it doesn't work, so I used loop to make it work.
kindly advice if we can simplify it.
returned expanding window features | ||
""" | ||
tmp_data = [] | ||
for _, group in grouped_df: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thank you for the explanation. Pandas should apply shift and rolling and expanding to the groups out of the box, there is no need to loop, as far as I understand. See for example these resources: https://www.statology.org/pandas-lag-by-group/
Hi @solegalli |
@solegalli @Ezzaldin97 if it is ok with you, I would like to implement lag and window feature updates. I wanted to first confirm the design of the implementation before starting. When working with multiple time series, we often have covariates in time-series order, some having future values available (such as weather features), some not; we also use these lag and window features for a target. Therefore, for each variable, one needs to apply different window sizes, and also, for each window, different aggregation methods become important. I suggest using the define Python
after agreeing on the design, I can start on this feature. |
Hey @KananMahammadli I do agree with you that we'd use different windows and different operations for different features, and allowing the split with the dictionary will be handy for the most advanced user. For less experienced users, the more the class can resolve by itself the better, and in this sense, applying the same window and same functions to all features is sort of the simplest. This PR is about allowing the transformers to create and add the lags and windows grouped by a third variable or variables that would, generally indicate some grouping, like name of the product, or land of sale, or this kind of thing. A different PR could be to extend the functionality of the class so that it offers more granularity for the advanced user in the method that you just described. And if we do so, we need to do it in a way that does not break backward compatibility whenever possible. I'd prefer if we could add the grouping function first and the granularity later. But this is open source, so we welcome all help, if you want to start otherwise, you are free to do it :) |
hi @solegalli, Sorry for not being clear enough; since this PR doesn't move forward, I thought of handling groupby in a separate PR (because finding why implementation here fails and correcting the bugs is time-consuming, modifying the current version of the feature engine to support groups is a better option). While adding the groupby, I also considered adding the flexibility of variables. I agree that it can confuse some users. Therefore I would like to get your opinion; we can do it in two ways I think:
|
Thanks for the detail. Feel free to start a new PR for the groupby function. Please add a note to mention that it superseeds this PR so we can keep track :) I think single class is easier to maintain and it avoids boilerplate. So I'd suggest sticking to one class, unless the code base is very different, in which case, it could be justified. For end users I think it gets confusing if we have 2 classes that do more or less the same. If we add the new functionality in one class, it helps if we keep the addition of additional parameters to the minimum. Here for example:
If the user enters an imputer_dict, that superseeds the parameter arbitrary value, and we make it clear in the docstrings, so that the user only needs to change one parameter for the intended functionality. The classes for lags and windows, as they stand today, they have the same parameters that pandas lag and rolling and shift. So I'd probably would not change that, because the users, or at least some of them, are familiar with those parameters already. So it is easy to jump from pandas to feature-engine. I would add an extra parameter called, I don't know lag_dictionary, or, something informative (can't think of anything now), and when that parameter is not None, then we have the special functionality that you mention. It would be easier to review, if we have 1 pr for each change. Thanks a lot for the initiative. |
Create a New feature for the #668 issue