How does OneHotEncoder handle difference in categories in variable names? Currently a mismatch in shape? #832

Open
Morgan-Sell opened this issue Jan 6, 2025 · 0 comments

Describe the bug
Two related bugs:

  1. A variable that is one-hot encoded in the training dataset has categorical values that do not exist in the testing dataset.
  2. A variable that is one-hot encoded in the testing dataset has categorical values that do not exist in the **training dataset**.

In both cases the transformed training and testing dataframes end up with different shapes, which raises an error in a pipeline.
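
To make the two cases concrete, here is a minimal sketch of how the column sets drift apart when each split is encoded on its own. It uses `pandas.get_dummies` as a stand-in, not feature-engine's OneHotEncoder, and the `colour` data is made up for illustration; it is not the reproduction from my pipeline.

```python
import pandas as pd

# Hypothetical toy data: "blue" and "green" appear only in train (bug 1),
# "purple" appears only in test (bug 2).
X_train = pd.DataFrame({"colour": ["red", "blue", "green", "red"]})
X_test = pd.DataFrame({"colour": ["red", "purple"]})

# Encoding each split independently derives the columns from that split alone.
train_encoded = pd.get_dummies(X_train, columns=["colour"], dtype=int)
test_encoded = pd.get_dummies(X_test, columns=["colour"], dtype=int)

# Columns that exist on one side only.
only_in_train = sorted(set(train_encoded.columns) - set(test_encoded.columns))
only_in_test = sorted(set(test_encoded.columns) - set(train_encoded.columns))

print(train_encoded.shape, test_encoded.shape)
print(only_in_train)
print(only_in_test)
```

The train output has three columns and the test output has two, so any downstream step that compares column counts will fail.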

To Reproduce
Steps to reproduce the behavior:

Expected behavior
OneHotEncoder should ensure that the transformed training and testing dataframes have the same columns, and therefore the same shape.

Proposed solution for Issue #1:

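  • Reindex the test dataset using the column names returned by get_feature_names_out(). Something like (here encoder stands for the fitted OneHotEncoder, since a dataframe has no get_feature_names_out() method):
expected_columns = encoder.get_feature_names_out()
X_test_prcsd = X_test_prcsd.reindex(columns=expected_columns, fill_value=0)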
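
A self-contained sketch of the reindex idea, using plain pandas. The two encoded dataframes are hand-written placeholders, and `X_train_prcsd.columns` stands in for whatever column names the encoder learned during fit; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical encoded outputs; in practice these would come from the encoder.
X_train_prcsd = pd.DataFrame(
    {"colour_red": [1, 0, 0], "colour_blue": [0, 1, 0], "colour_green": [0, 0, 1]}
)
X_test_prcsd = pd.DataFrame({"colour_red": [1, 0], "colour_blue": [0, 1]})

# Stand-in for the column names learned during fit.
expected_columns = list(X_train_prcsd.columns)

# Missing columns are created and filled with 0; column order follows train.
X_test_aligned = X_test_prcsd.reindex(columns=expected_columns, fill_value=0)

print(X_test_aligned)
```

Note that `reindex(columns=...)` also drops any column that is not in `expected_columns`, so it would silently discard columns created from categories seen only in the test set.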

Proposed solution for Issue #2:
Add a handle_unknown parameter. If the user sets it to "ignore", new categorical values in the test dataset are not encoded.
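
A rough sketch of what I mean by handle_unknown, written as a standalone pandas helper. `one_hot_known` and its signature are hypothetical and only illustrate the behaviour; this is not how feature-engine currently implements the encoder.

```python
import pandas as pd


def one_hot_known(
    series: pd.Series, categories: list, handle_unknown: str = "ignore"
) -> pd.DataFrame:
    """Hypothetical helper: one-hot encode `series` against categories learned at fit time."""
    if handle_unknown not in ("ignore", "error"):
        raise ValueError("handle_unknown must be 'ignore' or 'error'")
    unknown = set(series.dropna().unique()) - set(categories)
    if unknown and handle_unknown == "error":
        raise ValueError(f"Unknown categories: {sorted(unknown)}")
    # One column per known category; unseen values match none of them,
    # so their rows are all zeros and no new column is created.
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in categories},
        index=series.index,
    )


train = pd.Series(["red", "blue", "red"], name="colour")
test = pd.Series(["red", "purple"], name="colour")

categories = sorted(train.unique())  # categories learned from the training data
encoded_test = one_hot_known(test, categories, handle_unknown="ignore")

print(encoded_test)
```

With "ignore", the "purple" row is encoded as all zeros and the output keeps exactly the columns derived from the training categories; with "error", the same call raises a ValueError.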

Screenshots
feature-engine error that is returned:


    def _check_X_matches_training_df(X: pd.DataFrame, reference: int) -> None:
        """
        Checks that DataFrame to transform has the same number of columns that the
        DataFrame used with the fit() method.
    
        Parameters
        ----------
        X : Pandas DataFrame
            The df to be checked
        reference : int
            The number of columns in the dataframe that was used with the fit() method.
    
        Raises
        ------
        ValueError
            If the number of columns does not match.
    
        Returns
        -------
        None
        """
    
        if X.shape[1] != reference:
>           raise ValueError(
                "The number of columns in this dataset is different from the one used to "
                "fit this transformer (when using the fit() method)."
            )
E           ValueError: The number of columns in this dataset is different from the one used to fit this transformer (when using the fit() method).

venv/lib/python3.11/site-packages/feature_engine/dataframe_checks.py:239: ValueError

Desktop (please complete the following information):

  • OS: macOS
  • Browser: N/A
  • Version: Latest version

Additional context
feature-engine rulesss!
