Fix encoder error by converting NaN to placeholder strings in text features #152

Open
wants to merge 2 commits into
base: main
from

Conversation

kgovind0001

Issue #138: The _unique_python function throws a TypeError when the input values contain mixed data types that cannot be sorted together, which violates the encoder's requirement for uniform string or numeric inputs.

To handle mixed data types (strings and pd.NA) in the input data, the input X needs to be preprocessed so that each column is uniform before validation. This PR adds that preprocessing step.
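For illustration, here is a minimal sketch of the failure mode in plain Python. It does not call scikit-learn's internals: `unique_sorted` and `fill_missing` are hypothetical helpers, and a float NaN stands in for the missing-value marker (the pd.NA case is assumed to fail in an analogous way, since the values cannot be ordered against strings).

```python
import math


def unique_sorted(values):
    """Return the sorted unique values, as an encoder must do to build its categories."""
    return sorted(set(values))


def fill_missing(values, placeholder="__MISSING__"):
    """Replace float NaN entries with a string placeholder so the column is uniformly str."""
    return [placeholder if isinstance(v, float) and math.isnan(v) else v for v in values]


# A "string" column that also contains a non-string missing marker.
column = ["red", "blue", float("nan"), "red"]

try:
    unique_sorted(column)
    sort_failed = False
except TypeError:
    # str and float cannot be compared with "<", so sorting fails.
    sort_failed = True

# After filling, every value is a string and sorting works.
cleaned_categories = unique_sorted(fill_missing(column))
```

Once every entry is a string, the sort succeeds; the open question in the rest of the thread is what the placeholder should become after encoding.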

@LeoGrin
Collaborator

LeoGrin commented Jan 23, 2025

Thanks a lot for contributing! In addition to the failing tests, I think the NaNs should be kept or at least transformed in the same way as NaNs in int categorical columns, as TabPFN was trained with this logic.

@kgovind0001
Author

@LeoGrin Thank you very much for your comment, it was really useful. I looked into it and tried to convert the missing values in the same manner as for the int columns, but that did not work because all values in the column need to have the same type. The transformed value therefore has to be a string.

I am thinking of something like:

    if len(integer_columns) > 0:
        X[integer_columns] = X[integer_columns].astype(numeric_dtype)

    string_cols = X.select_dtypes(include=["string"]).columns
    if len(string_cols) > 0:
        X[string_cols] = X[string_cols].astype('string').fillna("None")

I tried X[string_cols] = X[string_cols].astype('string').fillna(np.nan) but it did not work.

@LeoGrin Shall I proceed with the above solution for the issue?

@LeoGrin
Collaborator

LeoGrin commented Jan 24, 2025

@kgovind0001 Thanks for your answer. What I mean is that you need to make sure that missing values end up being encoded in the same way, just before the actual TabPFN forward pass. With your current solution I think missing values would be encoded as just another category when the type is string, which is different from what is happening if the category has type int.
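To make the difference concrete, here is a small pure-Python sketch. `ordinal_encode` is a hypothetical stand-in for an ordinal encoder, not TabPFN's or scikit-learn's actual code; it only shows that a string placeholder such as "None" receives an ordinary category code, while a numeric column keeps its NaN.

```python
import math


def ordinal_encode(values):
    """Map each distinct value to its index in sorted order (hypothetical stand-in encoder)."""
    categories = sorted(set(values))
    lookup = {category: float(index) for index, category in enumerate(categories)}
    return [lookup[value] for value in values]


# String column where the missing entry was filled with the string "None".
string_column = ["cat", "None", "dog"]
encoded_strings = ordinal_encode(string_column)

# Integer categorical column cast to float: the missing entry simply stays NaN.
int_column = [2.0, float("nan"), 5.0]

# The filled entry is indistinguishable from a genuine category code ...
placeholder_code = encoded_strings[1]
# ... whereas the numeric column still carries an explicit missing marker.
int_missing_preserved = math.isnan(int_column[1])
```

In this sketch the model would see code 0.0 for the missing string entry, as if it were a real category, but NaN for the missing integer entry.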

@kgovind0001
Author

@LeoGrin Could you please elaborate on what that means? Missing values will indeed become their own category, since a string column cannot be processed in the same way as a numerical feature. What would the expected behavior be here?

@noahho
Collaborator

noahho commented Feb 7, 2025

Thanks for the updates, @kgovind0001 and @LeoGrin.

Sklearn expects no NaNs in string columns: NaNs are numbers, so the column would have mixed dtypes. However, since the strings are later converted to categorical integers (at least in the local version), a NaN could be set after all, and it would be slightly better for TabPFN to know there was a NaN rather than to see a new category.

To ensure consistent handling of NaN in string columns (similar to integer categorical features), I suggest using a temporary placeholder:

    if len(integer_columns) > 0:
        X[integer_columns] = X[integer_columns].astype(numeric_dtype)

    string_cols = X.select_dtypes(include=["string", "object"]).columns
    if len(string_cols) > 0:
        placeholder = "__MISSING__"
        X[string_cols] = X[string_cols].fillna(placeholder)

        X_encoded = _get_ordinal_encoder().fit_transform(X)
        X_encoded = np.where(X[string_cols] == placeholder, np.nan, X_encoded)

Something like this?
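A self-contained sketch of the placeholder round trip, written in pure Python so it runs without pandas or scikit-learn. `ordinal_encode` and `encode_string_column` are hypothetical helpers and `None` stands in for the missing marker; the point is only that the missing-value mask has to be recorded per column and applied to the matching encoded column, so that mask and encoded values always have the same shape.

```python
import math

PLACEHOLDER = "__MISSING__"  # assumed sentinel; must not collide with a real category


def ordinal_encode(values):
    """Map each distinct value to its index in sorted order (hypothetical stand-in encoder)."""
    categories = sorted(set(values))
    lookup = {category: float(index) for index, category in enumerate(categories)}
    return [lookup[value] for value in values]


def encode_string_column(values):
    """Encode one string column, restoring NaN wherever the input was missing."""
    # Record where values are missing before anything is overwritten.
    missing_mask = [value is None for value in values]
    # Fill with the placeholder so the encoder sees a uniformly-str column.
    filled = [PLACEHOLDER if missing else value for value, missing in zip(values, missing_mask)]
    codes = ordinal_encode(filled)
    # Put NaN back using the mask of this same column.
    return [math.nan if missing else code for code, missing in zip(codes, missing_mask)]


table = {"color": ["red", None, "blue"], "size": ["S", "M", None]}
encoded = {name: encode_string_column(column) for name, column in table.items()}
```

One consequence of encoding the placeholder as a category and masking it afterwards is that its code goes unused: in the `color` column above, the remaining codes start at 1.0 rather than 0.0. Whether that matters for TabPFN is not settled in this thread.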
