Fix encoder error by converting NaN to placeholder strings in text features #152

kgovind0001 · 2025-01-22T12:34:40Z

Issue #138: The `_unique_python` function throws a TypeError when the input values contain mixed data types that cannot be sorted together, which breaks the encoder's requirement for uniform string or numeric inputs.

To handle mixed data types (strings and pd.NA) in input data, we need to preprocess the input X to ensure uniformity before validation. This has been added now.
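For illustration, the failure mode can be reproduced without the encoder. This is a minimal sketch that assumes the uniqueness step sorts the column values; it uses plain `sorted` and a float NaN in place of `pd.NA`, and the helper `try_sort` is hypothetical, not part of TabPFN or scikit-learn.

```python
def try_sort(values):
    """Return the sorted list, or the TypeError message if the values cannot be ordered."""
    try:
        return sorted(values)
    except TypeError as exc:
        return f"TypeError: {exc}"


# A string column with a missing value stored as a float NaN: str and float cannot be compared.
mixed = ["red", float("nan"), "blue"]

# The same column after the missing value is replaced by a placeholder string.
uniform = ["red", "__MISSING__", "blue"]

result_mixed = try_sort(mixed)      # a "TypeError: ..." message
result_uniform = try_sort(uniform)  # a sorted list of strings

print(result_mixed)
print(result_uniform)
```

Once every value is a string, the sort succeeds, which is why the preprocessing makes the column uniform before validation.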

LeoGrin · 2025-01-23T09:53:29Z

Thanks a lot for contributing! In addition to the failing tests, I think the NaNs should be kept or at least transformed in the same way as NaNs in int categorical columns, as TabPFN was trained with this logic.

kgovind0001 · 2025-01-23T14:00:39Z

@LeoGrin Thank you very much for your comment, it was really useful. I had a look at that. I tried to convert the NaNs in the same manner as for the int columns, but it did not work because all values in the column need to have the same type. The transformed value has to be a string.

I am thinking something like

    if len(integer_columns) > 0:
        X[integer_columns] = X[integer_columns].astype(numeric_dtype)

    string_cols = X.select_dtypes(include=["string"]).columns
    if len(string_cols) > 0:
        X[string_cols] = X[string_cols].astype('string').fillna("None")

I tried X[string_cols] = X[string_cols].astype('string').fillna(np.nan) but it did not work.

@LeoGrin Shall I proceed with the above solution for the issue?

LeoGrin · 2025-01-24T08:52:57Z

@kgovind0001 Thanks for your answer. What I mean is that you need to make sure that missing values end up being encoded in the same way, just before the actual TabPFN forward pass. With your current solution I think missing values would be encoded as just another category when the type is string, which is different from what is happening if the category has type int.
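The difference described here can be sketched with pandas alone. This example is an assumption-laden illustration: it uses pandas categorical codes as a stand-in for the ordinal encoder, not TabPFN's actual preprocessing, and the column values are made up.

```python
import pandas as pd

colors = pd.Series(["red", None, "blue", "red"], dtype=object)

# Option A: fill the missing value with a literal string before encoding.
# The fill value becomes one more category with its own non-negative code.
codes_filled = colors.fillna("None").astype("category").cat.codes.tolist()

# Option B: leave the value missing. pandas marks missing entries with code -1,
# so they stay distinguishable from every real category.
codes_missing = colors.astype("category").cat.codes.tolist()

print(codes_filled)
print(codes_missing)
```

In option A the model cannot tell the filled value apart from a genuine category, whereas in option B the missing entry can still be mapped to NaN downstream.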

kgovind0001 · 2025-02-06T22:00:01Z

@LeoGrin Can you please elaborate on what that means? The category will indeed have a missing value, since it cannot be processed in the same way as a numerical feature. What would be the expected behavior here?

noahho · 2025-02-07T11:15:41Z

Thanks for the updates, @kgovind0001 and @LeoGrin.

Sklearn expects no NaNs in string columns: NaN is a float, so a string column containing NaN has mixed dtypes. However, since the strings are later converted to categorical integers (at least in the local version), a NaN could be restored at that point, and it would be slightly better for TabPFN to see a NaN rather than a new category.

To ensure consistent handling of NaN in string columns (similar to integer categorical features), I suggest using a temporary placeholder:

    if len(integer_columns) > 0:
        X[integer_columns] = X[integer_columns].astype(numeric_dtype)

    string_cols = X.select_dtypes(include=["string", "object"]).columns
    if len(string_cols) > 0:
        placeholder = "__MISSING__"
        X[string_cols] = X[string_cols].fillna(placeholder)

        X_encoded = _get_ordinal_encoder().fit_transform(X)
        X_encoded = np.where(X[string_cols] == placeholder, np.nan, X_encoded)

Something like this?
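A self-contained version of the placeholder round trip might look like the sketch below. It rests on several assumptions: it uses scikit-learn's public `OrdinalEncoder` rather than TabPFN's internal `_get_ordinal_encoder()`, it encodes only the string columns so that the missing-value mask has the same shape as the encoded block, and the function name `encode_strings_keep_nan` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sentinel taken from the suggestion above; assumed not to occur in real data.
PLACEHOLDER = "__MISSING__"


def encode_strings_keep_nan(X: pd.DataFrame, string_cols: list[str]) -> pd.DataFrame:
    """Ordinal-encode string_cols and map originally missing cells back to NaN."""
    X = X.copy()
    # Record missing cells before filling, so the mask aligns with the encoded block.
    missing = X[string_cols].isna().to_numpy()
    # Fill with the placeholder so every value is a string and can be sorted.
    filled = X[string_cols].astype(object).fillna(PLACEHOLDER).to_numpy(dtype=object)
    encoded = OrdinalEncoder().fit_transform(filled)  # float array of category codes
    # Restore NaN where the original value was missing.
    encoded[missing] = np.nan
    for i, col in enumerate(string_cols):
        X[col] = encoded[:, i]
    return X


X = pd.DataFrame(
    {
        "city": pd.Series(["paris", pd.NA, "rome", "paris"], dtype=object),
        "size": [1.0, 2.0, 3.0, 4.0],
    }
)
out = encode_strings_keep_nan(X, ["city"])
print(out)
```

One caveat of this approach: the placeholder still occupies a slot in the fitted encoder's categories, so the codes of the real categories may be shifted compared with encoding a column that had no missing values.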

Fix encoder error by converting NaN to placeholder strings in text features

6114e7b

kgovind0001 mentioned this pull request Jan 22, 2025

TabPFN fails on text with NA #138

Open

add double quotes

ff45d95
