bug: arrow type error when show data with 'UUID' object #8532

Open
1 task done
jitingxu1 opened this issue Mar 4, 2024 · 8 comments · May be fixed by #8538
Assignees
Labels
bug Incorrect behavior inside of ibis

Comments

@jitingxu1
Contributor

What happened?

Issue 1: duckdb produces a different UUID for each row, but sqlite generates the same UUID for every row. Other backends may have the same issue.

import ibis
ibis.options.interactive = True
from ibis.expr.api import row_number, uuid, now, pi

ibis.set_backend("sqlite")
t = ibis.examples.penguins.fetch()
t.mutate(uuid=ibis.uuid()).to_pandas()

Issue 2: an ArrowTypeError is raised when displaying the data:

import ibis
ibis.options.interactive = True
from ibis.expr.api import row_number, uuid, now, pi

ibis.set_backend("sqlite")
t = ibis.examples.penguins.fetch()
t1 = t.mutate(my_uuid=uuid())
t1[t1.my_uuid].head()

Got the following error:

Out[7]: ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/voltrondata/repos/ibis/ibis/expr/types/relations.py:516 in __interactive_rich_console__   │
│                                                                                                  │
│    513 │   │   │   width = options.max_width                                                     │
│    514 │   │                                                                                     │
│    515 │   │   try:                                                                              │
│ ❱  516 │   │   │   table = to_rich_table(self, width)                                            │
│    517 │   │   except Exception as e:                                                            │
│    518 │   │   │   # In IPython exceptions inside of _repr_mimebundle_ are swallowed to          │
│    519 │   │   │   # allow calling several display functions and choosing to display             │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/expr/types/pretty.py:265 in to_rich_table                     │
│                                                                                                  │
│   262 │                                                                                          │
│   263 │   # Compute the data and return a pandas dataframe                                       │
│   264 │   nrows = ibis.options.repr.interactive.max_rows                                         │
│ ❱ 265 │   result = table.limit(nrows + 1).to_pyarrow()                                           │
│   266 │                                                                                          │
│   267 │   # Now format the columns in order, stopping if the console width would                 │
│   268 │   # be exceeded.                                                                         │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/expr/types/core.py:425 in to_pyarrow                          │
│                                                                                                  │
│   422 │   │   Table                                                                              │
│   423 │   │   │   A pyarrow table holding the results of the executed expression.                │
│   424 │   │   """                                                                                │
│ ❱ 425 │   │   return self._find_backend(use_default=True).to_pyarrow(                            │
│   426 │   │   │   self, params=params, limit=limit, **kwargs                                     │
│   427 │   │   )                                                                                  │
│   428                                                                                            │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/backends/__init__.py:218 in to_pyarrow                        │
│                                                                                                  │
│    215 │   │   table_expr = expr.as_table()                                                      │
│    216 │   │   schema = table_expr.schema()                                                      │
│    217 │   │   arrow_schema = schema.to_pyarrow()                                                │
│ ❱  218 │   │   with self.to_pyarrow_batches(                                                     │
│    219 │   │   │   table_expr, params=params, limit=limit, **kwargs                              │
│    220 │   │   ) as reader:                                                                      │
│    221 │   │   │   table = pa.Table.from_batches(reader, schema=arrow_schema)                    │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/backends/sqlite/__init__.py:264 in to_pyarrow_batches         │
│                                                                                                  │
│   261 │   │   │   self.compile(expr, limit=limit, params=params)                                 │
│   262 │   │   ) as cursor:                                                                       │
│   263 │   │   │   df = self._fetch_from_cursor(cursor, schema)                                   │
│ ❱ 264 │   │   table = pa.Table.from_pandas(                                                      │
│   265 │   │   │   df, schema=schema.to_pyarrow(), preserve_index=False                           │
│   266 │   │   )                                                                                  │
│   267 │   │   return table.to_reader(max_chunksize=chunk_size)                                   │
│                                                                                                  │
│ in pyarrow.lib.Table.from_pandas:3874                                                            │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:611 in dataframe_to_arrays                                                                   │
│                                                                                                  │
│    608 │   │   │   │   issubclass(arr.dtype.type, np.integer))                                   │
│    609 │                                                                                         │
│    610 │   if nthreads == 1:                                                                     │
│ ❱  611 │   │   arrays = [convert_column(c, f)                                                    │
│    612 │   │   │   │     for c, f in zip(columns_to_convert, convert_fields)]                    │
│    613 │   else:                                                                                 │
│    614 │   │   arrays = []                                                                       │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:611 in <listcomp>                                                                            │
│                                                                                                  │
│    608 │   │   │   │   issubclass(arr.dtype.type, np.integer))                                   │
│    609 │                                                                                         │
│    610 │   if nthreads == 1:                                                                     │
│ ❱  611 │   │   arrays = [convert_column(c, f)                                                    │
│    612 │   │   │   │     for c, f in zip(columns_to_convert, convert_fields)]                    │
│    613 │   else:                                                                                 │
│    614 │   │   arrays = []                                                                       │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:598 in convert_column                                                                        │
│                                                                                                  │
│    595 │   │   │   │   pa.ArrowTypeError) as e:                                                  │
│    596 │   │   │   e.args += ("Conversion failed for column {!s} with type {!s}"                 │
│    597 │   │   │   │   │      .format(col.name, col.dtype),)                                     │
│ ❱  598 │   │   │   raise e                                                                       │
│    599 │   │   if not field_nullable and result.null_count > 0:                                  │
│    600 │   │   │   raise ValueError("Field {} was non-nullable but pandas column "               │
│    601 │   │   │   │   │   │   │    "had {} null values".format(str(field),                      │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:592 in convert_column                                                                        │
│                                                                                                  │
│    589 │   │   │   type_ = field.type                                                            │
│    590 │   │                                                                                     │
│    591 │   │   try:                                                                              │
│ ❱  592 │   │   │   result = pa.array(col, type=type_, from_pandas=True, safe=safe)               │
│    593 │   │   except (pa.ArrowInvalid,                                                          │
│    594 │   │   │   │   pa.ArrowNotImplementedError,                                              │
│    595 │   │   │   │   pa.ArrowTypeError) as e:                                                  │
│                                                                                                  │
│ in pyarrow.lib.array:340                                                                         │
│                                                                                                  │
│ in pyarrow.lib._ndarray_to_array:86                                                              │
│                                                                                                  │
│ in pyarrow.lib.check_status:91                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ArrowTypeError: ("Expected bytes, got a 'UUID' object", 'Conversion failed for column my_uuid with type
object')

It works fine with to_pandas():

In [8]: t1[t1.my_uuid].to_pandas()
Out[8]:
                                  my_uuid
0    3f661a76-2d0e-4622-862e-1c4adcfd4813
1    3f661a76-2d0e-4622-862e-1c4adcfd4813
2    3f661a76-2d0e-4622-862e-1c4adcfd4813
3    3f661a76-2d0e-4622-862e-1c4adcfd4813
4    3f661a76-2d0e-4622-862e-1c4adcfd4813
..                                    ...
339  3f661a76-2d0e-4622-862e-1c4adcfd4813
340  3f661a76-2d0e-4622-862e-1c4adcfd4813
341  3f661a76-2d0e-4622-862e-1c4adcfd4813
342  3f661a76-2d0e-4622-862e-1c4adcfd4813
343  3f661a76-2d0e-4622-862e-1c4adcfd4813

What version of ibis are you using?

8.0.0

What backend(s) are you using, if any?

duckdb, sqlite

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jcrist
Member

jcrist commented Mar 4, 2024

Thanks for opening this. Issue 1 should be fixed in #8535.

Issue 2 is due to the to_pyarrow conversion path in sqlite (and a few other backends) going dbapi row -> pandas -> pyarrow. When returning pandas dataframes from to_pandas we currently map a UUID column to an object dtype series of uuid.UUID objects (and these objects fail when converting to pyarrow). In contrast, for to_pyarrow we return a string column with the same data.

The easiest (and I think most consistent) fix would be to stop returning uuid columns in to_pandas as uuid.UUID values and instead treat them as strings. This matches what we do for both polars and pyarrow outputs. It's also more efficient for the user since they no longer get an object dtype series of uuid.UUID objects in the output.
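
A minimal sketch of the string-conversion idea described above. This is not the actual Ibis change: `stringify_uuids` is a hypothetical helper used only for illustration. It turns `uuid.UUID` objects into their canonical string form before the values reach pyarrow. The pyarrow section is optional and only runs when pyarrow is installed.

```python
import uuid


def stringify_uuids(values):
    """Hypothetical helper: convert uuid.UUID entries to canonical strings.

    None and any non-UUID values are passed through unchanged.
    """
    return [str(v) if isinstance(v, uuid.UUID) else v for v in values]


# A column as it might come back from a DB-API cursor: UUID objects plus a null.
column = [uuid.uuid4(), None, uuid.uuid4()]
as_strings = stringify_uuids(column)

try:
    import pyarrow as pa
except ImportError:  # pyarrow is optional for this sketch
    pa = None

if pa is not None:
    try:
        # Raw uuid.UUID objects are not accepted for an Arrow string column.
        pa.array(column, type=pa.string())
    except (pa.ArrowTypeError, pa.ArrowInvalid) as exc:
        print("raw UUID objects rejected:", exc)

    # The stringified values convert cleanly, with the null preserved.
    arr = pa.array(as_strings, type=pa.string())
    print(arr.type, arr.null_count)
```

The conversion is lossless in the sense that `uuid.UUID(s)` recovers the original value from each string, but the column no longer carries a UUID type.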

cc @cpcloud for a 👍 / 👎 before I implement this fix.

@kszucs
Member

kszucs commented Mar 4, 2024

Eventually we should simplify the pandas output path to just .to_pyarrow().to_pandas(), offloading all the conversion duties to Arrow. So it is a +1 from me.

kszucs pushed a commit that referenced this issue Mar 4, 2024
@jcrist jcrist self-assigned this Mar 4, 2024
@jcrist jcrist moved this from backlog to cooking in Ibis planning and roadmap Mar 4, 2024
@cpcloud
Member

cpcloud commented Mar 4, 2024

Seems fine. I don't like that we have to do this but the alternative of implementing a custom pyarrow type seems less desirable than converting to strings.

@cpcloud
Member

cpcloud commented Aug 10, 2024

The repeated UUID issue has been addressed:

In [4]: import ibis
   ...: ibis.options.interactive = True
   ...: from ibis.expr.api import row_number, uuid, now, pi
   ...:
   ...: ibis.set_backend("sqlite")
   ...: t = ibis.examples.penguins.fetch()
   ...: t.mutate(uuid=ibis.uuid()).to_pandas()
Out[4]:
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year                                  uuid
0       Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007  f3102c7e-167c-4854-af20-d3729580e2cc
1       Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007  86afae37-f0e5-48d7-ba13-aa701374d4cd
2       Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007  665cbe36-d7b7-4a7e-bd5f-ebf0c043c72f
3       Adelie  Torgersen             NaN            NaN                NaN          NaN    None  2007  a740b304-0a13-4f89-bdb4-fa9475f2daa4
4       Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007  a8263a30-9cbb-4175-94e9-1429a6fdb0fa
..         ...        ...             ...            ...                ...          ...     ...   ...                                   ...
339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009  112e7fcc-bf14-4177-96ee-526d9343c368
340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009  c5964727-4dc0-42dd-8039-1527bd37b673
341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009  a3d1a137-5847-4309-90c4-59d0f8fe35f9
342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009  33da8b0b-d368-442c-ba05-44daa037b1e0
343  Chinstrap      Dream            50.2           18.7              198.0       3775.0  female  2009  de5b1031-de4a-4c13-85a6-920e4741f922

@cpcloud cpcloud changed the title bug: same uuid for all rows and got arrow type error when show data with 'UUID' object bug: arrow type error when show data with 'UUID' object Aug 10, 2024
@double-thinker

double-thinker commented Sep 2, 2024

I think there is a simple and self-contained solution to UUID types and any other type that does not have a 1:1 mapping between Ibis and PyArrow, such as MAC addresses, UUIDs, etc.

Let me know what you think. I am willing to add tests and make a PR if this approach makes sense.

This commit only checks for bijectivity between Ibis and PyArrow types, and if any column of the schema does not comply, then it is cast server-side.

There are some pros:

  • Ibis does not need to handle special cases backend by backend. PyArrowType mappings define this behavior for all backends that use PyArrow conversion.
  • It could easily extend to future types.
  • "Type-loss" conversions could be cast into other types that are not dt.string, although currently the only relevant ones are mapped to string.
  • Sequential expr.to_pyarrow().to_X() will be consistent with expr.to_X().
  • It does not depend on pandas types, which helps make Ibis more "pandas-agnostic"; that seems important based on the conversations in feat: support UUIDs to pyarrow on more backends #8901.
  • This affects only operations that previously would fail. It does not change any defined behavior.

Cons:

  • The cast is implicit and happens without a warning, although no data is lost.
  • It does not keep any metadata if the user converts them back to an Ibis table. I know there are some conversations about extending types with metadata, so I guess there is no "right way" to do this yet.
  • The cast is done server-side, which could be unexpected sometimes, but IMO this is a pro since Ibis is agnostic to the backend: if there is no explicit 1:1 conversion to a PyArrow type, then the backend's casting is the best (maybe only?) defined behavior we can rely on.
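
To make the bijectivity check concrete, here is a toy sketch of the idea, not the code from the commit. The type tables are hypothetical, use plain strings in place of real Ibis and PyArrow type objects, and do not reflect Ibis's actual mappings. A column whose type does not survive an Ibis -> Arrow -> Ibis round trip is scheduled for a cast to its Arrow-side type.

```python
# Hypothetical, simplified type tables. These are NOT Ibis's real mappings.
IBIS_TO_ARROW = {
    "int64": "int64",
    "string": "string",
    "uuid": "string",     # no dedicated Arrow type in this toy table
    "macaddr": "string",  # same: collapses onto string
}
ARROW_TO_IBIS = {
    "int64": "int64",
    "string": "string",
}


def round_trips(ibis_type):
    """True if Ibis -> Arrow -> Ibis gives back the same type."""
    arrow_type = IBIS_TO_ARROW[ibis_type]
    return ARROW_TO_IBIS.get(arrow_type) == ibis_type


def plan_casts(schema):
    """Map column name -> cast target for columns that do not round-trip."""
    return {
        name: IBIS_TO_ARROW[ibis_type]
        for name, ibis_type in schema.items()
        if not round_trips(ibis_type)
    }


schema = {"species": "string", "year": "int64", "my_uuid": "uuid"}
casts = plan_casts(schema)
print(casts)  # only my_uuid needs a cast in this toy schema
```

In the proposal, the resulting casts would be applied to the expression before execution so that the backend performs them; that step is omitted here.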

@double-thinker

@cpcloud Do you think the approach I developed is worthwhile? I can submit a PR in the following days if there's interest.

@kylebarron

At least for UUID, since Arrow has a canonical extension type for it, it seems like that would be a great way to maintain the type information.
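
For illustration, a small sketch of what the extension-type route could look like. The canonical `arrow.uuid` extension type uses 16-byte fixed-size binary storage, so the first part below only converts between `uuid.UUID` and raw bytes. The helper names `to_storage` and `from_storage` are hypothetical. The pyarrow part is optional, and `pa.uuid()` is only available in newer pyarrow releases, so it is guarded.

```python
import uuid


def to_storage(values):
    """Hypothetical helper: uuid.UUID -> 16 raw bytes (None stays None)."""
    return [v.bytes if v is not None else None for v in values]


def from_storage(raw):
    """Hypothetical helper: 16 raw bytes -> uuid.UUID (None stays None)."""
    return [uuid.UUID(bytes=b) if b is not None else None for b in raw]


ids = [uuid.uuid4(), None, uuid.uuid4()]
raw = to_storage(ids)
restored = from_storage(raw)

try:
    import pyarrow as pa
except ImportError:  # pyarrow is optional for this sketch
    pa = None

if pa is not None:
    # Fixed-size binary array, 16 bytes per value, nulls preserved.
    storage = pa.array(raw, type=pa.binary(16))
    if hasattr(pa, "uuid"):  # older pyarrow versions lack pa.uuid()
        uuid_array = pa.ExtensionArray.from_storage(pa.uuid(), storage)
        print(uuid_array.type)
```

Unlike the string conversion, this keeps the information that the column holds UUIDs, at the cost of requiring a pyarrow version that knows the extension type.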

