bug: arrow type error when show data with 'UUID' object #8532

jitingxu1 · 2024-03-04T00:42:28Z

What happened?

Issue 1: duckdb produces a different UUID for each row, but sqlite generates the same UUID for every row. Other backends may have the same issue.

import ibis
ibis.options.interactive = True
from ibis.expr.api import row_number, uuid, now, pi

ibis.set_backend("sqlite")
t = ibis.examples.penguins.fetch()
t.mutate(uuid=ibis.uuid()).to_pandas()

Issue 2: an ArrowTypeError is raised when displaying the data:

import ibis
ibis.options.interactive = True
from ibis.expr.api import row_number, uuid, now, pi

ibis.set_backend("sqlite")
t = ibis.examples.penguins.fetch()
t1 = t.mutate(my_uuid=uuid())
t1[t1.my_uuid].head()

Got the following error:

Out[7]: ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/voltrondata/repos/ibis/ibis/expr/types/relations.py:516 in __interactive_rich_console__   │
│                                                                                                  │
│    513 │   │   │   width = options.max_width                                                     │
│    514 │   │                                                                                     │
│    515 │   │   try:                                                                              │
│ ❱  516 │   │   │   table = to_rich_table(self, width)                                            │
│    517 │   │   except Exception as e:                                                            │
│    518 │   │   │   # In IPython exceptions inside of _repr_mimebundle_ are swallowed to          │
│    519 │   │   │   # allow calling several display functions and choosing to display             │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/expr/types/pretty.py:265 in to_rich_table                     │
│                                                                                                  │
│   262 │                                                                                          │
│   263 │   # Compute the data and return a pandas dataframe                                       │
│   264 │   nrows = ibis.options.repr.interactive.max_rows                                         │
│ ❱ 265 │   result = table.limit(nrows + 1).to_pyarrow()                                           │
│   266 │                                                                                          │
│   267 │   # Now format the columns in order, stopping if the console width would                 │
│   268 │   # be exceeded.                                                                         │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/expr/types/core.py:425 in to_pyarrow                          │
│                                                                                                  │
│   422 │   │   Table                                                                              │
│   423 │   │   │   A pyarrow table holding the results of the executed expression.                │
│   424 │   │   """                                                                                │
│ ❱ 425 │   │   return self._find_backend(use_default=True).to_pyarrow(                            │
│   426 │   │   │   self, params=params, limit=limit, **kwargs                                     │
│   427 │   │   )                                                                                  │
│   428                                                                                            │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/backends/__init__.py:218 in to_pyarrow                        │
│                                                                                                  │
│    215 │   │   table_expr = expr.as_table()                                                      │
│    216 │   │   schema = table_expr.schema()                                                      │
│    217 │   │   arrow_schema = schema.to_pyarrow()                                                │
│ ❱  218 │   │   with self.to_pyarrow_batches(                                                     │
│    219 │   │   │   table_expr, params=params, limit=limit, **kwargs                              │
│    220 │   │   ) as reader:                                                                      │
│    221 │   │   │   table = pa.Table.from_batches(reader, schema=arrow_schema)                    │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/backends/sqlite/__init__.py:264 in to_pyarrow_batches         │
│                                                                                                  │
│   261 │   │   │   self.compile(expr, limit=limit, params=params)                                 │
│   262 │   │   ) as cursor:                                                                       │
│   263 │   │   │   df = self._fetch_from_cursor(cursor, schema)                                   │
│ ❱ 264 │   │   table = pa.Table.from_pandas(                                                      │
│   265 │   │   │   df, schema=schema.to_pyarrow(), preserve_index=False                           │
│   266 │   │   )                                                                                  │
│   267 │   │   return table.to_reader(max_chunksize=chunk_size)                                   │
│                                                                                                  │
│ in pyarrow.lib.Table.from_pandas:3874                                                            │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:611 in dataframe_to_arrays                                                                   │
│                                                                                                  │
│    608 │   │   │   │   issubclass(arr.dtype.type, np.integer))                                   │
│    609 │                                                                                         │
│    610 │   if nthreads == 1:                                                                     │
│ ❱  611 │   │   arrays = [convert_column(c, f)                                                    │
│    612 │   │   │   │     for c, f in zip(columns_to_convert, convert_fields)]                    │
│    613 │   else:                                                                                 │
│    614 │   │   arrays = []                                                                       │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:611 in <listcomp>                                                                            │
│                                                                                                  │
│    608 │   │   │   │   issubclass(arr.dtype.type, np.integer))                                   │
│    609 │                                                                                         │
│    610 │   if nthreads == 1:                                                                     │
│ ❱  611 │   │   arrays = [convert_column(c, f)                                                    │
│    612 │   │   │   │     for c, f in zip(columns_to_convert, convert_fields)]                    │
│    613 │   else:                                                                                 │
│    614 │   │   arrays = []                                                                       │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:598 in convert_column                                                                        │
│                                                                                                  │
│    595 │   │   │   │   pa.ArrowTypeError) as e:                                                  │
│    596 │   │   │   e.args += ("Conversion failed for column {!s} with type {!s}"                 │
│    597 │   │   │   │   │      .format(col.name, col.dtype),)                                     │
│ ❱  598 │   │   │   raise e                                                                       │
│    599 │   │   if not field_nullable and result.null_count > 0:                                  │
│    600 │   │   │   raise ValueError("Field {} was non-nullable but pandas column "               │
│    601 │   │   │   │   │   │   │    "had {} null values".format(str(field),                      │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:592 in convert_column                                                                        │
│                                                                                                  │
│    589 │   │   │   type_ = field.type                                                            │
│    590 │   │                                                                                     │
│    591 │   │   try:                                                                              │
│ ❱  592 │   │   │   result = pa.array(col, type=type_, from_pandas=True, safe=safe)               │
│    593 │   │   except (pa.ArrowInvalid,                                                          │
│    594 │   │   │   │   pa.ArrowNotImplementedError,                                              │
│    595 │   │   │   │   pa.ArrowTypeError) as e:                                                  │
│                                                                                                  │
│ in pyarrow.lib.array:340                                                                         │
│                                                                                                  │
│ in pyarrow.lib._ndarray_to_array:86                                                              │
│                                                                                                  │
│ in pyarrow.lib.check_status:91                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ArrowTypeError: ("Expected bytes, got a 'UUID' object", 'Conversion failed for column my_uuid with type object')

It works fine with to_pandas():

In [8]: t1[t1.my_uuid].to_pandas()
Out[8]:
                                  my_uuid
0    3f661a76-2d0e-4622-862e-1c4adcfd4813
1    3f661a76-2d0e-4622-862e-1c4adcfd4813
2    3f661a76-2d0e-4622-862e-1c4adcfd4813
3    3f661a76-2d0e-4622-862e-1c4adcfd4813
4    3f661a76-2d0e-4622-862e-1c4adcfd4813
..                                    ...
339  3f661a76-2d0e-4622-862e-1c4adcfd4813
340  3f661a76-2d0e-4622-862e-1c4adcfd4813
341  3f661a76-2d0e-4622-862e-1c4adcfd4813
342  3f661a76-2d0e-4622-862e-1c4adcfd4813
343  3f661a76-2d0e-4622-862e-1c4adcfd4813

What version of ibis are you using?

8.0.0

What backend(s) are you using, if any?

duckdb, sqlite

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

jcrist · 2024-03-04T04:10:04Z

Thanks for opening this. Issue 1 should be fixed in #8535.

Issue 2 is due to the to_pyarrow conversion path in sqlite (and a few other backends) going dbapi row -> pandas -> pyarrow. When returning pandas dataframes from to_pandas we currently map a UUID column to an object dtype series of uuid.UUID objects (and these objects fail when converting to pyarrow). In contrast, for to_pyarrow we return a string column with the same data.

The easiest (and I think most consistent) fix would be to stop returning uuid columns in to_pandas as uuid.UUID values and instead treat them as strings. This matches what we do for both polars and pyarrow outputs. It's also more efficient for the user, since they no longer get an object dtype series in the output.

cc @cpcloud for a 👍 / 👎 before I implement this fix.

kszucs · 2024-03-04T07:59:31Z

Eventually we should simplify the pandas output path down to .to_pyarrow().to_pandas(), offloading all the conversion duties to arrow. So it is a +1 from me.

cpcloud · 2024-03-04T18:40:22Z

Seems fine. I don't like that we have to do this but the alternative of implementing a custom pyarrow type seems less desirable than converting to strings.

cpcloud · 2024-08-10T15:07:08Z

The repeated UUID issue has been addressed:

In [4]: import ibis
   ...: ibis.options.interactive = True
   ...: from ibis.expr.api import row_number, uuid, now, pi
   ...:
   ...: ibis.set_backend("sqlite")
   ...: t = ibis.examples.penguins.fetch()
   ...: t.mutate(uuid=ibis.uuid()).to_pandas()
Out[4]:
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year                                  uuid
0       Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007  f3102c7e-167c-4854-af20-d3729580e2cc
1       Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007  86afae37-f0e5-48d7-ba13-aa701374d4cd
2       Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007  665cbe36-d7b7-4a7e-bd5f-ebf0c043c72f
3       Adelie  Torgersen             NaN            NaN                NaN          NaN    None  2007  a740b304-0a13-4f89-bdb4-fa9475f2daa4
4       Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007  a8263a30-9cbb-4175-94e9-1429a6fdb0fa
..         ...        ...             ...            ...                ...          ...     ...   ...                                   ...
339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009  112e7fcc-bf14-4177-96ee-526d9343c368
340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009  c5964727-4dc0-42dd-8039-1527bd37b673
341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009  a3d1a137-5847-4309-90c4-59d0f8fe35f9
342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009  33da8b0b-d368-442c-ba05-44daa037b1e0
343  Chinstrap      Dream            50.2           18.7              198.0       3775.0  female  2009  de5b1031-de4a-4c13-85a6-920e4741f922

double-thinker · 2024-09-02T19:44:23Z

I think there is a simple and self-contained solution to UUID types and any other type that does not have a 1:1 mapping between Ibis and PyArrow, such as MAC addresses, UUIDs, etc.

Let me know what you think. I am willing to add tests and make a PR if this approach makes sense.

This commit only checks for bijectivity between Ibis and PyArrow types, and if any column of the schema does not comply, then it is cast server-side.

There are some pros:

Ibis does not need to handle special cases backend by backend. PyArrowType mappings define this behavior for all backends that use PyArrow conversion.
It could easily extend to future types.
"Type-loss" conversions could be cast into other types that are not dt.string, although currently the only relevant ones are mapped to string.
Sequential expr.to_pyarrow().to_X() will be consistent with expr.to_X().
It does not require pandas types. Making Ibis more "pandas-agnostic" seems important based on the conversations in feat: support UUIDs to pyarrow on more backends #8901.
This affects only operations that previously would fail. It does not change any defined behavior.

Cons:

The cast is implicit and happens without a warning, although there is no data loss.
It does not keep any metadata if the user converts them back to an Ibis table. I know there are some conversations about extending types with metadata, so I guess there is no "right way" to do this yet.
The cast is done server-side, which could be unexpected sometimes, but IMO this is a pro since Ibis is agnostic to the backend: if there is no explicit 1:1 conversion to a PyArrow type, then the backend's casting is the best (maybe only?) defined behavior we can rely on.
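
The round-trip check described above could be sketched roughly as follows. This is a toy illustration under stated assumptions, not the code from the referenced commit: the type names are simplified strings, and the two mapping tables and both function names (`round_trips`, `plan_casts`) are hypothetical rather than Ibis's actual PyArrowType mappings.

```python
# Hypothetical, simplified mappings; Ibis's real tables are richer than this.
IBIS_TO_ARROW = {
    "int64": "int64",
    "string": "string",
    "uuid": "string",     # no dedicated Arrow type in this toy table
    "macaddr": "string",  # same situation as uuid
}
ARROW_TO_IBIS = {"int64": "int64", "string": "string"}


def round_trips(ibis_type):
    """True if converting Ibis -> Arrow -> Ibis gives back the same type."""
    arrow_type = IBIS_TO_ARROW[ibis_type]
    return ARROW_TO_IBIS.get(arrow_type) == ibis_type


def plan_casts(schema):
    """Map each column that does not round-trip to the type it should be cast to."""
    return {
        name: ARROW_TO_IBIS[IBIS_TO_ARROW[ibis_type]]
        for name, ibis_type in schema.items()
        if not round_trips(ibis_type)
    }


schema = {"species": "string", "year": "int64", "my_uuid": "uuid"}
casts = plan_casts(schema)
print(casts)  # {'my_uuid': 'string'}
```

In this sketch only `my_uuid` would be cast by the backend before the results reach Arrow. Columns whose types already round-trip are left untouched, which is why the approach changes nothing for queries that worked before.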

double-thinker · 2024-09-04T13:36:54Z

@cpcloud Do you think the approach I developed is worthwhile? I can submit a PR in the following days if there's interest.

kylebarron · 2024-09-16T13:46:23Z

At least for UUID, since Arrow has a canonical extension type for it, it seems like that would be a great way to maintain the type information.
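
For context on this suggestion: Arrow's canonical arrow.uuid extension type stores each value as 16 raw bytes (fixed-size binary), whereas the string approach stores the 36-character text form. Below is a small standard-library sketch of the two representations and the lossless round trip between them. It deliberately does not call pyarrow, and the variable names are illustrative only.

```python
import uuid

u = uuid.uuid4()

as_text = str(u)    # hyphenated text form, as returned by the string-based fix
as_bytes = u.bytes  # 16-byte big-endian form, the shape a 16-byte binary storage holds

# Both forms carry the full value and can be turned back into a uuid.UUID.
from_text = uuid.UUID(as_text)
from_bytes = uuid.UUID(bytes=as_bytes)

print(len(as_text), len(as_bytes))
print(from_text == u, from_bytes == u)
```

Either form preserves the value. What the string form loses is only the knowledge that the column holds UUIDs, which is the type information an extension type would keep.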

jitingxu1 added the bug Incorrect behavior inside of ibis label Mar 4, 2024

github-project-automation bot added this to Ibis planning and roadmap Mar 4, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Mar 4, 2024

jcrist mentioned this issue Mar 4, 2024

fix(sqlite): ensure ibis.uuid() generates a unique uuid per row #8535

Merged

kszucs pushed a commit that referenced this issue Mar 4, 2024

fix(sqlite): ensure ibis.uuid() generates a unique uuid per row (#8535)

c097a2d

Fixes part of #8532.

jcrist linked a pull request Mar 4, 2024 that will close this issue

fix: return UUID types as strings from .execute()/.to_pandas() #8538

Open

jitingxu1 mentioned this issue Mar 4, 2024

feat: add uuid #8539

Closed

jcrist self-assigned this Mar 4, 2024

jcrist moved this from backlog to cooking in Ibis planning and roadmap Mar 4, 2024

cpcloud changed the title ~~bug: same uuid for all rows and got arrow type error when show data with 'UUID' object~~ bug: arrow type error when show data with 'UUID' object Aug 10, 2024

cpcloud mentioned this issue Aug 11, 2024

bug: postgres.to_pyarrow(ibis.uuid()) errors #8902

Closed

1 task

double-thinker added a commit to double-thinker/ibis that referenced this issue Sep 2, 2024

fix(core): cast missing types in pyarrow consistenly with ibis mapping.

63b6ff5

fixes ibis-project#8532 and ibis-project#8902
