Can't write an Arrow table if it contains list #606

Open
timspro opened this issue Sep 18, 2024 · 4 comments

Comments

@timspro

timspro commented Sep 18, 2024

I expected the following code to work, but fromIPCStream() throws "RuntimeError: unreachable" when run in Node.js v20.17.0.

import { tableFromArrays, tableToIPC } from "apache-arrow"
import { Table } from "parquet-wasm"

const table = tableFromArrays({
  column: [[1, 2], [3, 4]],
})
const ipc = tableToIPC(table, "stream")
Table.fromIPCStream(ipc)

I tried changing "stream" to "file", but that also failed, this time with the error "Io error: failed to fill whole buffer".

I was able to get other examples working locally that didn't have a list (for example, column: [1, 2] and column: [{a: 1}, {a: 2}]).

It does work if using typed arrays: column: [new Int32Array([1, 2]), new Int32Array([3, 4])]. So, I do have a workaround. However, I originally wanted to write a list of structs with Int32 values and now will have to do a struct of typed arrays. Perhaps that is what is intended.

@kylebarron
Owner

If you compile with --debug flag turned on, then you can see the actual Rust error, instead of just RuntimeError: unreachable.

With the test in #607, the error is:

stderr | tests/js/index.test.ts > should read IPC stream correctly
panicked at /Users/kyle/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-ipc-53.0.0/src/convert.rs:98:30:
called `Option::unwrap()` on a `None` value

So the Rust code is panicking on this line: https://github.com/apache/arrow-rs/blob/5414f1d7c0683c64d69cf721a83c17d677c78a71/arrow-ipc/src/convert.rs#L98

If we load this data in pyarrow, we see:

In [1]: import pyarrow as pa

In [3]: pa.ipc.open_stream("data.arrows").read_all()
Out[3]:
pyarrow.Table
column: list<: double>
  child 0, : double
----
column: [[[1,2],[3,4]]]

So the list's inner field does not have a name set. I'm not sure if that's allowed by the spec (it's rare at least). Either the JS IPC writer or the Rust IPC reader is incorrect.
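The same check can be done from the JS side before writing IPC. The following is a minimal sketch, not part of either library: findUnnamedChildFields is a hypothetical helper, and it assumes the schema exposes `fields`, each field exposes `name` and `type`, and nested types carry a `children` array of fields (which is the shape apache-arrow JS appears to use; non-nested types are assumed to have no `children`).

```javascript
// Hypothetical helper: collect the paths of nested child fields whose name is empty.
// Assumes a schema-like shape: schema.fields -> { name, type }, and nested types
// (list, struct, ...) exposing `children` as an array of fields.
function findUnnamedChildFields(schema) {
  const unnamed = [];

  const visit = (field, path) => {
    // Non-nested types are assumed to have null/undefined children.
    const children = (field.type && field.type.children) || [];
    children.forEach((child, index) => {
      const positionalPath = `${path}[${index}]`;
      if (!child.name) {
        unnamed.push(positionalPath);
      }
      // Recurse, using the child's name in the path when it has one.
      visit(child, child.name ? `${path}.${child.name}` : positionalPath);
    });
  };

  for (const field of schema.fields) {
    visit(field, field.name);
  }
  return unnamed;
}
```

Under those assumptions, calling findUnnamedChildFields(table.schema) on the table from the original report would be expected to flag the inner field of `column`, while a table built with an explicitly named inner field should return an empty array.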

@kylebarron
Owner

I checked with @jorisvandenbossche and saw that the IPC spec doesn't require a name to be set, so this is an issue on the Rust side. (Though there should be a default name set)

@kylebarron
Owner

Created apache/arrow-rs#6415. Otherwise, you can work around this by manually setting a field name for any inner lists.

@timspro
Author

timspro commented Sep 18, 2024

Thanks for the commentary. The type inference done by tableFromArrays() is passing the empty name: https://github.com/apache/arrow/blob/main/js/src/factories.ts#L153.

I was then able to get around the issue by passing in the List type directly:

import { Field, Int32, List, tableFromArrays, tableToIPC, vectorFromArray } from "apache-arrow"
import { Table } from "parquet-wasm"

const table = tableFromArrays({
  column: vectorFromArray(
    [[1, 2], [3, 4]],
    new List(new Field("_", new Int32())) // fails if "" passed instead
  ),
})
const ipc = tableToIPC(table, "stream")
Table.fromIPCStream(ipc)

This is a fine workaround for me.
