[R] [parquet] Cannot load chunked large_list<dict> from parquet file #45087

Open
el-hult opened this issue Dec 20, 2024 · 0 comments
el-hult commented Dec 20, 2024

Describe the bug, including details regarding any error messages, version, and platform.

The R arrow library cannot load a file with schema

schema: codes: large_list<element: dictionary<values=string, indices=int32, ordered=0>>
  child 0, element: dictionary<values=string, indices=int32, ordered=0>

if the table is chunked. To reproduce, run the Python script below in an environment that also has R with arrow installed:

import pyarrow as pa
import pyarrow.parquet as pq
import subprocess

def test_load_parquet(table,label):
    pq.write_table(table, "t.parquet", row_group_size=1)
    res = subprocess.run(
        ["Rscript", "-e", 'library(arrow);t=arrow::read_parquet("t.parquet");'],
        capture_output=True,
    )
    print(f'{label}\n#####')
    if res.returncode != 0:
        stdErr = res.stderr.decode()
        assert "NotImplemented: Nested data conversions not implemented for chunked array outputs" in stdErr
        print('R  failed')
    else:
        print('R      ok')

    pq.read_table("t.parquet") # no error!
    print("python ok")
    print("schema:",pq.read_schema("t.parquet"))

codes = [["a"],["a"]]
t1 = pa.table({"codes": codes})
t2 = pa.table({"codes": codes}).cast(
    pa.schema({"codes": pa.large_list(pa.dictionary(pa.int32(), pa.string()))})
)
t3 = pa.table({"codes": codes}).cast(
    pa.schema({"codes": pa.list_(pa.dictionary(pa.int32(), pa.string()))})
)
test_load_parquet(t1,'t1')
test_load_parquet(t2,'t2')
test_load_parquet(t3,'t3')

which produces the output:

t1
#####
R      ok
python ok
schema: codes: list<element: string>
  child 0, element: string
t2
#####
R  failed
python ok
schema: codes: large_list<element: dictionary<values=string, indices=int32, ordered=0>>
  child 0, element: dictionary<values=string, indices=int32, ordered=0>
t3
#####
R  failed
python ok
schema: codes: list<element: dictionary<values=string, indices=int32, ordered=0>>
  child 0, element: dictionary<values=string, indices=int32, ordered=0>

I have verified this is an issue in R library versions 13.0.0.0 and 18.1.0; both list_ and large_list fail.

The error reported by the R library is discussed in #32723, but since reading works in pyarrow, I suspect this is a separate issue from the C++ one.

Component(s)

Parquet, R

@el-hult el-hult changed the title Cannot load chunked large_list<dict> from chunked parquet file [R] [parquet] Cannot load chunked large_list<dict> from parquet file Dec 20, 2024