[R] [parquet] Cannot load chunked large_list<dict> from parquet file #45087

Open
el-hult opened this issue Dec 20, 2024 · 0 comments
el-hult commented Dec 20, 2024

Describe the bug, including details regarding any error messages, version, and platform.

The R arrow library cannot load a file with schema

schema: codes: large_list<element: dictionary<values=string, indices=int32, ordered=0>>
  child 0, element: dictionary<values=string, indices=int32, ordered=0>

if the table is chunked. To reproduce, run the Python script below in an environment that also has R with arrow installed:

import pyarrow as pa
import pyarrow.parquet as pq
import subprocess

def test_load_parquet(table,label):
    pq.write_table(table, "t.parquet", row_group_size=1)
    res = subprocess.run(
        ["Rscript", "-e", 'library(arrow);t=arrow::read_parquet("t.parquet");'],
        capture_output=True,
    )
    print(f'{label}\n#####')
    if res.returncode != 0:
        stdErr = res.stderr.decode()
        assert "NotImplemented: Nested data conversions not implemented for chunked array outputs" in stdErr
        print('R  failed')
    else:
        print('R      ok')

    pq.read_table("t.parquet") # no error!
    print("python ok")
    print("schema:",pq.read_schema("t.parquet"))

codes = [["a"],["a"]]
t1 = pa.table({"codes": codes})
t2 = pa.table({"codes": codes}).cast(
    pa.schema({"codes": pa.large_list(pa.dictionary(pa.int32(), pa.string()))})
)
t3 = pa.table({"codes": codes}).cast(
    pa.schema({"codes": pa.list_(pa.dictionary(pa.int32(), pa.string()))})
)
test_load_parquet(t1,'t1')
test_load_parquet(t2,'t2')
test_load_parquet(t3,'t3')

which produces the output:

t1
#####
R      ok
python ok
schema: codes: list<element: string>
  child 0, element: string
t2
#####
R  failed
python ok
schema: codes: large_list<element: dictionary<values=string, indices=int32, ordered=0>>
  child 0, element: dictionary<values=string, indices=int32, ordered=0>
t3
#####
R  failed
python ok
schema: codes: list<element: dictionary<values=string, indices=int32, ordered=0>>
  child 0, element: dictionary<values=string, indices=int32, ordered=0>

I have verified this is an issue in R library versions 13.0.0.0 and 18.1.0; both list_ and large_list fail.

The error reported by the R library is discussed in #32723, but since reading works in pyarrow, I suspect this is a separate issue from the C++ one.

Component(s)

Parquet, R

@el-hult el-hult changed the title Cannot load chunked large_list<dict> from chunked parquet file [R] [parquet] Cannot load chunked large_list<dict> from parquet file Dec 20, 2024