Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: BIGQUERY backend generates invalid query when calling .distinct on subset on table with array column #10553

Open
1 task done
greg-offerfit opened this issue Dec 4, 2024 · 1 comment
Labels
bug Incorrect behavior inside of ibis

Comments

@greg-offerfit
Copy link

What happened?

Given a GBQ table with the following schema:

id: int
int_ids: array<!int64>  (int64 REPEATED)

Calling table.distinct(on=["id"]).execute() fails with the following error:

google.api_core.exceptions.BadRequest: 400 The argument to ARRAY_AGG must not be an array type but was ARRAY at [7:5]; reason: invalidQuery, location: query, message: The argument to ARRAY_AGG must not be an array type but was ARRAY at [7:5]

I expected this to generate a valid query and return a DataFrame.

See below comment for minimal reproduction.

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

BigQuery

Relevant log output

google.api_core.exceptions.BadRequest: 400 The argument to ARRAY_AGG must not be an array type but was ARRAY<INT64> at [7:5]; reason: invalidQuery, location: query, message: The argument to ARRAY_AGG must not be an array type but was ARRAY<INT64> at [7:5]

Code of Conduct

  • I agree to follow this project's Code of Conduct
@greg-offerfit greg-offerfit added the bug Incorrect behavior inside of ibis label Dec 4, 2024
@greg-offerfit
Copy link
Author

TO REPRODUCE:

Download the below json file and save it somewhere accessible by Python/Ibis prompt. The json file is just a single row that generates a GBQ table with the appropriate schema and values

Run the below code (or equivalent to you):

import ibis

con = ibis.bigquery.connect()  # INSERT ANY PROJECT/DATASET ARGS YOU NEED HERE
table = con.read_json(PATH_TO_DOWNLOADED_JSON_FILE)
dedupe = table.distinct(on=["id"]).execute()  # ERRORS HERE

The generated query is:

SELECT `t1`.`int_ids`, `t1`.`id` FROM (SELECT `t0`.`id`, ARRAY_AGG(`t0`.`int_ids` IGNORE NULLS LIMIT 1)[safe_offset(0)] AS `int_ids` FROM `TABLE` AS `t0` GROUP BY 1) AS `t1`

Which GBQ marks as invalid with the following error:
The argument to ARRAY_AGG must not be an array type but was ARRAY<INT64> at [1:58]

array_sample.json

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Status: backlog
Development

No branches or pull requests

1 participant