
Add ARRAY support to PostgresqlReader #264

Closed
criccomini opened this issue Jun 14, 2023 · 11 comments · Fixed by #410

Comments

@criccomini
Contributor

Recap's PostgresqlReader does not currently support PG arrays as documented here:

https://www.postgresql.org/docs/current/arrays.html

The reader should be updated to support this type.

@criccomini
Contributor Author

Arrays in PG are tricky: information_schema.columns doesn't report their element type, and they can have an arbitrary number of dimensions (int[] is not one-dimensional; a column declared int[] can store arrays with any number of dimensions).

https://www.postgresql.org/docs/current/arrays.html

@alexdemeo
Contributor

alexdemeo commented Dec 18, 2023

Hey Chris. I'm using Recap and looking to leverage array functionality with Postgres. I encountered this issue and had some questions about implementing this functionality. It seems it's possible to derive the innermost type from UDT_NAME (slicing off the underscore prefix, e.g. "UDT_NAME": "_int4"). Does that sound accurate? As for the multidimensional array issue, do you see a path forward for adding array support that's agnostic to the dimensionality?

@criccomini
Contributor Author

Hi! I'll take a look at this tomorrow (hopefully). Spent all day fighting Hive and I'm out of gas now. :P

@criccomini
Contributor Author

Was reading over this just now.

data_type character_data
Data type of the column, if it is a built-in type, or ARRAY if it is some array (in that case, see the view element_types), ...

So the element_types view seems promising. I'm curious how it handles multi-dimensional arrays, if at all. Need to poke around more.
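For reference, the join from columns to element_types might look roughly like this (an untested sketch based on the information_schema docs; the exact join columns should be double-checked):

```python
# Untested sketch: per the PostgreSQL information_schema docs, element_types
# rows are matched back to the owning column via the object_* columns and by
# joining the column's dtd_identifier to collection_type_identifier.
ELEMENT_TYPES_SQL = """
SELECT c.table_name, c.column_name, e.data_type AS element_type
FROM information_schema.columns AS c
JOIN information_schema.element_types AS e
  ON c.table_catalog  = e.object_catalog
 AND c.table_schema   = e.object_schema
 AND c.table_name     = e.object_name
 AND e.object_type    = 'TABLE'
 AND c.dtd_identifier = e.collection_type_identifier
WHERE c.data_type = 'ARRAY'
"""
```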

@criccomini
Contributor Author

After Googling around a bit more, it still seems like there's no way to get the number of dimensions from ARRAY columns. And as you pointed out, UDT_NAME in the columns view has what we need for the types.

Next question is how to represent this in Recap. Maybe:

type: list
name: baz
alias: foo.bar.Baz
values:
    - type: int32
    - type: foo.bar.Baz

This defines a list that can hold either int32 (if the column has udt_name=_int4) or a list of the same. The cyclic reference makes it possible to have arbitrarily deep nesting as well as mixed lengths. I think this is what PG supports.

WDYT? The downside to this (IMO) is that it's pretty complex to deal with for the user. But it is actually what you can get from PG, so...
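To make the cyclic reference concrete, here's a minimal sketch (plain Python dicts, not Recap's type system) showing how a self-referencing list type admits arbitrary depth and mixed lengths:

```python
# Minimal sketch (plain Python, not Recap's type system): a list type whose
# "values" include an alias pointing back to itself admits arbitrary nesting.
ALIASES = {
    "foo.bar.Baz": {
        "type": "list",
        "values": [{"type": "int32"}, {"type": "foo.bar.Baz"}],
    }
}

def resolve(schema):
    # Replace an alias reference with the schema it names.
    return ALIASES.get(schema["type"], schema)

def matches(value, schema):
    schema = resolve(schema)
    if schema["type"] == "int32":
        return isinstance(value, int)
    if schema["type"] == "list":
        return isinstance(value, list) and all(
            any(matches(v, s) for s in schema["values"]) for v in value
        )
    return False

baz = ALIASES["foo.bar.Baz"]
print(matches([1, 2, 3], baz))      # True: one dimension
print(matches([[1], [2, 3]], baz))  # True: nested, mixed lengths
print(matches([1, [2]], baz))       # True: mixed depth is also admitted
```

The last case illustrates the looseness discussed above: the recursive schema can't rule out mixed depths or lengths within a dimension, even though PG itself rejects them.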

@alexdemeo
Contributor

I think this makes sense! It's unfortunate dimensionality isn't part of the schema, but given that, I agree the cyclic reference is the way to go. PG doesn't support mixed lengths within a particular dimension, but I don't think there's a good user-friendly way to represent that constraint.

@criccomini
Contributor Author

So, another observation. For types like varchar(255)[], there's no way in PG to get the 255 octet length back. It allows you to define tables that way, but I don't think it's enforced?

@criccomini
Contributor Author

This is interesting, from ChatGPT:

[Screenshot of a ChatGPT response, 2023-12-20]

I was wondering how PostgreSQL handles bit(8)[] columns. It sounds like it doesn't convert it to bit varying[] the way it does with VARCHAR. But it also doesn't use CHARACTER_MAXIMUM_LENGTH the way bit(8) columns would.

@alexdemeo
Contributor

Oh interesting, it sounds like all this is actually doable with some edge-casing. We can also use pg_attribute.attndims to get the dimensionality. I guess it is stored, just not enforced.

criccomini added a commit that referenced this issue Dec 20, 2023
Recap's PostgreSQL converter now supports ARRAY column types. The
implementation uses `UDT_NAME` in `information_schema.columns` to get the type
of the array elements.

There are some weird behaviors with the implementation because of the way
PostgreSQL handles arrays. Importantly, the element size is ignored (int(4)[]
is just treated as int[]). PostgreSQL also treats all arrays as having an
arbitrary number of dimensions, so int[] is the same as int[][], and so on.

See #264 for more discussion on the
nuances of this implementation.

Closes #264
@criccomini
Contributor Author

Here's my PR right now:

#410

I am not planning to implement the pg_attribute stuff right now. It just uses maximal defaults. If you want to submit a follow-on PR to add that support, you're welcome to. :)

@criccomini
Contributor Author

@alexdemeo recap-core 0.9.5 is on pypi with the latest changes for PostgreSQL arrays. Let me know if you have any issues with it. I'll try and update the docs this afternoon.

criccomini pushed a commit that referenced this issue Jan 3, 2024
Add array support to postgres reader/converter with dimensionality read
from `pg_attribute.attndims`. This is grabbed by joining on
`information_schema.columns.column_name = pg_attribute.attname`.

This is a follow-up to #410, which
lays a tiny bit of groundwork for
#264 (comment)
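The join described in that commit might look roughly like this (an untested sketch; the follow-up PR's actual query may differ, and matching on attname alone is ambiguous across tables, so pg_class is joined in here to scope the match):

```python
# Untested sketch of the described join. attname/column_name matching alone
# is ambiguous across relations, so pg_class scopes it to the right table.
ATTNDIMS_SQL = """
SELECT c.table_name, c.column_name, c.udt_name, a.attndims
FROM information_schema.columns AS c
JOIN pg_catalog.pg_class AS r
  ON r.relname = c.table_name
JOIN pg_catalog.pg_attribute AS a
  ON a.attrelid = r.oid
 AND a.attname  = c.column_name
WHERE c.data_type = 'ARRAY'
"""
```

Note that `attndims` records the number of dimensions from the column declaration, but as discussed above, PostgreSQL doesn't enforce it at write time.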