support PG array dimensionality #411
Changes from 6 commits
b6847a7
dbf2a91
e9e13e8
183ee64
09dc949
2cc2753
1c97a1e
```diff
@@ -8,6 +8,7 @@
     FloatType,
     IntType,
     ListType,
+    NullType,
     ProxyType,
     RecapType,
     RecapTypeRegistry,
```
```diff
@@ -24,7 +25,15 @@
 class PostgresqlConverter(DbapiConverter):
-    def __init__(self, namespace: str = DEFAULT_NAMESPACE) -> None:
+    def __init__(
+        self,
+        ignore_array_dimensionality: bool = True,
+        namespace: str = DEFAULT_NAMESPACE,
+    ):
+        # since array dimensionality is not enforced by PG schemas:
+        # if `ignore_array_dimensionality = True` then read arrays irrespective of how many dimensions they have
+        # if `ignore_array_dimensionality = False` then read arrays as nested lists
+        self.ignore_array_dimensionality = ignore_array_dimensionality
         self.namespace = namespace
         self.registry = RecapTypeRegistry()
```

Review comments on the new `ignore_array_dimensionality` parameter:

> I wonder if the negation

> My vote is for 'enforce_array_dimensions'
```diff
@@ -34,6 +43,7 @@ def _parse_type(self, column_props: dict[str, Any]) -> RecapType:
         octet_length = column_props["CHARACTER_OCTET_LENGTH"]
         max_length = column_props["CHARACTER_MAXIMUM_LENGTH"]
         udt_name = (column_props["UDT_NAME"] or "").lower()
+        ndims = column_props["ATTNDIMS"]

         if data_type in ["bigint", "int8", "bigserial", "serial8"]:
             base_type = IntType(bits=64, signed=True)
```
```diff
@@ -102,29 +112,44 @@ def _parse_type(self, column_props: dict[str, Any]) -> RecapType:
                     # * 8 because bit columns use bits not bytes.
                     "CHARACTER_MAXIMUM_LENGTH": MAX_FIELD_SIZE * 8,
                     "UDT_NAME": None,
+                    "ATTNDIMS": 0,
                 }
             )
-            column_name_without_periods = column_name.replace(".", "_")
-            base_type_alias = f"{self.namespace}.{column_name_without_periods}"
-            # Construct a self-referencing list comprised of the array's value
-            # type and a proxy to the list itself. This allows arrays to be an
-            # arbitrary number of dimensions, which is how PostgreSQL treats
-            # lists. See https://github.com/recap-build/recap/issues/264 for
-            # more details.
-            base_type = ListType(
-                alias=base_type_alias,
-                values=UnionType(
-                    types=[
-                        value_type,
-                        ProxyType(
-                            alias=base_type_alias,
-                            registry=self.registry,
-                        ),
-                    ],
-                ),
-            )
-            self.registry.register_alias(base_type)
+            if self.ignore_array_dimensionality:
+                column_name_without_periods = column_name.replace(".", "_")
+                base_type_alias = f"{self.namespace}.{column_name_without_periods}"
+                # Construct a self-referencing list comprised of the array's value
+                # type and a proxy to the list itself. This allows arrays to be an
+                # arbitrary number of dimensions, which is how PostgreSQL treats
+                # lists. See https://github.com/recap-build/recap/issues/264 for
+                # more details.
+                base_type = ListType(
+                    alias=base_type_alias,
+                    values=UnionType(
+                        types=[
+                            value_type,
+                            ProxyType(
+                                alias=base_type_alias,
+                                registry=self.registry,
+                            ),
+                        ],
+                    ),
+                )
+                self.registry.register_alias(base_type)
+            else:
+                base_type = self._create_n_dimension_list(value_type, ndims)
         else:
             raise ValueError(f"Unknown data type: {data_type}")

         return base_type

+    def _create_n_dimension_list(self, base_type: RecapType, ndims: int) -> RecapType:
+        """
+        Build a list type with `ndims` dimensions containing nullable
+        `base_type` as the innermost value type.
+        """
+        if ndims == 0:
+            return UnionType(types=[NullType(), base_type])
+        else:
+            return ListType(
+                values=self._create_n_dimension_list(base_type, ndims - 1),
+            )
```

Review comments on the nullable `ndims == 0` base case:

> I'm curious about this one. It seems right, but I'm not 100% sure. As I read it, there are a few things:
>
> I think this is the right behavior. But I'm curious: are PG arrays always allowed NULLs in their dimensional values? I couldn't find good docs on this. I haven't tested it out.

> I did some testing and digging, and afaict the answer is yes: the innermost value can always be null. Enforcing non-nulls requires adding some sort of validation to
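The recursion in `_create_n_dimension_list` can be sketched in isolation. The dataclasses below are simplified stand-ins for Recap's actual type classes (an assumption: the real classes carry more state), kept only to show that `ndims` dimensions unwind into exactly `ndims` nested lists wrapped around a nullable element type:

```python
from dataclasses import dataclass, field

# Simplified stand-ins for Recap's type classes; only nesting structure
# matters for this sketch.
@dataclass
class NullType:
    pass

@dataclass
class IntType:
    bits: int = 32

@dataclass
class UnionType:
    types: list = field(default_factory=list)

@dataclass
class ListType:
    values: object = None

def create_n_dimension_list(base_type, ndims: int):
    # Mirrors the recursion in the diff: peel one dimension per call,
    # bottoming out in a nullable union around the element type.
    if ndims == 0:
        return UnionType(types=[NullType(), base_type])
    return ListType(values=create_n_dimension_list(base_type, ndims - 1))

schema = create_n_dimension_list(IntType(bits=32), 2)
# Two nested ListTypes, then union(null, int32) at the core.
assert isinstance(schema, ListType)
assert isinstance(schema.values, ListType)
assert isinstance(schema.values.values, UnionType)
assert isinstance(schema.values.values.types[0], NullType)
```

Note that `ndims == 0` (a non-array column reaching this path, or a missing `ATTNDIMS`) degenerates to just the nullable element type with no list wrapper at all.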
> Niiiice. Took me a sec to grok why we didn't need this anymore. Fully walking the n_dimensions means we don't need self-references. Awesome.
>
> One question/nuance here: the PG dimensions are just a suggestion.
>
> https://www.postgresql.org/docs/current/arrays.html
>
> So the question is: do we want Recap to reflect the DB's data or its schema? My implementation (with ProxyType) reflected the data. Yours changes it to reflect the schema. Perhaps we want it configurable, with one as the default? WDYT?
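The quoted PostgreSQL behavior is easy to picture from the client side: two rows of the same declared `int[][]` column can come back with different nesting depths, so a fixed-depth schema describes the declaration while the ProxyType schema describes whatever the data actually does. A small sketch, using plain Python lists as a stand-in for fetched array values (an assumption about the driver's representation):

```python
def depth(value) -> int:
    # Nesting depth of a fetched array value; 0 for a scalar.
    d = 0
    while isinstance(value, list):
        d += 1
        value = value[0] if value else None
    return d

# Both of these could legally live in the same column declared int[][],
# since PostgreSQL does not enforce declared dimensionality.
row_a = [[1, 2], [3, 4]]   # matches the declared 2 dimensions
row_b = [1, 2, 3]          # 1-D value, still accepted by PG

assert depth(row_a) == 2
assert depth(row_b) == 1   # data depth (1) != schema-derived ndims (2)
```

This is the crux of the data-versus-schema question above: with `ignore_array_dimensionality=False`, `row_b` would not conform to the generated schema even though PostgreSQL stored it without complaint.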
> I like to think the schema is the beacon of truth for what the user intends for the column. If users are leveraging the column differently than the schema's representation, they should fix the schema. But I could see past mistakes leading to a situation where this isn't true, which would then lead to Recap constructing a false narrative about the data. I think making it configurable makes sense. Maybe default to ProxyType since that's the safer assumption? Would we want to add config params to the `PostgresqlConverter` constructor?

> Ya, can you add a param to the init to config this? Defaulting to proxy is safer, as you say.