Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adds Schema class and modifies schema handling #2081

Open
wants to merge 25 commits into
base: pangea-v1alpha
Choose a base branch
from

Conversation

chalmerlowe
Copy link
Collaborator

@chalmerlowe chalmerlowe commented Nov 27, 2024

This PR adds a Schema class to python-bigquery as part of the Google internal PANGEA work.

Historically, table.schema has been a list of SchemaFields.

Schema objects will now include not only a list of fields, but will have an attribute called foreign_type_info.

The aim of the work here is create, as much as possible, a drop-in replacement for schema.

The end user should not need to know whether they are looking at a list-based schema OR a class Schema-based object unless they need access to the foreign_type_info attr.

They should still be able to:

  • iterate over a Schema just like you would a list
  • index and slice the Schema just like a list

For users who already have a list in their code, nothing should break.

For future users who will need a Schema object, they should be able to instantiate one and it should just work.

TODO:

  • add docstrings
  • add type hints
  • remove or modify comment strings in the code to help future maintainers

@chalmerlowe chalmerlowe requested review from a team as code owners November 27, 2024 19:00
@chalmerlowe chalmerlowe requested a review from hongalex November 27, 2024 19:00
Copy link

conventional-commit-lint-gcf bot commented Nov 27, 2024

🤖 I detect that the PR title and the commit message differ and there's only one commit. To use the PR title for the commit history, you can use Github's automerge feature with squashing, or use automerge label. Good luck human!

-- conventional-commit-lint bot
https://conventionalcommits.org/

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery API. labels Nov 27, 2024
@chalmerlowe chalmerlowe assigned chalmerlowe and unassigned obada-ab Dec 2, 2024
@chalmerlowe chalmerlowe added kokoro:run Add this label to force Kokoro to re-run the tests. kokoro:force-run Add this label to force Kokoro to re-run the tests. and removed kokoro:run Add this label to force Kokoro to re-run the tests. labels Dec 4, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Dec 5, 2024
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Dec 6, 2024
@@ -3680,7 +3680,7 @@ def insert_rows(
if selected_fields is not None:
schema = selected_fields

if len(schema) == 0:
if not schema:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, could you make len(schema_object) work? I guess this is to allow None as schema now, too? If table.schema could indeed return None now, that might be considered a breaking change, depending on how strict we want to be about types.

Either way, folks might have to change their code to deal with missing schema, so something to consider. Maybe we return an object of len(...) == 0 for now and have a warning that it will be changed to None in future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike the idiom of schema-less tables, but it is indeed something we've seen people do intentionally in the wild. My recollection is that it was used historically as a signal for some workflow state (e.g. some kind of is-table-present check).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that Table.schema can currently be set to None. It appears that we allow for that in the existing code:

https://github.com/googleapis/python-bigquery/blob/d4d39acb8574f0d06d4e490b859e5fe6b57d0d9e/google/cloud/bigquery/table.py#L458C1-L466C84

@schema.setter
def schema(self, value):
    api_field = self._PROPERTY_TO_API_FIELD["schema"]
    if value is None:
        self._properties[api_field] = None
    else:
        value = _to_schema_fields(value)
        self._properties[api_field] = {"fields": _build_schema_resource(value)}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched it back to the old check.

@@ -4029,7 +4029,7 @@ def list_rows(

# No schema, but no selected_fields. Assume the developer wants all
# columns, so get the table resource for them rather than failing.
elif len(schema) == 0:
elif not schema:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Let's be careful about breaking changes regarding table.schema allowing None.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also switched this back to the old check.

Exception: If ``schema`` is not a sequence, or if any item in the
sequence is not a :class:`~google.cloud.bigquery.schema.SchemaField`
instance or a compatible mapping representation of the field.
"""TODO docstring
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging with a comment just so we don't lose track of this TODO.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -796,8 +792,6 @@ def serde_info(self) -> Any:
prop = _get_sub_prop(self._properties, ["serDeInfo"])
if prop is not None:
prop = StorageDescriptor().from_api_repr(prop)
print(f"DINOSAUR prop: {prop}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, oops! 😆 Could be worth a separate cleanup PR to have something small and easy to approve.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

self._properties["foreignTypeInfo"] = value

@property
def _fields(self) -> Any:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why private? I guess because we're trying to emulate a list and want to support mutation that way?

Would be worth an explanation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, the idea behind having it be private was to try and emulate a list.
I can add a note to that effect when I add docstrings.

return _parse_schema_resource(prop)
elif isinstance(prop, Schema):
return prop
return _parse_schema_resource(prop)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably we want to return a Schema object now.

Suggested change
return _parse_schema_resource(prop)
return Schema.from_api_repr(prop)

Aside: Is _parse_schema_resource dead code? I assume it still returns a list of SchemaField?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_schema_resource still returns a list of SchemaFields for backwards compatibility.

Where possible I -tried- to stay consist about what the user provides versus gets back: if someone handed the library a list of SchemaFields (i.e. cause that is what they are used to) then we maintained that state. I tried to avoid randomly converting lists into Schema objects to avoid user surprise.


@schema.setter
def schema(self, value):
api_field = self._PROPERTY_TO_API_FIELD["schema"]

if value is None:
self._properties[api_field] = None
elif isinstance(value, Schema):
self._properties[api_field] = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._properties[api_field] = value
self._properties[api_field] = value.to_api_repr()

Dict[str, Any]:
A dictionary in the format used by the BigQuery API.
"""
return copy.deepcopy(self._properties)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a bit wasteful to make deepcopy here and in from_api_repr. Indeed it's safer, but could add a lot of overhead. IIRC we actually removed some deepcopy calls from SchemaField because it was slowing down customers who build dynamic schemas in their code.

See: #6 and #26

An instance of the class initialized with data from 'resource'.
"""
config = cls("")
config._properties = copy.deepcopy(resource)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. May be better to omit the deepcopy.

@@ -452,17 +453,20 @@ def schema(self):
instance or a compatible mapping representation of the field.
"""
prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"])
if not prop:
if not prop: # if empty Schema, empty list, None
return []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we return an "empty" Schema here for consistency?

Suggested change
return []
return Schema()

@tswast
Copy link
Contributor

tswast commented Dec 6, 2024

Thanks! Overall goals outlined in the description look good to me.

The aim of the work here is create, as much as possible, a drop-in replacement for schema.

The end user should not need to know whether they are looking at a list-based schema OR a class Schema-based object unless they need access to the foreign_type_info attr.

They should still be able to:

  • iterate over a Schema just like you would a list
  • index and slice the Schema just like a list

For users who already have a list in their code, nothing should break.

For future users who will need a Schema object, they should be able to instantiate one and it should just work.

@_fields.setter
def _fields(self, value: list) -> None:
value = _isinstance_or_raise(value, list, none_allowed=True)
self._properties["_fields"] = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why _fields. It's fields in the REST API. Also, this should convert to the list to API representation. https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TableSchema

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed _fields to fields. Changed tests to match.

yu-iskw and others added 12 commits January 2, 2025 18:37
* feat: add property for maxStaleness in table definitions

Signed-off-by: Yu Ishikawa <[email protected]>

* Update google/cloud/bigquery/table.py

---------

Signed-off-by: Yu Ishikawa <[email protected]>
Co-authored-by: Lingqing Gan <[email protected]>
* add type hints

* Update client.py

Moves import from being used solely during specific checks to being more universally available.

* Update google/cloud/bigquery/client.py

* Update client.py

testing some minor changes to deal with mypy quirks

* Update google/cloud/bigquery/client.py

---------

Co-authored-by: Chalmer Lowe <[email protected]>
Source-Link: googleapis/synthtool@e808c98
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:8e3e7e18255c22d1489258d0374c901c01f9c4fd77a12088670cd73d580aa737

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](pallets/jinja@3.1.4...3.1.5)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…hemaField` (#2097)

* feat: preserve unknown fields from the REST API representaton in `SchemaField`

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* remove unnecessary variable

* remove unused private method

* fix pytype

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
* fix: adds test of roundingmode as a str

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
@chalmerlowe chalmerlowe added kokoro:run Add this label to force Kokoro to re-run the tests. and removed kokoro:run Add this label to force Kokoro to re-run the tests. labels Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants