fix(parquet): Avoid SEGV if table column type does not match file column type #12350
base: main
Conversation
If a user defines a table, for example in Hive, where the column types don't match the file column types, a SEGV might occur. Specifically, the SEGV has been observed when the Parquet file contains a VARCHAR column but the table defines an INTEGER column instead, and the data is accessed via TableScan using a basic `select *`. The resulting vector is of type string, but it doesn't match the table metadata and can cause a SEGV in the PartitionedOutput operator. This also prevents issues and errors coming from the readers when they encounter types that are not part of the switch, for example, defining a VARCHAR column when the file column is an INTEGER.
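The fix described above amounts to a guard that compares the table's requested type against the type decoded from the file and raises a descriptive error instead of letting a mismatched vector reach downstream operators. The following is a minimal standalone sketch of that idea, not the actual Velox code; the `TypeKind` enum and `checkRequestedType` helper are hypothetical stand-ins for Velox's `TypePtr` and `VELOX_CHECK` machinery.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for Velox type kinds.
enum class TypeKind { INTEGER, BIGINT, REAL, DOUBLE, VARCHAR };

std::string toString(TypeKind k) {
  switch (k) {
    case TypeKind::INTEGER: return "INTEGER";
    case TypeKind::BIGINT: return "BIGINT";
    case TypeKind::REAL: return "REAL";
    case TypeKind::DOUBLE: return "DOUBLE";
    case TypeKind::VARCHAR: return "VARCHAR";
  }
  return "UNKNOWN";
}

// Throw a descriptive error when the table schema disagrees with the
// file schema, instead of letting a mismatched vector reach the
// PartitionedOutput operator (where it previously caused a SEGV).
void checkRequestedType(TypeKind fileType, TypeKind requestedType) {
  if (fileType != requestedType) {
    throw std::runtime_error(
        "Requested type " + toString(requestedType) +
        " does not match file column type " + toString(fileType));
  }
}
```

A matching pair passes silently; the VARCHAR-file/INTEGER-table case from the description now raises an error instead of crashing.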
New E2E test output instead of SEGV:
Thanks, @czentgr
```cpp
// if provided.
if (requestedType) {
  VELOX_CHECK(
      veloxType->equivalent(*requestedType),
```
This does not need to be an exact match; some schema evolution should be supported (e.g., INTEGER -> BIGINT). You should add individual checks inside convertType() for each converted type.
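One way to read "individual checks inside convertType()": each case of the switch validates the requested table type against the conversions that particular file type can support, and errors out otherwise. The sketch below is illustrative only; the function shape, enum, and names are hypothetical and do not reflect the real signatures in the Parquet reader.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for Velox type kinds.
enum class TypeKind { INTEGER, BIGINT, REAL, DOUBLE, VARCHAR };

// Sketch: each file type individually validates the requested type,
// permitting only the schema evolution it supports.
TypeKind convertType(TypeKind fileType, TypeKind requestedType) {
  switch (fileType) {
    case TypeKind::INTEGER:
      // Integer widening (e.g. INTEGER -> BIGINT) is allowed.
      if (requestedType == TypeKind::INTEGER ||
          requestedType == TypeKind::BIGINT) {
        return requestedType;
      }
      break;
    case TypeKind::REAL:
      // REAL may widen to DOUBLE, but never to an integer type.
      if (requestedType == TypeKind::REAL ||
          requestedType == TypeKind::DOUBLE) {
        return requestedType;
      }
      break;
    case TypeKind::VARCHAR:
      if (requestedType == TypeKind::VARCHAR) {
        return requestedType;
      }
      break;
    default:
      break;
  }
  throw std::runtime_error(
      "Unsupported schema evolution for file column type");
}
```

With this shape, INTEGER -> BIGINT succeeds while VARCHAR -> INTEGER raises a clear unsupported-evolution error rather than falling through to a crash.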
This makes sense! We should confirm what schema evolution is currently supported.
At file level we support these:
- Struct field rename
- Additional field at end of struct
- Type widening: all integer types can be widened; REAL can be widened to DOUBLE; there is no conversion between floating-point types and integer types
Check out TableEvolutionFuzzer to see some examples (ideally we want to enable it for Parquet as well): https://github.com/facebookincubator/velox/blob/main/velox/exec/tests/TableEvolutionFuzzerTest.cpp#L140
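The widening rules listed above can be captured in a small predicate: any integer type may widen to an equal or wider integer type, REAL may widen to DOUBLE, and integer and floating-point kinds never convert to each other. A standalone sketch with hypothetical names (the `Kind` enum is ordered narrowest to widest so the integer comparison works):

```cpp
#include <cassert>

// Hypothetical type kinds, ordered narrowest to widest within each family.
enum class Kind { TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE };

bool isInteger(Kind k) {
  return k == Kind::TINYINT || k == Kind::SMALLINT ||
         k == Kind::INTEGER || k == Kind::BIGINT;
}

// Widening is allowed within the integer family (to an equal or wider
// type) and from REAL to DOUBLE; never between integers and floats.
bool isWideningAllowed(Kind from, Kind to) {
  if (isInteger(from) && isInteger(to)) {
    return static_cast<int>(from) <= static_cast<int>(to);
  }
  if (from == Kind::REAL) {
    return to == Kind::REAL || to == Kind::DOUBLE;
  }
  return from == to;
}
```

Such a predicate could back the per-type checks in convertType(): allowed pairs proceed, everything else gets an unsupported-evolution error.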
Struct field rename PR is here #5962
We need an overall design for schema evolution.
In this PR, we should at least throw a reasonable unsupported error instead of SEGV.
> You should add individual checks inside convertType() for each converted type.
Let's do this as a starting point. We can error out for unsupported schema evolution.
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Fixes: #12349