Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(parquet): Avoid SEGV if table column type does not match file column type #12350

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

czentgr
Copy link
Collaborator

@czentgr czentgr commented Feb 15, 2025

If a user defines a table, for example in Hive, where the column types don’t match the file column types SEGV might occur. Specifically, the SEGV has been observed if the parquet file contains a VARCHAR column but the table defines an INTEGER column instead and the data is accessed via TableScan using a basic select * .

The resulting vector is of type string but it doesn’t match the table metadata and can cause a SEGV in the PartitionedOutput operator.

This also prevents issues and error coming from the readers when they encounter types that are not part of the switch, for example, defining a VARCHAR type column when the file column is an INTEGER.

Fixes: #12349

…mn type

If a user defines a table, for example in Hive, where the column types don’t match
the file column types SEGV might occur. Specifically, the SEGV has been observed
if the parquet file contains a VARCHAR column but the table defines an INTEGER
column instead and the data is accessed via TableScan using a basic select * .

The resulting vector is of type string but it doesn’t match the table metadata and
can cause a SEGV in the PartitionedOutput operator.

This also prevents issues and error coming from the readers when they encounter
types that are not part of the switch, for example, defining a VARCHAR type column
when the file column is an INTEGER.
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 15, 2025
Copy link

netlify bot commented Feb 15, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit badb13e
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67b00fa8b3700600088396a3

@czentgr
Copy link
Collaborator Author

czentgr commented Feb 15, 2025

New E2E test output instead of SEGV:

presto:default> select var from segv_int_var;

Query 20250215_035213_00000_xqdxq, FAILED, 1 node
Splits: 2 total, 0 done (0.00%)
[Latency: client-side: 0:01, server-side: 0:01] [0 rows, 0B] [0 rows/s, 0B/s]

Query 20250215_035213_00000_xqdxq failed: veloxType->equivalent(*requestedType) Converted type VARCHAR does not match the requested type INTEGER Split [Hive: file:/Users/czentgr/presto/data/hive_data/default/segv_int_var/20250214_032315_00002_jukim.1.0.0.0_0_5_3fb8b160-a7fe-4c52-855a-1c77c857333c.parquet 0 - 354] Task 20250215_035213_00000_xqdxq.1.0.0.0 Operator: TableScan[0] 0

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @czentgr

@majetideepak majetideepak marked this pull request as ready for review February 17, 2025 14:39
@majetideepak majetideepak changed the title fix(parquet): Fix SEGV if table column type does not match file column type fix(parquet): Avoid SEGV if table column type does not match file column type Feb 17, 2025
@majetideepak majetideepak added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Feb 17, 2025
// if provided.
if (requestedType) {
VELOX_CHECK(
veloxType->equivalent(*requestedType),
Copy link
Contributor

@Yuhta Yuhta Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not need to be exact match, some schema evolution should be supported (e.g. INTEGER->BIGINT). You should add individual checks inside convertType() for each converted type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense! We should confirm what schema evolution is currently supported.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At file level we support these:

  • Struct field rename
  • Additional field at end of struct
  • Type widening: all integer types can be widen; REAL can be widen to DOUBLE; there is no conversion between floating point types and integer types

Check out TableEvolutionFuzzer to see some examples (ideally we want enable it for Parquet as well): https://github.com/facebookincubator/velox/blob/main/velox/exec/tests/TableEvolutionFuzzerTest.cpp#L140

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Struct field rename PR is here #5962
We need an overall design for schema evolution.

In this PR, we should at least throw a reasonable unsupported error instead of SEGV.

You should add individual checks inside convertType() for each converted type.

Let's do this as a starting point. We can error out for unsupported schema evolution.

@Yuhta Yuhta removed the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Feb 17, 2025
@facebook-github-bot
Copy link
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SEGV in reading parquet data if column type mismatch occurs
4 participants