feat: Filter Parquet pages with ParquetColumnExpr #20714

Merged: 14 commits, Jan 27, 2025

coastalwhite
Collaborator

@coastalwhite coastalwhite commented Jan 14, 2025

This PR adds a ParquetColumnExpr, which allows predicates to be applied while Parquet pages are being read. The current implementation has many limitations, but it can eventually allow much more granular filtering of items without having to traverse all pages. This is especially beneficial for equality predicates and for predicates over dictionary-encoded pages. Another nice side effect is that it should massively reduce memory consumption for queries with strict predicates.
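To illustrate why dictionary-encoded pages benefit in particular, here is a minimal pure-Python sketch of the general idea. It is not the code from this PR: `filter_dictionary_page` and its arguments are hypothetical names, and a real Parquet reader works on encoded buffers rather than Python lists. The point is only that the predicate is evaluated once per distinct dictionary entry, and rows that fail it are never materialized.

```python
from typing import Callable, List, Sequence


def filter_dictionary_page(
    dictionary: Sequence[str],
    indices: Sequence[int],
    predicate: Callable[[str], bool],
) -> List[str]:
    """Return the values of a dictionary-encoded page that satisfy `predicate`.

    Hypothetical helper for illustration only.
    """
    # Evaluate the predicate once per distinct dictionary entry,
    # not once per row.
    keep = [predicate(value) for value in dictionary]
    # Materialize only the rows whose dictionary entry passed.
    return [dictionary[i] for i in indices if keep[i]]


dictionary = ["apple", "banana", "cherry"]
indices = [0, 2, 1, 1, 0, 2, 1]  # one dictionary index per row
matches = filter_dictionary_page(dictionary, indices, lambda v: v == "banana")
print(matches)  # ['banana', 'banana', 'banana']
```

With three dictionary entries and seven rows, the predicate runs three times instead of seven, and only the three matching rows are ever built.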

At the moment, this is only triggered if there is a single binary expression with a column on one side and a scalar on the other side.
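The restriction described above can be pictured as a shape check on the predicate. The sketch below uses a toy expression tree; `Column`, `Scalar`, `BinaryExpr`, and `is_pushdown_candidate` are all hypothetical names invented for this illustration and do not correspond to Polars internals.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Scalar:
    value: object


@dataclass(frozen=True)
class BinaryExpr:
    left: "Expr"
    op: str
    right: "Expr"


Expr = Union[Column, Scalar, BinaryExpr]


def is_pushdown_candidate(expr: Expr) -> bool:
    """True for a single binary expression with a column on one side
    and a scalar on the other (in either order)."""
    if not isinstance(expr, BinaryExpr):
        return False
    sides = (expr.left, expr.right)
    has_column = any(isinstance(side, Column) for side in sides)
    has_scalar = any(isinstance(side, Scalar) for side in sides)
    # With exactly two sides, this means one column and one scalar.
    return has_column and has_scalar


print(is_pushdown_candidate(BinaryExpr(Column("a"), "==", Scalar(1))))      # True
print(is_pushdown_candidate(BinaryExpr(Column("a"), "==", Column("b"))))    # False
```

Under this check, a compound predicate such as `(a == 1) & (b == 2)` would not qualify, because neither side of the outer expression is a plain column or scalar.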

It can be enabled by setting the environment variable POLARS_PARQUET_EXPR=1.
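A hedged usage sketch from Python: the environment variable name comes from this PR, while the helper `count_matches`, the file path, and the column name are hypothetical. Setting the variable before importing Polars is a conservative assumption, since this PR does not state when the variable is read.

```python
import os

# Opt in to the experimental page-level filtering described in this PR.
# Set before importing Polars (assumption: safe regardless of when it is read).
os.environ["POLARS_PARQUET_EXPR"] = "1"


def count_matches(path: str, column: str, value: int) -> int:
    """Count rows where `column == value` using a lazy Parquet scan.

    Hypothetical helper; the predicate is a single column-versus-scalar
    comparison, which is the shape this PR says is supported.
    """
    import polars as pl  # imported after the flag is set

    lf = pl.scan_parquet(path).filter(pl.col(column) == value)
    return lf.collect().height


# Example call (assumes a file "data.parquet" with an integer column "id"):
# count_matches("data.parquet", "id", 42)
```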

@github-actions github-actions bot added the enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars), and rust (Related to Rust Polars) labels Jan 14, 2025
@coastalwhite coastalwhite added the needs-bench (Needs a benchmark run) label Jan 15, 2025
@coastalwhite coastalwhite marked this pull request as ready for review January 15, 2025 16:03
@coastalwhite coastalwhite changed the title feat: Start working with ParquetColumnExpr feat: Filter Parquet pages with ParquetColumnExpr Jan 15, 2025

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 27.80775% with 1472 lines in your changes missing coverage. Please review.

Project coverage is 79.28%. Comparing base (7ccb3ae) to head (1fecfdc).
Report is 26 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...w/read/deserialize/dictionary_encoded/predicate.rs | 0.00% | 206 Missing ⚠️ |
| ...-parquet/src/arrow/read/deserialize/binview/mod.rs | 18.51% | 110 Missing ⚠️ |
| crates/polars-io/src/predicates.rs | 0.00% | 100 Missing ⚠️ |
| ...et/src/arrow/read/deserialize/binview/predicate.rs | 0.00% | 83 Missing ⚠️ |
| ...rs-parquet/src/arrow/read/deserialize/utils/mod.rs | 34.71% | 79 Missing ⚠️ |
| ...olars-parquet/src/arrow/read/deserialize/simple.rs | 53.65% | 76 Missing ⚠️ |
| ...et/src/arrow/read/deserialize/fixed_size_binary.rs | 25.74% | 75 Missing ⚠️ |
| crates/polars-io/src/parquet/read/read_impl.rs | 52.90% | 73 Missing ⚠️ |
| crates/polars-parquet/src/arrow/read/expr.rs | 0.00% | 63 Missing ⚠️ |
| .../src/arrow/read/deserialize/primitive/plain/mod.rs | 40.40% | 59 Missing ⚠️ |
| ... and 36 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20714      +/-   ##
==========================================
- Coverage   79.73%   79.28%   -0.45%     
==========================================
  Files        1566     1578      +12     
  Lines      222591   224249    +1658     
  Branches     2572     2573       +1     
==========================================
+ Hits       177473   177799     +326     
- Misses      44526    45858    +1332     
  Partials      592      592              


@coastalwhite coastalwhite force-pushed the feat/parquet-expr branch 2 times, most recently from 582e147 to fb7158a Compare January 23, 2025 15:21
@coastalwhite coastalwhite merged commit eab0160 into pola-rs:main Jan 27, 2025
24 checks passed
@coastalwhite coastalwhite deleted the feat/parquet-expr branch January 27, 2025 08:00
@c-peters c-peters added the accepted (Ready for implementation) label Jan 27, 2025
Labels
accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), needs-bench (Needs a benchmark run), python (Related to Python Polars), rust (Related to Rust Polars)
Projects
Status: Done