Delta Lake tables are very slow with DuckDB, faster with DataFusion, break with Polars for 1 billion rows #6771
lostmygithubaccount started this conversation in General
I wanted to generate 1 billion rows of data and do some comparisons between backends. I ended up with a script like this to generate the data:
resulting in a decent amount of data (larger than RAM):
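The on-disk size listing isn't shown, but checking it is easy to sketch with the standard library alone (a helper I'm adding for illustration; Parquet and Delta Lake tables are both directories-or-files of Parquet parts, so a recursive file-size sum works for either):

```python
# Stdlib-only helper (not from the original post) to check the on-disk
# size of a dataset, e.g. a Parquet file or a Delta Lake table directory.
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all files under `path`, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Comparing this number against available RAM is a quick way to confirm the "larger than RAM" claim before benchmarking.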
Interesting behavior observed between DuckDB, DataFusion, and Polars, and between Parquet and Delta Lake. To summarize:

- DuckDB: fast for `read_parquet` and very slow for `read_delta`
- DataFusion: slower for `read_parquet` but much faster for `read_delta`
- Polars: fails on both
(goofy timing in the code below was due to issues with `%time`)

DuckDB:
DataFusion:
Polars (both fail after tens of seconds):
I'm not really sure what to make of this, but figured I'd document it here.
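As an aside on the `%time` issues mentioned above: a stdlib-only alternative is to wrap the call with `time.perf_counter`, which avoids any IPython magic quirks. A minimal sketch (with `sum` as a toy stand-in for the actual `read_parquet`/`read_delta` calls):

```python
# Stdlib-only timing sketch as an alternative to IPython's %time.
import time


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Toy stand-in for a backend read call.
total, seconds = timed(sum, range(1_000_000))
print(f"elapsed: {seconds:.3f}s")
```

`perf_counter` is monotonic and high-resolution, so it's well suited to wall-clock benchmarking of single calls like these reads.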