Delta Lake tables are very slow with DuckDB, faster with DataFusion, break with Polars for 1 billion rows #6771

lostmygithubaccount · 2023-08-04T02:29:00Z

lostmygithubaccount
Aug 4, 2023
Maintainer

I wanted to generate 1 billion rows of data and do some comparisons between backends. I ended up with a script like this to generate the data:

import ibis

con = ibis.connect("duckdb://billion.ddb")

ROWS = 1_000_000_000

sql_str = ""
sql_str += "select\n"
for c in list(map(chr, range(ord("a"), ord("z") + 1))):
    sql_str += f"  random() as {c},\n"
sql_str += f"from generate_series(1, {ROWS})"

t = con.sql(sql_str)
con.create_table("billion", t, overwrite=True)
t.to_delta("billion.delta")

resulting in a decent amount of data (larger than RAM):

155G    billion.ddb
216G    billion.delta

interesting behavior observed between DuckDB, DataFusion, and Polars, between Parquet and Delta Lake. to summarize:

DuckDB is very fast for read_parquet and very slow for read_delta
DataFusion is a bit slower for read_parquet and much faster for read_delta
Polars breaks with both

(goofy timing in the code below was due to issues with %time)

DuckDB:

[ins] In [1]: import ibis, time

[ins] In [2]: t = ibis.read_parquet("billion.delta/*.parquet")

[ins] In [3]: t1 = time.time(); print(f"{t.count().to_pandas():,}"); t2
         ...:  = time.time(); print(t2 - t1)
1,000,000,000
0.3874969482421875

[ins] In [4]: t = ibis.read_delta("billion.delta")

[ins] In [5]: t1 = time.time(); print(f"{t.count().to_pandas():,}"); t2
         ...:  = time.time(); print(t2 - t1)
100% ▕████████████████████████████████████████████████████████████▏
1,000,000,000
196.45281982421875

DataFusion:

[ins] In [1]: import ibis, time

[ins] In [2]: ibis.set_backend("datafusion")

[ins] In [3]: t = ibis.read_parquet("billion.delta/*.parquet")

[ins] In [4]: t1 = time.time(); print(f"{t.count().to_pandas():,}"); t2
         ...:  = time.time(); print(t2 - t1)
1,000,000,000
2.9429831504821777

[ins] In [5]: t = ibis.read_delta("billion.delta")

[ins] In [6]: t1 = time.time(); print(f"{t.count().to_pandas():,}"); t2
         ...:  = time.time(); print(t2 - t1)
1,000,000,000
12.984795808792114

Polars (both fail after tens of seconds):

[ins] In [1]: import ibis, time

[ins] In [2]: ibis.set_backend("polars")

[ins] In [3]: t = ibis.read_parquet("billion.delta/*.parquet")

[ins] In [4]: t1 = time.time(); print(f"{t.count().to_pandas():,}"); t2
         ...:  = time.time(); print(t2 - t1)
zsh: killed     python -c 'import IPython; IPython.terminal.ipapp.launch_new_instance()'

[ins] In [1]: import ibis, time

[ins] In [2]: ibis.set_backend("polars")

[ins] In [3]: t = ibis.read_delta("billion.delta")

[ins] In [4]: t1 = time.time(); print(f"{t.count().to_pandas():,}"); t2
         ...:  = time.time(); print(t2 - t1)
zsh: killed     python -c 'import IPython; IPython.terminal.ipapp.launch_new_instance()'

I'm not really sure what to make of this but figured I'd document it here

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delta Lake tables are very slow with DuckDB, faster with DataFusion, break with Polars for 1 billion rows #6771

{{title}}

Replies: 0 comments

Select a reply

Delta Lake tables are very slow with DuckDB, faster with DataFusion, break with Polars for 1 billion rows #6771

lostmygithubaccount Aug 4, 2023 Maintainer

Replies: 0 comments

lostmygithubaccount
Aug 4, 2023
Maintainer