Wrong result when .with_row_index() is used with .collect(streaming=True)
#20694
This works correctly in the new streaming engine. I'm not sure it's worth the time to fix this in the old streaming engine if it will be replaced soon.
I just found out that this problem is not exclusive to `.with_row_index()`. See the following:

import polars as pl
print(f"polars version: {pl.__version__}")
# define simple Dataframe
d = pl.DataFrame({"a": range(4)}, schema={"a": pl.Float64})
print(d)
# Make lazy
d = d.lazy()
# Split the data
d1 = d.filter(pl.col("a") >= 2)
d2 = d.filter(pl.col("a") < 2)
# make a processing with aliasing
d2 = (
    d2
    .with_columns(pl.col("a").alias("index"))
    .with_columns(pl.mean("index"))  # <- The problem here is not limited to the `mean` function!
)
# Make d1 fit d2
d1 = d1.with_columns(pl.lit(0, dtype=pl.Float64).alias("index"))
# combine
combined = pl.concat(
    [d2, d1],
    how="vertical",
)
# Show Plans
print("\nWith row idx:\n", combined.explain())
# See different results for Streaming
print(f"\nStreaming=False: {combined.collect(streaming=False)}")
print(f"Streaming=True: {combined.collect(streaming=True)}")  # <--- This result is wrong

Results in the following output:

This clearly should not happen, and currently I cannot use or trust the streaming mode.
This problem is even more general than described above. I opened another issue, #20833, and am closing this one.
Checks
Reproducible example
Log output
Issue description
When using `.with_row_index()` on a subset of a LazyFrame, it yields wrong results after `concat()` when using `streaming=True`. The part of the data where the index is applied is simply not used for concatenation.

With `streaming=False` everything works as expected.

Also, when replacing the `.with_row_index("row_id").drop("row_id"),` part with `.with_columns(pl.lit([9,9]).alias('row_id')).drop("row_id"),` everything works fine. So I guess it is really related to the `with_row_index()` function.

Expected behavior
There should not be any difference between `streaming=True` and `streaming=False`. At least I did not find anything in the docs regarding this.
Installed versions
----Optional dependencies----
adbc_driver_manager
altair
azure.identity
boto3 1.35.7
cloudpickle 3.0.0
connectorx
deltalake
fastexcel 0.11.6
fsspec 2023.12.2
gevent
google.auth 2.34.0
great_tables
matplotlib 3.9.2
nest_asyncio 1.6.0
numpy 1.24.4
openpyxl 3.1.5
pandas 2.2.2
pyarrow 13.0.0
pydantic
pyiceberg
sqlalchemy 2.0.32
torch 2.4.0+cpu
xlsx2csv 0.8.3
xlsxwriter 3.2.0