
Expose the Polars SQL API? #818

Open
cigrainger opened this issue Jan 13, 2024 · 8 comments

@cigrainger
Member

https://crates.io/crates/polars-sql
https://docs.rs/polars-sql/0.36.2/polars_sql/fn.sql_expr.html
https://docs.pola.rs/py-polars/html/reference/sql

I'm not a huge fan of how Python Polars handles this, with explicitly registering frames etc. It feels very... Spark-ish to me.

I like the way dplyr allows you to just write sql(...). https://dbplyr.tidyverse.org/articles/sql.html

We could have Explorer.DataFrame.sql(df|lf, "statement").

I'm very open to differing opinions on this!

@josevalim
Member

My initial reaction is -1, unless we have a very strong use case for this. Explorer is meant to be backend-agnostic, and we work hard to keep it that way, but the SQL semantics here would necessarily be tied to Polars.

@cigrainger
Member Author

It definitely is a big win for OLAP use cases. I'm not sure I understand why we'd be tied to Polars SQL semantics here -- for an explorer_sql backend it can just be passed through directly. Backends don't necessarily have to support it. And it can be documented that the accepted SQL queries depend on the backend.

@josevalim
Member

In explorer_sql, SQL is how you load data into memory. Polars SQL is a language for manipulating an existing in-memory dataframe, so they sit at different levels. And if I use Polars SQL to manipulate a Polars dataframe, I won't be able to use it to manipulate an Explorer SQL dataframe.

@cigrainger
Member Author

cigrainger commented Jan 21, 2024

Ah, but that's not how I'm imagining explorer_sql -- the idea is not just to load the data into memory (in which case there's no point -- just use ADBC or whatever else), but to convert Explorer functions to SQL and run the queries on the database itself (compute) until you pull the results down (collect). dbplyr uses temporary tables; you can also materialize to actual tables.

I don't see a problem with the SQL syntax being backend dependent -- since it's passing through it would be handled by the backend anyway. This is also how dbplyr works if I'm not mistaken.

Re: compute and collect from dbplyr:

compute() stores results in a remote temporary table. collect() retrieves data into a local tibble. collapse() is slightly different: it doesn't force computation, but instead forces generation of the SQL query. This is sometimes needed to work around bugs in dplyr's SQL generation.
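The deferred model described above (build SQL lazily, run it on the database, pull results down only on collect) can be sketched in Python with a toy query builder over in-memory SQLite. Nothing here is Explorer's or dbplyr's real API; it only mimics the collapse/collect split:

```python
import sqlite3

class RemoteFrame:
    """Toy stand-in for a dbplyr-style remote table reference."""

    def __init__(self, conn, table):
        self.conn, self.table, self.filters = conn, table, []

    def filter(self, clause):
        # Deferred: record the clause, generate no SQL yet (like dplyr verbs).
        new = RemoteFrame(self.conn, self.table)
        new.filters = self.filters + [clause]
        return new

    def collapse(self):
        # Force SQL generation without running it (dbplyr's collapse()).
        where = f" WHERE {' AND '.join(self.filters)}" if self.filters else ""
        return f"SELECT * FROM {self.table}{where}"

    def collect(self):
        # Run on the database and pull rows into local memory (collect()).
        return self.conn.execute(self.collapse()).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

rf = RemoteFrame(conn, "t").filter("a >= 2")
sql_text = rf.collapse()   # "SELECT * FROM t WHERE a >= 2"
rows = rf.collect()        # [(2,), (3,)]
```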

@josevalim
Member

Interesting. That would indeed close the gap, so I am OK with going down this path, although I wonder if we should tackle Explorer SQL first to have a better idea of how it will all connect. :)

@cigrainger
Member Author

I'm cool with that! Though I'm pretty adamant that in explorer_sql this function would basically just act as a transparent 'passthrough' escape hatch, permitting you to run raw SQL on the db directly.

@billylanchantin
Contributor

I'm super +1 on SQL interop, and if either exposing the Polars SQL API or working on Explorer SQL is the path forward, then I'm behind it.

I admit I'm still unclear on:

  • the specific use cases we want to prioritize
  • what roles the different layers (Explorer's API, Explorer's specific backends, Polars SQL API, SQL itself, etc.) will play

Wrapping my head around those things is my personal barrier to +1'ing either specific approach. I'm not against either! I just don't think I have all the facts yet.

In particular, I think outlining a few specific use cases with pseudo-code would be instructive. I can't tell, for instance, how lazy frames play with this feature.

@josevalim
Member

> I don't see a problem with the SQL syntax being backend dependent -- since it's passing through it would be handled by the backend anyway. This is also how dbplyr works if I'm not mistaken.

I have just realized that we don't use exactly the same names as Polars. So if we expose polars_sql, it would start to feel very janky: for example, nil vs. NULL, and probably several others. So ultimately, I believe I am a -1 on this approach. If we want to go down this road, it may make more sense to introduce our own SQL that compiles down to dataframe operations.
