
Expose the Polars SQL API? #818

Open
cigrainger opened this issue Jan 13, 2024 · 8 comments

@cigrainger
Member

https://crates.io/crates/polars-sql
https://docs.rs/polars-sql/0.36.2/polars_sql/fn.sql_expr.html
https://docs.pola.rs/py-polars/html/reference/sql

I'm not a huge fan of how Python Polars handles this, with explicitly registering frames etc. It feels very... Spark-ish to me.

I like the way dplyr allows you to just write sql(...). https://dbplyr.tidyverse.org/articles/sql.html

We could have Explorer.DataFrame.sql(df|lf, "statement").

I'm very open to differing opinions on this!

@josevalim
Member

My initial reaction is -1, unless we have a very strong use case for this. Explorer is meant to be backend-agnostic, and we work hard to keep it that way, but the SQL semantics here would necessarily be tied to Polars.

@cigrainger
Member Author

It definitely is a big win for OLAP use cases. I'm not sure I understand why we'd be tied to Polars SQL semantics here -- for an explorer_sql backend it can just be passed through directly. Backends don't necessarily have to support it. And it can be documented that the accepted SQL queries depend on the backend.

@josevalim
Member

In explorer_sql, SQL is how you load data into memory. Polars SQL is a language for manipulating an existing in-memory dataframe, so they sit at different levels. And if I use Polars SQL to manipulate a Polars dataframe, I won't be able to use it to manipulate an Explorer SQL dataframe.

@cigrainger
Member Author

cigrainger commented Jan 21, 2024

Ah, but that's not how I'm imagining explorer_sql -- the idea is not just to load the data into memory (in which case there's no point -- just use ADBC or whatever else), but to convert Explorer functions to SQL and run the queries on the database itself (compute) until you pull the results down (collect). dbplyr uses temporary tables; you can also materialize to actual tables.

I don't see a problem with the SQL syntax being backend dependent -- since it's passing through it would be handled by the backend anyway. This is also how dbplyr works if I'm not mistaken.

Re: compute and collect from dbplyr:

compute() stores results in a remote temporary table. collect() retrieves data into a local tibble. collapse() is slightly different: it doesn't force computation, but instead forces generation of the SQL query. This is sometimes needed to work around bugs in dplyr's SQL generation.
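The deferred model described above (build SQL lazily, run it on the database, pull results down only on collect) can be sketched in Python with a toy query builder over in-memory SQLite. Nothing here is Explorer's or dbplyr's real API; it only mimics the collapse/collect split:

```python
import sqlite3

class RemoteFrame:
    """Toy stand-in for a dbplyr-style remote table reference."""

    def __init__(self, conn, table):
        self.conn, self.table, self.filters = conn, table, []

    def filter(self, clause):
        # Deferred: record the clause, generate no SQL yet (like dplyr verbs).
        new = RemoteFrame(self.conn, self.table)
        new.filters = self.filters + [clause]
        return new

    def collapse(self):
        # Force SQL generation without running it (dbplyr's collapse()).
        where = f" WHERE {' AND '.join(self.filters)}" if self.filters else ""
        return f"SELECT * FROM {self.table}{where}"

    def collect(self):
        # Run on the database and pull rows into local memory (collect()).
        return self.conn.execute(self.collapse()).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

rf = RemoteFrame(conn, "t").filter("a >= 2")
sql_text = rf.collapse()   # "SELECT * FROM t WHERE a >= 2"
rows = rf.collect()        # [(2,), (3,)]
```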

@josevalim
Member

Interesting. That would indeed close the gap, so I am OK with going down this path, although I wonder if we should tackle Explorer SQL first to have a better idea of how it will all connect. :)

@cigrainger
Member Author

I'm cool with that! Though I'm pretty adamant that in explorer_sql this function would basically just act as a transparent 'passthrough' escape hatch, permitting you to run raw SQL on the db directly.

@billylanchantin
Contributor

I'm super +1 on SQL interop, and if either exposing the Polars SQL API or working on Explorer SQL is the path forward, then I'm behind it.

I admit I'm still unclear on:

  • the specific use cases we want to prioritize
  • what roles the different layers (Explorer's API, Explorer's specific backends, Polars SQL API, SQL itself, etc.) will play

Wrapping my head around those things is my personal barrier to +1'ing either specific approach. I'm not against either! I just don't think I have all the facts yet.

In particular, I think outlining a few specific use cases with pseudo-code would be instructive. I can't tell, for instance, how lazy frames play with this feature.

@josevalim
Member

> I don't see a problem with the SQL syntax being backend dependent -- since it's passing through it would be handled by the backend anyway. This is also how dbplyr works if I'm not mistaken.

I have just realized that we don't use exactly the same names as Polars. So if we expose polars_sql, it would start to feel very janky: for example, nil vs. NULL, and probably several others. So ultimately, I believe I am a -1 on this approach. If we want to go down this road, it may make more sense to introduce our own SQL that compiles down to dataframe operations.
