
Improve performance when querying tables with many transactions in Azure #534

kyrre opened this issue Nov 25, 2024 · 3 comments

kyrre commented Nov 25, 2024

Please describe why this is necessary.

Querying tables in Azure Data Lake Storage that have a lot of transactions takes forever.

This happens because the object_store crate cannot list only the blobs that were created after the latest checkpoint; this appears to be a limitation of the Azure API.

This does not happen when using Apache Spark (Databricks). In this case the operation finishes almost instantly, but it's not clear what they do differently.

Describe the functionality you are proposing.

Apache Spark must be using some trick. Perhaps it's possible to filter on lastModified?
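To make the lastModified idea concrete, here is a minimal, self-contained Rust sketch (stdlib only; `BlobMeta`, `newer_than_checkpoint`, and the slack window are all hypothetical names, not object_store APIs) of filtering a full listing down to blobs modified at or after the latest checkpoint's timestamp:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Minimal stand-in for a listing entry (object_store calls this ObjectMeta).
struct BlobMeta {
    path: String,
    last_modified: SystemTime,
}

/// Keep only blobs modified at or after the checkpoint's timestamp.
/// A small slack window guards against clock skew between writers.
fn newer_than_checkpoint(
    listing: &[BlobMeta],
    checkpoint_time: SystemTime,
    slack: Duration,
) -> Vec<&BlobMeta> {
    let cutoff = checkpoint_time.checked_sub(slack).unwrap_or(UNIX_EPOCH);
    listing
        .iter()
        .filter(|b| b.last_modified >= cutoff)
        .collect()
}

fn main() {
    let t = |secs| UNIX_EPOCH + Duration::from_secs(secs);
    let listing = vec![
        BlobMeta { path: "_delta_log/00000000000000000000.json".into(), last_modified: t(100) },
        BlobMeta { path: "_delta_log/00000000000000000010.checkpoint.parquet".into(), last_modified: t(500) },
        BlobMeta { path: "_delta_log/00000000000000000011.json".into(), last_modified: t(600) },
    ];
    let recent = newer_than_checkpoint(&listing, t(500), Duration::from_secs(60));
    assert_eq!(recent.len(), 2);
    for b in &recent {
        println!("{}", b.path);
    }
}
```

Note this only saves time if the store can apply the timestamp predicate server-side; filtering client-side still requires paging through the whole listing.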

Additional context

This will happen whenever the table is a streaming destination, so it's very unfortunate.

@kyrre kyrre added the enhancement New feature or request label Nov 25, 2024
zachschuermann (Collaborator) commented:

Hi @kyrre! Thanks for opening this issue! One thing that could help us move forward on this front is to have a reproducible example (e.g. a table in Azure that we can observe takes 5s to read with delta-spark and 30s to read with kernel). Do you have a chance to look into that?

Also, yeah, I recall there being some limitations with ADLS listing; it would be useful to document them here (along with any of those limitations that are exposed via object_store).

scovich (Collaborator) commented Nov 25, 2024

Yes, IIRC ADLS listing API is very limited compared to S3 or GCS. In particular, you can't specify a lower bound on the listing, so it just lists the whole directory every time. Given that this doesn't seem to be a problem in spark, we should check what the ADLS hadoop client is doing. If it has some clever workaround, we should file an issue upstream for object_store to incorporate that same trick.
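To illustrate why a listing lower bound matters: Delta log file names sort lexicographically by version, so a store that supports a "start after this key" parameter (as S3's ListObjectsV2 does with StartAfter) can skip everything up to the latest checkpoint server-side. This stdlib-only Rust sketch simulates that pruning; the `list_with_offset` function here is an illustrative simulation, not the object_store API itself:

```rust
/// Simulated server-side pruning: only keys strictly after `offset` are returned.
/// A real store with this capability never transfers the skipped keys at all.
fn list_with_offset<'a>(keys: &'a [&'a str], offset: &str) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| *k > offset).collect()
}

fn main() {
    let keys = [
        "_delta_log/00000000000000000000.json",
        "_delta_log/00000000000000000010.checkpoint.parquet",
        "_delta_log/00000000000000000011.json",
        "_delta_log/00000000000000000012.json",
    ];
    // With an offset at the checkpoint, only post-checkpoint commits come back.
    let tail = list_with_offset(&keys, "_delta_log/00000000000000000010.checkpoint.parquet");
    assert_eq!(
        tail,
        [
            "_delta_log/00000000000000000011.json",
            "_delta_log/00000000000000000012.json",
        ]
    );
    println!("{} commits after checkpoint", tail.len());
}
```

If ADLS has no equivalent parameter, a client must page through the entire `_delta_log` directory and discard most of it, which is consistent with the slowdown reported above.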

Unfortunately, I don't know that there's a lot we could do from the kernel side, if object_store doesn't make this efficient -- kernel is not generally in the business of solving cloud store API issues.

sherlockbeard (Contributor) commented:

delta-io/delta#1568 is the JVM version of this ticket.
