
Improve performance when querying tables with many transactions in Azure #534

kyrre opened this issue Nov 25, 2024 · 3 comments

kyrre commented Nov 25, 2024

Please describe why this is necessary.

Querying tables in Azure Data Lake Storage that have a lot of transactions takes forever.

This happens because the object_store crate cannot list only the blobs that were created after the latest checkpoint; this appears to be a limitation of the Azure API.

This does not happen when using Apache Spark (Databricks). In this case the operation finishes almost instantly, but it's not clear what they do differently.

Describe the functionality you are proposing.

Apache Spark must be using some trick. Perhaps it's possible to filter on lastModified?
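To make the lastModified idea concrete, here is a minimal, self-contained Rust sketch (stdlib only; `BlobMeta`, `newer_than_checkpoint`, and the slack window are all hypothetical names, not object_store APIs) of filtering a full listing down to blobs modified at or after the latest checkpoint's timestamp:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Minimal stand-in for a listing entry (object_store calls this ObjectMeta).
struct BlobMeta {
    path: String,
    last_modified: SystemTime,
}

/// Keep only blobs modified at or after the checkpoint's timestamp.
/// A small slack window guards against clock skew between writers.
fn newer_than_checkpoint(
    listing: &[BlobMeta],
    checkpoint_time: SystemTime,
    slack: Duration,
) -> Vec<&BlobMeta> {
    let cutoff = checkpoint_time.checked_sub(slack).unwrap_or(UNIX_EPOCH);
    listing
        .iter()
        .filter(|b| b.last_modified >= cutoff)
        .collect()
}

fn main() {
    let t = |secs| UNIX_EPOCH + Duration::from_secs(secs);
    let listing = vec![
        BlobMeta { path: "_delta_log/00000000000000000000.json".into(), last_modified: t(100) },
        BlobMeta { path: "_delta_log/00000000000000000010.checkpoint.parquet".into(), last_modified: t(500) },
        BlobMeta { path: "_delta_log/00000000000000000011.json".into(), last_modified: t(600) },
    ];
    let recent = newer_than_checkpoint(&listing, t(500), Duration::from_secs(60));
    assert_eq!(recent.len(), 2);
    for b in &recent {
        println!("{}", b.path);
    }
}
```

Note this only saves time if the store can apply the timestamp predicate server-side; filtering client-side still requires paging through the whole listing.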

Additional context

This will happen whenever the table is a streaming destination, so it's very unfortunate.

@kyrre kyrre added the enhancement New feature or request label Nov 25, 2024
zachschuermann (Collaborator) commented:

Hi @kyrre! Thanks for opening this issue! One thing that could help us move forward on this front is to have a reproducible example (e.g. a table in Azure that we can observe takes 5s to read with delta-spark and 30s to read with kernel). Do you have a chance to look into that?

Also, yeah, I recall there being some limitations with ADLS listing; it would be useful to document them here (along with any of those limitations that are exposed via object_store).

scovich (Collaborator) commented Nov 25, 2024

Yes, IIRC ADLS listing API is very limited compared to S3 or GCS. In particular, you can't specify a lower bound on the listing, so it just lists the whole directory every time. Given that this doesn't seem to be a problem in spark, we should check what the ADLS hadoop client is doing. If it has some clever workaround, we should file an issue upstream for object_store to incorporate that same trick.
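To illustrate why a listing lower bound matters: Delta log file names sort lexicographically by version, so a store that supports a "start after this key" parameter (as S3's ListObjectsV2 does with StartAfter) can skip everything up to the latest checkpoint server-side. This stdlib-only Rust sketch simulates that pruning; the `list_with_offset` function here is an illustrative simulation, not the object_store API itself:

```rust
/// Simulated server-side pruning: only keys strictly after `offset` are returned.
/// A real store with this capability never transfers the skipped keys at all.
fn list_with_offset<'a>(keys: &'a [&'a str], offset: &str) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| *k > offset).collect()
}

fn main() {
    let keys = [
        "_delta_log/00000000000000000000.json",
        "_delta_log/00000000000000000010.checkpoint.parquet",
        "_delta_log/00000000000000000011.json",
        "_delta_log/00000000000000000012.json",
    ];
    // With an offset at the checkpoint, only post-checkpoint commits come back.
    let tail = list_with_offset(&keys, "_delta_log/00000000000000000010.checkpoint.parquet");
    assert_eq!(
        tail,
        [
            "_delta_log/00000000000000000011.json",
            "_delta_log/00000000000000000012.json",
        ]
    );
    println!("{} commits after checkpoint", tail.len());
}
```

If ADLS has no equivalent parameter, a client must page through the entire `_delta_log` directory and discard most of it, which is consistent with the slowdown reported above.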

Unfortunately, I don't know that there's a lot we could do from the kernel side, if object_store doesn't make this efficient -- kernel is not generally in the business of solving cloud store API issues.

sherlockbeard (Contributor) commented:

delta-io/delta#1568 is the JVM version of this ticket.
