
prefetch: use a separate temporary cache for prefetching #730

Open · wants to merge 6 commits into main
Conversation

@skshetry (Member) commented Dec 23, 2024

This PR uses a separate temporary cache for prefetching, located in a .datachain/tmp/prefetch-<random> directory, when prefetch= is set but cache is not.
The temporary directory is automatically deleted once prefetching finishes.

For cache=True, the cache will be reused and won't be deleted.
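The lifecycle described above can be sketched with the stdlib; `make_prefetch_cache` and `destroy_prefetch_cache` are hypothetical helpers for illustration, not DataChain's actual API:

```python
import os
import shutil
import tempfile


def make_prefetch_cache(root: str) -> str:
    """Create a unique temporary cache directory under `root`,
    mirroring the .datachain/tmp/prefetch-<random> layout."""
    os.makedirs(root, exist_ok=True)
    return tempfile.mkdtemp(prefix="prefetch-", dir=root)


def destroy_prefetch_cache(path: str) -> None:
    """Delete the temporary cache once prefetching is done."""
    shutil.rmtree(path, ignore_errors=True)


# Demo under the system temp dir so the sketch is self-contained.
root = os.path.join(tempfile.gettempdir(), "datachain-demo", "tmp")
cache_dir = make_prefetch_cache(root)
try:
    # ... prefetch files into cache_dir ...
    pass
finally:
    destroy_prefetch_cache(cache_dir)
```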

Please note that auto-cleanup does not work for PyTorch datasets because there is no way to invoke cleanup from the Dataset side. The DataLoader may still have cached data or rows even after the Dataset instance has finished iterating. As a result, values associated with a catalog/cache instance can outlive the Dataset instance.

One potential solution is to implement a custom dataloader or provide a user-facing API.
In this PR, I have implemented the latter. The PytorchDataset now includes a close() method, which can be used to clean up the temporary prefetch cache.

E.g.:

from contextlib import closing

dataset = dc.to_pytorch(...)
with closing(dataset):
    pass
cloudflare-workers-and-pages bot commented Dec 23, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: a516de8
Status: ✅ Deploy successful!
Preview URL: https://7d2ddfe7.datachain-documentation.pages.dev
Branch Preview URL: https://prefetch-cache.datachain-documentation.pages.dev



codecov bot commented Dec 24, 2024

Codecov Report

Attention: Patch coverage is 92.64706% with 10 lines in your changes missing coverage. Please review.

Project coverage is 87.39%. Comparing base (8dfa4ff) to head (a516de8).

Files with missing lines        Patch %   Lines
src/datachain/lib/file.py       60.00%    2 Missing and 2 partials ⚠️
src/datachain/progress.py       78.57%    3 Missing ⚠️
src/datachain/lib/pytorch.py    93.75%    1 Missing and 1 partial ⚠️
src/datachain/cache.py          95.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #730      +/-   ##
==========================================
+ Coverage   87.33%   87.39%   +0.06%     
==========================================
  Files         116      116              
  Lines       11147    11217      +70     
  Branches     1532     1536       +4     
==========================================
+ Hits         9735     9803      +68     
  Misses       1032     1032              
- Partials      380      382       +2     
Flag        Coverage Δ
datachain   87.33% <92.64%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -179,6 +180,7 @@ def iterate(self, timeout=None) -> Generator[ResultT, None, None]:
self.shutdown_producer()
if not async_run.done():
async_run.cancel()
wait([async_run])
skshetry (Member, Author) commented:

.cancel() does not immediately cancel the underlying asyncio task.

We could call .result() to wait for the future, but that does not seem to work for a cancelled future returned by run_coroutine_threadsafe(). See python/cpython#105836.

So I have added wait(...), which does seem to wait for the cancelled future and for the underlying asyncio task.

Alternatively, we could add an asyncio.Event and wait for it.
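The cancel-then-wait pattern can be reproduced with the stdlib alone; this is a simplified sketch of the behavior on CPython, not the PR's exact code:

```python
import asyncio
import threading
import time
from concurrent.futures import wait

# Event loop running in a background thread, as when
# run_coroutine_threadsafe() is used from sync code.
loop = asyncio.new_event_loop()
thread = threading.Thread(target=loop.run_forever, daemon=True)
thread.start()

task_cancelled = threading.Event()


async def long_task():
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        task_cancelled.set()  # proof the asyncio task really unwound
        raise


async_run = asyncio.run_coroutine_threadsafe(long_task(), loop)
time.sleep(0.1)  # let the task start

# .cancel() only *requests* cancellation; wait() blocks until the
# future settles, i.e. until the underlying asyncio task finishes
# unwinding in the loop thread.
if not async_run.done():
    async_run.cancel()
wait([async_run])

loop.call_soon_threadsafe(loop.stop)
thread.join()
```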

src/datachain/asyn.py (outdated comment, resolved)
@skshetry skshetry marked this pull request as ready for review December 31, 2024 16:42
@skshetry skshetry requested a review from a team December 31, 2024 16:42
Comment on lines 285 to 280
if client.protocol == HfClient.protocol:
self._set_stream(catalog, self._caching_enabled, download_cb=download_cb)
return False
skshetry (Member, Author) commented:

Prefetch is disabled for Hugging Face. See #746.

Comment on lines +128 to +136
if os.getenv("DATACHAIN_SHOW_PREFETCH_PROGRESS"):
download_cb = get_download_callback(
f"{total_rank}/{total_workers}", position=total_rank
)
skshetry (Member, Author) commented:

This shows a prefetch download progress bar for each worker, which is useful for debugging.

We cannot enable this by default, as multiple worker processes rendering bars at once would clobber the user's own progress bar.
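The env-gated, per-worker callback idea can be sketched as follows; the class and function names are hypothetical stand-ins, not datachain's actual API:

```python
import os


class NoopCallback:
    """Swallows progress updates when per-worker progress is disabled."""

    def relative_update(self, inc: int = 1) -> None:
        pass


class WorkerProgressCallback(NoopCallback):
    """Tracks per-worker download progress; a real implementation
    would render a tqdm bar at the given terminal position."""

    def __init__(self, desc: str, position: int = 0):
        self.desc = desc
        self.position = position
        self.done = 0

    def relative_update(self, inc: int = 1) -> None:
        self.done += inc


def get_prefetch_callback(total_rank: int, total_workers: int):
    # Off by default: several workers drawing bars at once would
    # clobber the user's own progress bar under multiprocessing.
    if os.getenv("DATACHAIN_SHOW_PREFETCH_PROGRESS"):
        return WorkerProgressCallback(
            f"{total_rank}/{total_workers}", position=total_rank
        )
    return NoopCallback()
```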

pass


class TqdmCombinedDownloadCallback(CombinedDownloadCallback, TqdmCallback):
skshetry (Member, Author) commented:

I have modified the callback to also show file counts during prefetching.
However, this will not show up for PyTorch.

E.g.:

Download: 1.03MB [00:01, 605kB/s, 50 files]
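The combined-callback idea can be modeled with a stdlib-only stand-in (hypothetical class, no tqdm rendering) that tracks bytes and a "N files" postfix together:

```python
class CombinedStats:
    """Tracks bytes and completed files, like the combined download
    callback's '50 files' postfix (simplified sketch)."""

    def __init__(self):
        self.bytes_done = 0
        self.files_done = 0

    def relative_update(self, inc: int = 1) -> None:
        # Byte-level progress, as fsspec-style callbacks report it.
        self.bytes_done += inc

    def increment_file_count(self, inc: int = 1) -> None:
        # Called once per fully downloaded file.
        self.files_done += inc

    def postfix(self) -> str:
        return f"{self.files_done} files"


stats = CombinedStats()
for size in (512, 1024):  # two downloaded "files"
    stats.relative_update(size)
    stats.increment_file_count()
```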
