File upload #758

Open, wants to merge 1 commit into main

dreadatour
Contributor

@dreadatour dreadatour commented Dec 28, 2024

Implementing file upload.

Usage example:

import io
from typing import Iterator

import imageio
import imageio.v3 as iio
from ultralytics import YOLO

from datachain import DataChain, File
from datachain.catalog import get_catalog
from datachain.client import Client
from datachain.model.ultralytics import YoloPoses


def pose_estimation(client: Client, yolo: YOLO, file: File) -> Iterator[tuple[File, int, float, YoloPoses]]:
    stem = file.get_file_stem()
    ext = file.get_file_ext()

    reader = imageio.get_reader(io.BytesIO(file.read()), format=ext)
    fps = reader.get_meta_data()["fps"]

    for frame, img in enumerate(reader):
        filename = f"{stem}_{frame:06d}.jpg"
        img_encoded = iio.imwrite("<bytes>", img, extension=".jpeg")
        f = client.upload(filename, img_encoded)

        timestamp = frame / fps
        results = yolo(img)

        yield f, frame, timestamp, YoloPoses.from_results(results)


(
    DataChain.from_dataset("videos")
        .limit(1)
        .setup(
            client=lambda: get_catalog().get_client("gs://bucket/videos/frames"),
            yolo=lambda: YOLO("yolo11n-pose.pt"),
        )
        .gen(pose_estimation, output=("file", "frame", "timestamp", "poses"))
        .save("videos-frames-poses")
)

@dreadatour dreadatour self-assigned this Dec 28, 2024
@dreadatour dreadatour marked this pull request as draft December 28, 2024 18:11

cloudflare-workers-and-pages bot commented Dec 28, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: ec29c5d
Status: ✅  Deploy successful!
Preview URL: https://225dbe30.datachain-documentation.pages.dev
Branch Preview URL: https://file-upload.datachain-documentation.pages.dev


codecov bot commented Dec 28, 2024

Codecov Report

Attention: Patch coverage is 20.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 87.24%. Comparing base (cf05881) to head (ec29c5d).
Report is 5 commits behind head on main.

Files with missing lines | Patch % | Lines
src/datachain/client/fsspec.py | 20.00% | 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #758      +/-   ##
==========================================
- Coverage   87.24%   87.24%   -0.01%     
==========================================
  Files         116      116              
  Lines       11018    11023       +5     
  Branches     1511     1511              
==========================================
+ Hits         9613     9617       +4     
- Misses       1028     1030       +2     
+ Partials      377      376       -1     
Flag | Coverage Δ
datachain | 87.18% <20.00%> (-0.01%) ⬇️


@dreadatour dreadatour requested a review from a team December 29, 2024 01:23
@dreadatour dreadatour marked this pull request as ready for review December 29, 2024 01:23
@shcheklein shcheklein mentioned this pull request Dec 29, 2024
@dreadatour dreadatour changed the title File upload (WIP) File upload Dec 30, 2024
@dreadatour dreadatour linked an issue Dec 30, 2024 that may be closed by this pull request
@skshetry
Member

Is Client part of an API?

@dreadatour
Contributor Author

Is Client part of an API?

Not really. I set it up via:

setup(
    client=lambda: get_catalog().get_client("gs://bucket/videos/frames"),
)

(see example in PR description).

Simplifying the API is still an open question (maybe by making upload a part of File?).

@@ -364,6 +364,12 @@ def open_object(
        assert not file.location
        return FileWrapper(self.fs.open(self.get_full_path(file.path)), cb)  # type: ignore[return-value]

    def upload(self, path: str, data: bytes) -> "File":
        full_path = self.get_full_path(path)
        self.fs.pipe_file(full_path, data)
Member

I wonder if we can make fsspec return a File result object. It's unfortunate that we have to make a second call here to get the info: it's an additional API call, and it's slow.
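
To make the cost concrete, here is a minimal sketch of the write-then-lookup pattern the comment describes. It does not use the real client: `CountingFS` is a hypothetical in-memory stand-in that only borrows the `pipe_file` and `info` method names from fsspec, and the standalone `upload` function assumes the PR's method follows the write with a metadata lookup, which the visible diff does not show in full.

```python
from dataclasses import dataclass, field


@dataclass
class CountingFS:
    """Hypothetical in-memory stand-in for a filesystem; records each call made."""

    store: dict = field(default_factory=dict)
    calls: list = field(default_factory=list)

    def pipe_file(self, path: str, data: bytes) -> None:
        # The stand-in writes the whole payload and hands back no metadata.
        self.calls.append("pipe_file")
        self.store[path] = data

    def info(self, path: str) -> dict:
        # A separate metadata lookup; against a remote store this is another request.
        self.calls.append("info")
        return {"name": path, "size": len(self.store[path]), "type": "file"}


def upload(fs: CountingFS, full_path: str, data: bytes) -> dict:
    """Assumed shape of the upload: write the bytes, then fetch their metadata."""
    fs.pipe_file(full_path, data)  # call 1: the upload itself
    return fs.info(full_path)      # call 2: the extra round trip in question


fs = CountingFS()
meta = upload(fs, "bucket/videos/frames/clip_000001.jpg", b"\xff\xd8\xff")
print(meta["size"], fs.calls)  # prints: 3 ['pipe_file', 'info']
```

In the frame-extraction example from the description, this pair of calls would run once per frame, so the second lookup scales with the number of frames uploaded.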

@@ -364,6 +364,12 @@ def open_object(
        assert not file.location
        return FileWrapper(self.fs.open(self.get_full_path(file.path)), cb)  # type: ignore[return-value]

    def upload(self, path: str, data: bytes) -> "File":
Member

Why can't we add tests here? Or do you plan to rewrite it?

I know that it is a thin wrapper, but it still hits multiple APIs, path handling, etc.
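
One possible shape for such a test, as a rough sketch only: it mocks the filesystem with `unittest.mock.MagicMock` so no storage is touched. `SketchClient` is a hypothetical reduction of the client to the parts `upload` uses, its `get_full_path` prefix joining is an assumption, and it returns the raw info dict rather than a `File`, so a real test would need to target the actual class in `src/datachain/client/fsspec.py`.

```python
from unittest.mock import MagicMock


class SketchClient:
    """Hypothetical minimal client; only what upload needs."""

    def __init__(self, fs, prefix: str):
        self.fs = fs
        self.prefix = prefix

    def get_full_path(self, rel_path: str) -> str:
        # Assumed behaviour: join the storage prefix and the relative path.
        return f"{self.prefix}/{rel_path}"

    def upload(self, path: str, data: bytes) -> dict:
        full_path = self.get_full_path(path)
        self.fs.pipe_file(full_path, data)
        # Assumed follow-up lookup; the real method returns a File instead.
        return self.fs.info(full_path)


def test_upload_writes_then_reads_info():
    fs = MagicMock()
    fs.info.return_value = {"name": "bucket/frames/a.jpg", "size": 4}

    client = SketchClient(fs, "bucket/frames")
    result = client.upload("a.jpg", b"data")

    # The path is resolved once and reused for both filesystem calls.
    fs.pipe_file.assert_called_once_with("bucket/frames/a.jpg", b"data")
    fs.info.assert_called_once_with("bucket/frames/a.jpg")
    assert result["size"] == 4


test_upload_writes_then_reads_info()
```

A mock-based test like this covers path resolution and call order; it says nothing about how a real backend behaves, which would need an integration test against actual or emulated storage.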

@dmpetrov
Member

dmpetrov commented Jan 2, 2025

Created #771 as the next step for the upload.

@dmpetrov dmpetrov mentioned this pull request Jan 2, 2025
2 tasks
Successfully merging this pull request may close these issues.

Data upload
5 participants