Empty-named files / directories need special care #762

Closed
dmpetrov opened this issue Dec 30, 2024 · 1 comment · Fixed by #767
Labels
bug Something isn't working

Comments

@dmpetrov
Member

Description

There is a bucket that contains a directory, examples, which holds files. It looks like this directory was created in the cloud console UI: as far as I can see, the bucket has an empty-named record for it. This is a common way of creating empty directories in such storage systems.

To handle this correctly, we need to treat these empty-named files specially: they should look like directories. This would prevent issues when working with files and folders in the bucket.
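
One way to read the proposal above is as a filter over a storage listing. The following is a minimal sketch, not DataChain's implementation: Entry, is_dir_placeholder, and split_listing are hypothetical names, and it assumes a placeholder is a zero-byte object whose key is empty or ends with a slash.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical listing record: object key and size in bytes."""

    path: str
    size: int


def is_dir_placeholder(entry: Entry) -> bool:
    # Assumption: a directory marker is a zero-byte object whose key
    # is empty or ends with "/". Real backends may differ.
    return entry.size == 0 and (entry.path == "" or entry.path.endswith("/"))


def split_listing(entries):
    """Separate real files from directory placeholders."""
    files = [e for e in entries if not is_dir_placeholder(e)]
    dirs = [e.path.rstrip("/") for e in entries if is_dir_placeholder(e)]
    return files, dirs


listing = [
    Entry("examples/", 0),              # placeholder created by a console UI
    Entry("examples/videos/a.mp4", 10),  # regular file
    Entry("examples/empty.txt", 0),      # genuinely empty file, still a file
]
files, dirs = split_listing(listing)
```

Note that a zero size alone is not enough to classify an entry as a directory, since regular files can also be empty.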

Issue 1. When I list this directory without a trailing slash, I see only this empty-named record instead of the files inside it.

from datachain import DataChain

dc = DataChain.from_storage("gs://mybucket/examples").save("myds")
dc.show()
print("Files: ", dc.count())
                         file      file file              file              file  \
                       source      path size           version              etag
0  gs://mybucket  examples    0  1735506238856683  COu7/cbwzYoDEAE=
       file                             file     file
  is_latest                    last_modified location
0         1 2024-12-29 21:03:58.859000+00:00     None
Files:  1

The result is as expected if I list it as a directory, with a trailing slash:

from datachain import DataChain

dc = DataChain.from_storage("gs://mybucket/examples/").save("myds")
dc.show()
print("Files: ", dc.count())
                          file                                               file  \
                        source                                               path
0   gs://mybucket  examples/videos/HoldingPen_...
1   gs://mybucket  examples/videos/HoldingPen_...
...
Files:  171
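
Until this is handled in the library, a possible workaround is to normalize the URI so that it always ends with a slash before listing. A small sketch, assuming the URI points at a bucket or a prefix inside it; as_dir_uri is a hypothetical helper, not part of DataChain:

```python
def as_dir_uri(uri: str) -> str:
    """Return the URI with exactly one trailing slash.

    Hypothetical helper: intended for bucket or prefix URIs such as
    "gs://mybucket/examples", not for a bare scheme like "gs://".
    """
    return uri.rstrip("/") + "/"


# Intended use, following the calls shown in this issue:
#   DataChain.from_storage(as_dir_uri("gs://mybucket/examples"))
normalized = as_dir_uri("gs://mybucket/examples")
```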

Issue 2. System signals seem to be broken for this special file / directory:

from datachain import DataChain

dc = DataChain.from_storage("gs://mybucket/examples").save("myds")
dc.shuffle().show()
datachain.lib.signal_schema.SignalResolvingError: cannot resolve signal name 'sys.rand': is not found

Version Info

0.8.3
Python 3.10.8
@dmpetrov dmpetrov added the bug Something isn't working label Dec 30, 2024
@dmpetrov dmpetrov changed the title from "Empty-named files / directories needs a special care" to "Empty-named files / directories need special care" Dec 30, 2024
@shcheklein
Member

Relevant link: https://cloud.google.com/storage/docs/objects#simulated-folders

These directories were indeed created using the console UI.
