Segmenting the interface #177

Open
asinghvi17 opened this issue Aug 27, 2024 · 3 comments

Comments

@asinghvi17

asinghvi17 commented Aug 27, 2024

There seem to be four principal modes with which people access data in files:

  1. Reading (this is pure reading, maybe you want to read some specific byte range or read incrementally)
  2. Writing (specifically writing to a single file)
  3. Examining file structure (ls, joinpath, du, etc)
  4. Manipulating structure (mv, tempdir, etc)

As far as I can tell, the FilePathsBase API doesn't currently make a formal distinction between these four modes. Would it make sense to do so?

This way, things like HTTP paths can simply opt in to the pure-reading interface, whereas a local path could also implement the writing and manipulating interfaces. We can then also have nice interface tests for that, and it would probably make it conceptually easier to implement random filesystems, like zip files (which wouldn't support a tempdir, for example).
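
One way the opt-in idea above could look is a trait-style capability check. This is only a minimal sketch: the names `Capability`, `Readable`, `Writable`, `Listable`, `Manipulable`, `supports`, `LocalPathSketch` and `HTTPPathSketch` are all hypothetical and are not part of FilePathsBase.

```julia
# Hypothetical capability traits, one per access mode listed above.
abstract type Capability end
struct Readable <: Capability end     # 1. reading
struct Writable <: Capability end     # 2. writing
struct Listable <: Capability end     # 3. examining structure
struct Manipulable <: Capability end  # 4. manipulating structure

# Hypothetical path types, used only for illustration.
struct LocalPathSketch
    path::String
end
struct HTTPPathSketch
    url::String
end

# By default a path type supports nothing; each type opts in explicitly.
supports(::Type, ::Capability) = false
supports(::Type{LocalPathSketch}, ::Capability) = true  # local paths opt in to everything
supports(::Type{HTTPPathSketch}, ::Readable) = true     # HTTP paths opt in to reading only

# An interface test suite could then exercise only the claimed capabilities.
function claimed_capabilities(T::Type)
    all_caps = (Readable(), Writable(), Listable(), Manipulable())
    return [nameof(typeof(c)) for c in all_caps if supports(T, c)]
end
```

With this layout, a generic test harness could call `claimed_capabilities(T)` and run the reading tests for `HTTPPathSketch` while skipping the writing and manipulation tests.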

@rofinn
Owner

rofinn commented Sep 2, 2024

Interesting. I hadn't thought of splitting it up like this, but that might work nicely with some preliminary work @ExpandingMan was doing on supporting more of a key-value store interface.

#159

Basically, right now we're assuming a filesystem interface includes two things:

  1. An IO interface like reading and writing objects/files
  2. A tree navigation and manipulation interface like ls, joinpath, mkdir

I think the IO interface is a strict requirement, but the tree interface could easily be a hash table or some other associative datastructure.
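
To make the hash-table idea concrete, here is a rough sketch in which the "tree" is just a `Dict` keyed by "/"-separated strings and `ls` is emulated by prefix matching. `KVStoreSketch`, `read_object`, `write_object!` and `list_children` are hypothetical names for illustration, not an existing API.

```julia
# Hypothetical key-value store: keys are "/"-separated strings, values are raw bytes.
struct KVStoreSketch
    objects::Dict{String,Vector{UInt8}}
end

# IO interface: read and write whole objects.
read_object(s::KVStoreSketch, key::String) = s.objects[key]
function write_object!(s::KVStoreSketch, key::String, data::Vector{UInt8})
    s.objects[key] = data
    return s
end

# Tree interface emulated on top of the hash table: return the immediate
# children of a "directory" by scanning keys that share its prefix.
function list_children(s::KVStoreSketch, dir::AbstractString)
    prefix = (isempty(dir) || endswith(dir, "/")) ? dir : dir * "/"
    children = Set{String}()
    for key in keys(s.objects)
        startswith(key, prefix) || continue
        rest = SubString(key, ncodeunits(prefix) + 1)  # part of the key after the prefix
        isempty(rest) && continue
        push!(children, String(first(split(rest, "/"))))
    end
    return sort!(collect(children))
end
```

Note that "directories" here exist only implicitly, as shared key prefixes, which is roughly how S3-style stores behave.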

Perhaps tangential to this particular issue, but I think it'd be kinda cool if you could map "filesystem" operations directly to datastructure ops.

@ExpandingMan
Contributor

It's been quite a while since I looked at this. Basically I was interested in supporting S3, which is a key-value store (see AWSS3.jl). It works as is, but is more than a little hacky. There's a whole bunch of things that go horribly wrong on remote filesystems that are not necessarily related to the posted issue.

In a perfect world, I wouldn't think generalizing this package to things like key-value stores or HTTP makes much sense at all. It is built around a tree-like abstraction in which directories are nodes and I think that's fine, especially since that's how actual file systems work. The problem is that in real life, for better or worse, S3 (and now even S3-compatible key-value storage alternatives) is really important and arguably becoming even more important. So maybe it's worth doing? I don't know. I still see the S3 use case as important; HTTP is probably stretching it way further.

@asinghvi17
Author

FWIW, I have use cases in S3, HTTP, and S3 derivatives (Google Cloud, MinIO, etc). The main things I want to do are:

  • ls on a "directory". This may involve caching results or constructing a key-value store locally.
  • readbytes(path, start, stop) to read a byte range from a remote file.
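
For reference, a byte-range read along the lines of the `readbytes(path, start, stop)` call above might look like the following for a local file. This is a sketch under assumptions: the thread does not specify the indexing convention, so `start` and `stop` are taken to be 1-based and inclusive here, and `http_range_header` is a hypothetical helper showing how the same range would map onto an HTTP `Range` header (which uses zero-based, inclusive offsets).

```julia
# Sketch of a byte-range read for a local file.
# Assumption: start and stop are 1-based and inclusive.
function readbytes(path::AbstractString, start::Integer, stop::Integer)
    1 <= start <= stop || throw(ArgumentError("invalid byte range $start:$stop"))
    open(path, "r") do io
        seek(io, start - 1)                # seek takes a zero-based offset
        return read(io, stop - start + 1)  # reads at most this many bytes
    end
end

# Hypothetical helper: the equivalent HTTP Range header for a remote read.
http_range_header(start::Integer, stop::Integer) =
    "Range" => "bytes=$(start - 1)-$(stop - 1)"
```

A remote implementation would pass the header from `http_range_header` to whatever HTTP client is in use instead of seeking in a local file.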

I'm not so concerned about e.g. write or tempdir as such, at least not for some of the more exotic stores like HTTP.

The idea about HTTPPath is mostly to support a similar interface to read from HTTP "file stores", which are somehow quite common in large geospatial datasets. In that case I might not have ls, for example, implemented. I also need to get all of this to work with a "ReferenceFileSystem", which is basically an in-memory fake filesystem that can have data either inline as bytes, or as a combination of [filepath, start_byte_index, stop_byte_index].
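
A rough sketch of what such a reference filesystem could look like in memory follows. Every name here (`ByteRangeRef`, `ReferenceFileSystemSketch`, `resolve`, `read_entry`) is hypothetical, the referenced files are assumed to be local for simplicity, and the byte indices are assumed to be 1-based and inclusive.

```julia
# Hypothetical reference entry: a byte range inside some other file.
struct ByteRangeRef
    filepath::String
    start::Int  # assumed 1-based, inclusive
    stop::Int   # assumed inclusive
end

# An entry is either inline bytes or a reference to a byte range elsewhere.
const EntrySketch = Union{Vector{UInt8},ByteRangeRef}

struct ReferenceFileSystemSketch
    entries::Dict{String,EntrySketch}
end

# Inline data is returned as-is; references are resolved by a ranged read.
resolve(data::Vector{UInt8}) = data
function resolve(ref::ByteRangeRef)
    open(ref.filepath, "r") do io
        seek(io, ref.start - 1)  # seek takes a zero-based offset
        return read(io, ref.stop - ref.start + 1)
    end
end

read_entry(fs::ReferenceFileSystemSketch, key::AbstractString) = resolve(fs.entries[key])
```

The caller only ever sees `read_entry`, so whether a key is backed by inline bytes or by a range in another file stays an internal detail.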

Overall the idea is to make getting data from arbitrary filesystem-like data stores painless and easy.
