Specify storage usability #1

Merged
5 commits merged into main on Oct 5, 2024

Conversation

liamhuber
Member

What should the requirement be for something to be usable in our storage? Just like we follow concurrent.futures.Executor, I think we should follow pickle for requiring inputs. Of course, what the output should look like (hierarchical, etc.) is a separate question.

@samwaseda
Member

I didn't notice that you were editing the same file XD

I think we should follow pickle for requiring inputs

What do you mean by this?

@liamhuber
Member Author

I think we can merge yours first, then break the doc into two parts: (1) what objects are allowed to be stored? (what my PR concerns itself with) and (2) what do they look like / what can we do with them after they're stored? (what your PR concerns itself with)

I should also be more precise here. I mean to say that the back-ends should leverage __reduce__, or, at a minimum, be able to store anything that pickle would be able to store by leveraging __reduce__.
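To make that concrete, here is a minimal sketch of the __reduce__ contract a back-end could rely on (the Point class and the naive_dump helper are made up for illustration, not proposed API):

```python
import pickle


class Point:
    """Toy object; anything pickle can handle exposes its state via __reduce__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __reduce__(self):
        # pickle rebuilds the object by calling the first element with the second,
        # i.e. Point(1, 2) round-trips as (Point, (1, 2))
        return self.__class__, (self.x, self.y)


def naive_dump(obj):
    """Made-up back-end entry point: store whatever pickle can, via __reduce__."""
    constructor, args = obj.__reduce__()[:2]
    return {"constructor": constructor.__qualname__, "args": list(args)}


restored = pickle.loads(pickle.dumps(Point(1, 2)))
assert (restored.x, restored.y) == (1, 2)
print(naive_dump(Point(1, 2)))  # {'constructor': 'Point', 'args': [1, 2]}
```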

@samwaseda
Member

I should also be more precise here. I mean to say that the back-ends should leverage __reduce__, or, at a minimum, be able to store anything that pickle would be able to store by leveraging __reduce__.

ok but doesn't it mean that my definition of dict is too restrictive? Shouldn't it be replaced by __reduce__?

Actually I'm gonna merge mine anyway and we can change it.

@liamhuber
Member Author

Yes and also yes 😂

@jan-janssen
Member

Can we start a bit more abstract?

@samwaseda
Member

Can we start a bit more abstract?

I don't know what you mean. Feel free to append it to the text.

Yes, I stopped working today after my first PR but otherwise I would have added it. I guess you can safely add this part.

  • The developer should have the option to overload the default interface of the standard library to provide a serialisation specifically for pyiron. For example, in addition to __reduce__ and __setstate__, objects in pyiron_base currently implement to_dict() and from_dict() to provide a reduced representation.

I would say that's a requirement for the tools using the storage interface and not part of the specs for the storage interface.

  • We would like to support lazy loading when the developers for the specific objects implement it

It's an excellent point, but we have to properly define what lazy loading is.

  • We would like to support multiple file formats (HDF5, JSON, ...). While the developers can provide optimised storage routines, we could also provide standardised converters from the python data types to the data types of the specific file format. For example, in the case of HDF5 these conversions are defined in h5io: https://github.com/h5io/h5io/blob/main/h5io/_h5io.py#L184

YES! I guess we can pretty much take over the same definition for json? I'm ok with almost copy-pasting the conversions from h5io, or creating a separate module for json.
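To make the copy-over idea concrete, a rough sketch of a json converter in the same spirit (the function name, the TYPE tagging, and the exact type coverage here are assumptions, not copied from h5io):

```python
import json

import numpy as np


def to_json_safe(value):
    """Recursively convert common python/numpy types to JSON-storable ones,
    tagging each value with its original type so it can be restored later."""
    if isinstance(value, np.ndarray):
        return {"TYPE": "ndarray", "dtype": str(value.dtype), "data": value.tolist()}
    if isinstance(value, (list, tuple)):
        return {"TYPE": type(value).__name__, "data": [to_json_safe(v) for v in value]}
    if isinstance(value, dict):
        return {"TYPE": "dict", "data": {k: to_json_safe(v) for k, v in value.items()}}
    if isinstance(value, (str, int, float, bool)) or value is None:
        return {"TYPE": type(value).__name__, "data": value}
    raise TypeError(f"No JSON conversion defined for {type(value)}")


print(json.dumps(to_json_safe({"positions": np.zeros((2, 3)), "tag": "demo"})))
```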

Overall, I think you can open a new PR with these points or add them to this PR.

@jan-janssen
Member

Overall, I think you can open a new PR with these points or add them to this PR.

I would like to have one PR per topic; this hopefully simplifies the discussion a bit. So if Liam agrees, he can either integrate my suggestions, or he can merge his pull request and I will open a new one with the changes discussed here.

@jan-janssen
Member

I would say that's a requirement for the tools using the storage interface and not part of the specs for the storage interface.

To me it is very important that we have both definitions somewhere - maybe even both in this file, to define what we promise to software using the storage interface and what we require from software that wants to use the storage interface.

@samwaseda
Member

samwaseda commented Oct 1, 2024

To me it is very important that we have both definitions somewhere - maybe even both in this file, to define what we promise to software using the storage interface and what we require from software that wants to use the storage interface.

ok, that's true; let me reformulate my point: it sounds to me like it's just a question of either:

save(my_object)

or

save(my_object.to_dict())

Correct? I don't have a very very strong feeling, but to me it sounds like the addition of to_dict and from_dict only complicates the specs with little gain. But as I said, the addition doesn't make it extremely complex, so if you really want to have it inside the storage specs I'm probably ok with that.

@jan-janssen
Member

I am strongly in favor of the first option; the question is primarily what happens internally, and do we call it save or dump? If the object has a to_dict() function then we want to call to_dict(), or potentially even to_hdf() if an object has an optimized storage routine for a specific file format. Otherwise we just try to call __getstate__() or __reduce__(), which starting from python 3.11 is defined for every python object. If the storage via __getstate__() / __reduce__() fails, we can still use cloudpickle to get a binary representation and then store the binary string. This no longer allows us to access attributes of the stored object without loading the full object, but it is at least a fallback option.
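A rough sketch of that order of preference (the save name, the backend methods, and the exact fallback chain are placeholders for discussion, not settled API):

```python
import pickle

import cloudpickle  # third-party fallback for objects the pickle protocol cannot handle


def save(obj, backend):
    """Hypothetical front-end: try the richest representation first,
    fall back to an opaque byte string only if everything else fails."""
    if hasattr(obj, "to_hdf") and getattr(backend, "format", None) == "hdf5":
        backend.write_native(obj)          # format-optimised routine provided by the object
    elif hasattr(obj, "to_dict"):
        backend.write_dict(obj.to_dict())  # pyiron-specific reduced representation
    else:
        try:
            backend.write_reduced(obj.__reduce_ex__(pickle.HIGHEST_PROTOCOL))
        except Exception:
            # opaque: attributes can no longer be browsed without a full load
            backend.write_bytes(cloudpickle.dumps(obj))
```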

Now we have to define two interfaces: how can a developer make sure that their objects can be stored using our storage interface? This should be very simple, with minimal requirements. The second interface is: what is required to implement a new file format in our storage interface? How do we define the mapping of the data types defined by the file format to the data types supported by python? Both of these interfaces should be defined in our storage specs.

@samwaseda
Member

ok then let's add to_dict.

About dump or save: @ligerzero-ai recently pointed out that you get an answer a lot faster by writing something wrong on the internet than by trying to write the right answer. In this spirit my current strategy is to push whatever comes to my mind and let it be corrected by you guys. So don't take anything I pushed as a fait accompli. I'm equally fine with dump or save.

@samwaseda
Member

In other words: Feel free to open new PRs overwriting what I've already pushed. I don't take it personally.

@liamhuber
Member Author

I am a strong advocate of the first option, to the extent that I don't want to use the to_dict business at all. What is the advantage of having a whole second to_dict routine where users modify what state is used to save the object, vs having them modify __getstate__ (and maybe __setstate__)? At first glance I really feel that this complicates the interface just to accommodate some sort of power user who could use the regular route to begin with if they need special behaviour.

@ligerzero-ai
Contributor

The question about dump or save is a question of target demographic.

Who do we envision to be interacting with the code?

Will it be full-on software developers (imho, unlikely) or scientists who know a little bit about programming (which I think is the target audience for pyiron)? "save" is linguistically simple, and what it does should be easily understood by users. It doesn't require knowledge about python serialisation, which usually uses the syntax "dump".

So perhaps the easiest way to define the syntax and expected interfaces would be to decide which level of user is present at the different levels of the interface. Then defining stuff should be easier, since we can say that there is a certain level of expected knowledge for users interacting with specific parts of our code.

@liamhuber
Member Author

liamhuber commented Oct 1, 2024

Ok, I resolved the merge conflict and updated this with my dream specs.

I would say that's a requirement for the tools using the storage interface and not part of the specs for the storage interface.

To me it is very important that we have both definitions somewhere

I went a step further and broke this down to three levels: user, generic interface, a particular storage back-end. This gives us some nice flexibility, as the user-facing interface can implement things like save and load per @ligerzero-ai's comment, while different back-ends can adhere more closely to pickle and implement dump.

Now we have to define two interfaces: how can a developer make sure that their objects can be stored using our storage interface? This should be very simple, with minimal requirements. The second interface is: what is required to implement a new file format in our storage interface? How do we define the mapping of the data types defined by the file format to the data types supported by python? Both of these interfaces should be defined in our storage specs.

IMO this is what kills storage in pyiron_base -- we can't have some limited subset of data types that are handled by a storage tool while everything else isn't; all our storage tools need to just work. For me that means: if you can pickle it, you can save it and load it with our tool. Even better, our tool should fall back on cloudpickle and just store the bytes if it fails in its first attempt (I call this the "cloudpickle idea" in the text). It's the responsibility of the back-end designer to ensure that their tool is powerful enough to handle (almost) everything.

At the extreme, I could see how some clear subset of objects gets blacklisted. E.g. a back-end might not be capable of handling "cyclic" objects (parent.child.parent is parent); in this case the data type developer can go into things like __getstate__ and __setstate__ to purge and re-instate these sorts of cyclic connections. I do this in pyiron_workflow for optimization reasons (I don't want to drag the whole graph along when I send a single node to the executor), but ideally a given storage routine shouldn't require it. I would be comfortable making exceptions if the back-end routine is clear enough about what it doesn't support and fails cleanly when a user passes something unsupported in, but I want to work in a blacklisting paradigm, not a whitelisting paradigm.
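For illustration, a minimal sketch of that purge/re-instate pattern (the class names are invented; pyiron_workflow's actual implementation is more involved):

```python
import pickle


class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent  # cyclic: workflow.children[label] is this node

    def __getstate__(self):
        state = self.__dict__.copy()
        state["parent"] = None  # purge the back-reference so the whole graph is not dragged along
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # the owning workflow is expected to re-register the node after loading


class Workflow:
    def __init__(self):
        self.children = {}

    def add(self, node):
        self.children[node.label] = node
        node.parent = self  # creates the cycle a naive back-end might choke on
        return node


wf = Workflow()
node = wf.add(Node("n1"))
assert pickle.loads(pickle.dumps(node)).parent is None  # cycle purged on storage
```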

JNmpi and others added 2 commits October 2, 2024 13:43
added concept of default data types that automatically support (json, hdf5) storage

@jan-janssen
Member

Looks good to me

@jan-janssen
Member

Beyond the points included in this pull request:

  • Lazy loading is one of the core concepts of pyiron, so I think we should define it in a bit more detail. What kind of object is returned when an object is lazy loaded from a file? What kind of functionality does the object have after lazy loading? What kind of functionality requires loading the full object? (A sketch of one possible lazy-loading proxy is given below this list.) On the pyiron_base side we never really figured this out, and the implementation based on the datacontainer resulted in frequent file opening and closing.
  • Can we already in the specs define the conversion of the data types which are available in python to the subset of data types which are supported by the different file formats? For example, HDF5 provides groups and nodes, with nodes being individual datasets as multi-dimensional arrays and groups being collections of nodes or sub-groups. Tools like h5io store the types of the nodes in the node attributes. In addition, the datacontainer currently creates separate nodes named TYPE to store the type of the pyiron objects, which are stored as a collection of HDF5 groups and nodes, with each node being stored via h5io. In a similar way we should define the conversion to formats like JSON, which have an even more restrictive representation. In h5io the different storage options are defined in the _triage_write() function.
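On the lazy loading point above, one way to make the question concrete: a sketch of a proxy that only touches the file when an attribute is actually accessed (h5py-style access is assumed here; this is a discussion aid, not an agreed design):

```python
import h5py


class LazyGroup:
    """Proxy for an HDF5 group: nothing is read at construction time; each
    attribute access opens the file, reads one node, and closes it again."""

    def __init__(self, filename, group="/"):
        self._filename = filename
        self._group = group

    def __getattr__(self, name):
        # note: re-opening the file on every access is exactly the cost the
        # datacontainer implementation ran into; caching would be the next step
        with h5py.File(self._filename, "r") as f:
            node = f[self._group][name]
            if isinstance(node, h5py.Group):
                return LazyGroup(self._filename, node.name)
            return node[()]  # only here is the dataset actually read into memory
```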

liamhuber merged commit 68ba4d3 into main on Oct 5, 2024