Specify storage usability #1

Merged
5 commits merged into main on Oct 5, 2024

Conversation

liamhuber
Member

What should the requirement be for something to be usable in our storage? Just like we follow concurrent.futures.Executor, I think we should follow pickle for requiring inputs. Of course, what the output should look like (hierarchical, etc.) is a separate question.

@samwaseda
Member

I didn't notice that you were editing the same file XD

I think we should follow pickle for requiring inputs

What do you mean by this?

@liamhuber
Member Author

I think we can merge yours first, then break the doc into two parts: (1) what objects are allowed to be stored? (what my PR concerns itself with) and (2) what do they look like / what can we do with them after they're stored? (what your PR concerns itself with)

I should also be more precise here. I mean to say that the back-ends should leverage __reduce__, or, at a minimum, be able to store anything that pickle would be able to store by leveraging __reduce__.
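To make that concrete, here is a minimal sketch of the __reduce__ contract a back-end could rely on (the Point class and the naive_dump helper are made up for illustration, not proposed API):

```python
import pickle


class Point:
    """Toy object; anything pickle can handle exposes its state via __reduce__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __reduce__(self):
        # pickle rebuilds the object by calling the first element with the second,
        # i.e. Point(1, 2) round-trips as (Point, (1, 2))
        return self.__class__, (self.x, self.y)


def naive_dump(obj):
    """Made-up back-end entry point: store whatever pickle can, via __reduce__."""
    constructor, args = obj.__reduce__()[:2]
    return {"constructor": constructor.__qualname__, "args": list(args)}


restored = pickle.loads(pickle.dumps(Point(1, 2)))
assert (restored.x, restored.y) == (1, 2)
print(naive_dump(Point(1, 2)))  # {'constructor': 'Point', 'args': [1, 2]}
```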

@samwaseda
Member

I should also be more precise here. I mean to say that the back-ends should leverage __reduce__, or, at a minimum, be able to store anything that pickle would be able to store by leveraging __reduce__.

ok but doesn't it mean that my definition of dict is too restrictive? Shouldn't it be replaced by __reduce__?

Actually I'm gonna merge mine anyway and we can change it.

@liamhuber
Member Author

Yes and also yes 😂

@jan-janssen
Member

Can we start a bit more abstract?

@samwaseda
Member

Can we start a bit more abstract?

I don't know what you mean. Feel free to append it to the text.

Yes, I stopped working today after my first PR but otherwise I would have added it. I guess you can safely add this part.

  • The developer should have the option to overload the default interface of the standard library to provide a serialisation specifically for pyiron. For example, in addition to __reduce__ and __setstate__, objects in pyiron_base currently implement to_dict() and from_dict() to provide a reduced representation.

I would say that's a requirement for the tools using the storage interface and not part of the specs for the storage interface.

  • We would like to support lazy loading when the developers for the specific objects implement it

It's an excellent point, but we have to properly define what lazy loading is.

  • We would like to support multiple file formats (HDF5, JSON, ...). While the developers can provide optimised storage routines, we could also provide standardised converters from the python data types to the data types of the specific file format. For example, in the case of HDF5 these conversions are defined in h5io: https://github.com/h5io/h5io/blob/main/h5io/_h5io.py#L184

YES! I guess we can pretty much take over the same definition for json? I'm ok with almost copy-pasting the conversions from h5io, or creating a separate module for json.
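To make the copy-over idea concrete, a rough sketch of a json converter in the same spirit (the function name, the TYPE tagging, and the exact type coverage here are assumptions, not copied from h5io):

```python
import json

import numpy as np


def to_json_safe(value):
    """Recursively convert common python/numpy types to JSON-storable ones,
    tagging each value with its original type so it can be restored later."""
    if isinstance(value, np.ndarray):
        return {"TYPE": "ndarray", "dtype": str(value.dtype), "data": value.tolist()}
    if isinstance(value, (list, tuple)):
        return {"TYPE": type(value).__name__, "data": [to_json_safe(v) for v in value]}
    if isinstance(value, dict):
        return {"TYPE": "dict", "data": {k: to_json_safe(v) for k, v in value.items()}}
    if isinstance(value, (str, int, float, bool)) or value is None:
        return {"TYPE": type(value).__name__, "data": value}
    raise TypeError(f"No JSON conversion defined for {type(value)}")


print(json.dumps(to_json_safe({"positions": np.zeros((2, 3)), "tag": "demo"})))
```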

Overall, I think you can open a new PR with these points or add them to this PR.

@jan-janssen
Member

Overall, I think you can open a new PR with these points or add them to this PR.

I would like to have one PR per topic; this hopefully simplifies the discussion a bit. So if Liam agrees, he can either integrate my suggestions, or he can merge his pull request and I will open a new one with the changes discussed here.

@jan-janssen
Member

I would say that's a requirement for the tools using the storage interface and not part of the specs for the storage interface.

To me it is very important that we have both definitions somewhere - maybe even both in this file, to define what we promise to software using the storage interface and what we require from software that wants to use the storage interface.

@samwaseda
Member

samwaseda commented Oct 1, 2024

To me it is very important that we have both definitions somewhere - maybe even both in this file, to define what we promise to software using the storage interface and what we require from software that wants to use the storage interface.

ok, that's true; let me reformulate my point: it sounds to me like it's just a question of either:

save(my_object)

or

save(my_object.to_dict())

Correct? I don't have a very very strong feeling, but to me it sounds like the addition of to_dict and from_dict only complicates the specs with little gain. But as I said, the addition doesn't make it extremely complex, so if you really want to have it inside the storage specs I'm probably ok with that.

@jan-janssen
Member

I am strongly in favor of the first option; the question is primarily what happens internally, and do we call it save or dump? If the object has a to_dict() function then we want to call to_dict(), or potentially even to_hdf() if an object has an optimized storage routine for a specific file format. Otherwise we just try to call __getstate__() or __reduce__(), which starting from python 3.11 is defined for every python object. If the storage via __getstate__() / __reduce__() fails, we can still use cloudpickle to get a binary representation and then store the binary string. This no longer allows us to access attributes of the stored object without loading the full object, but it is at least a fallback option.
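A rough sketch of that order of preference (the save name, the backend methods, and the exact fallback chain are placeholders for discussion, not settled API):

```python
import pickle

import cloudpickle  # third-party fallback for objects the pickle protocol cannot handle


def save(obj, backend):
    """Hypothetical front-end: try the richest representation first,
    fall back to an opaque byte string only if everything else fails."""
    if hasattr(obj, "to_hdf") and getattr(backend, "format", None) == "hdf5":
        backend.write_native(obj)          # format-optimised routine provided by the object
    elif hasattr(obj, "to_dict"):
        backend.write_dict(obj.to_dict())  # pyiron-specific reduced representation
    else:
        try:
            backend.write_reduced(obj.__reduce_ex__(pickle.HIGHEST_PROTOCOL))
        except Exception:
            # opaque: attributes can no longer be browsed without a full load
            backend.write_bytes(cloudpickle.dumps(obj))
```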

Now we have to define two interfaces: how can a developer make sure that their objects can be stored using our storage interface? This should be very simple, with minimal requirements. The second interface is: what is required to implement a new file format in our storage interface? How do we define the mapping of the data types defined by the file format to the data types supported by python? Both of these interfaces should be defined in our storage specs.

@samwaseda
Member

ok then let's add to_dict.

About dump or save: @ligerzero-ai recently pointed out that you get an answer a lot faster by writing something wrong on the internet than by trying to write the right answer. In this spirit my current strategy is to push whatever comes to my mind and let it be corrected by you guys. So don't take anything I pushed as a fait accompli. I'm equally fine with dump or save.

@samwaseda
Member

In other words: Feel free to open new PRs overwriting what I've already pushed. I don't take it personally.

@liamhuber
Member Author

I am a strong advocate of the first option, to the extent that I don't want to use the to_dict business at all. What is the advantage of having a whole second to_dict routine where users modify what state is used to save the object, vs having them modify __getstate__ (and maybe __setstate__)? At first glance I really feel that this complicates the interface just to accommodate some sort of power user who could use the regular route to begin with if they need special behaviour.

@ligerzero-ai
Contributor

The question about dump or save is a question of target demographic.

Who do we envision to be interacting with the code?

Will it be full-on software developers (imho, unlikely) or scientists who know a little bit about programming (which I think is the target audience for pyiron)? "save" is linguistically simple, and what it does should be easily understood by users. It doesn't require knowledge about python serialisation, which usually uses the syntax "dump".

So perhaps the easiest way to define the syntax and expected interfaces would be to decide which level of user is present at the different levels of the interface. Then defining stuff should be easier, since we can say that there is a certain level of expected knowledge for users interacting with specific parts of our code.

@liamhuber
Member Author

liamhuber commented Oct 1, 2024

Ok, I resolved the merge conflict and updated this with my dream specs.

I would say that's a requirement for the tools using the storage interface and not part of the specs for the storage interface.

To me it is very important that we have both definitions somewhere

I went a step further and broke this down to three levels: user, generic interface, a particular storage back-end. This gives us some nice flexibility, as the user-facing interface can implement things like save and load per @ligerzero-ai's comment, while different back-ends can adhere more closely to pickle and implement dump.

Now we have to define two interfaces: how can a developer make sure that their objects can be stored using our storage interface? This should be very simple, with minimal requirements. The second interface is: what is required to implement a new file format in our storage interface? How do we define the mapping of the data types defined by the file format to the data types supported by python? Both of these interfaces should be defined in our storage specs.

IMO this is what kills storage in pyiron_base -- we can't have some limited subset of data types that are handled by a storage tool while everything else isn't; all our storage tools need to just work. For me that means: if you can pickle it, you can save it and load it with our tool. Even better, our tool should fall back on cloudpickle and just store the bytes if it fails in its first attempt (I call this the "cloudpickle idea" in the text). It's the responsibility of the back-end designer to ensure that their tool is powerful enough to handle (almost) everything.

At the extreme, I could see how some clear subset of objects gets blacklisted. E.g. a back-end might not be capable of handling "cyclic" objects (parent.child.parent is parent); in this case the data type developer can go into things like __getstate__ and __setstate__ to purge and re-instate these sorts of cyclic connections. I do this in pyiron_workflow for optimization reasons (I don't want to drag the whole graph along when I send a single node to the executor), but ideally a given storage routine shouldn't require it. I would be comfortable making exceptions if the back-end routine is clear enough about what it doesn't support and fails cleanly when a user passes something unsupported in, but I want to work in a blacklisting paradigm, not a whitelisting paradigm.
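For illustration, a minimal sketch of that purge/re-instate pattern (the class names are invented; pyiron_workflow's actual implementation is more involved):

```python
import pickle


class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent  # cyclic: workflow.children[label] is this node

    def __getstate__(self):
        state = self.__dict__.copy()
        state["parent"] = None  # purge the back-reference so the whole graph is not dragged along
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # the owning workflow is expected to re-register the node after loading


class Workflow:
    def __init__(self):
        self.children = {}

    def add(self, node):
        self.children[node.label] = node
        node.parent = self  # creates the cycle a naive back-end might choke on
        return node


wf = Workflow()
node = wf.add(Node("n1"))
assert pickle.loads(pickle.dumps(node)).parent is None  # cycle purged on storage
```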

JNmpi and others added 2 commits October 2, 2024 13:43
added concept of default data types that automatically support (json, hdf5) storage

@jan-janssen
Member

Looks good to me

@jan-janssen
Member

Beyond the points included in this pull request:

  • Lazy loading is one of the core concepts of pyiron, so I think we should define it in a bit more detail. What kind of object is returned when an object is lazy loaded from a file? What kind of functionality does the object have after lazy loading? What kind of functionality requires loading the full object? (A sketch of one possible lazy-loading proxy is given below this list.) On the pyiron_base side we never really figured this out, and the implementation based on the datacontainer resulted in frequent file opening and closing.
  • Can we already in the specs define the conversion of the data types which are available in python to the subset of data types which are supported by the different file formats? For example, HDF5 provides groups and nodes, with nodes being individual datasets as multi-dimensional arrays and groups being collections of nodes or sub-groups. Tools like h5io store the types of the nodes in the node attributes. In addition, the datacontainer currently creates separate nodes named TYPE to store the type of the pyiron objects, which are stored as a collection of HDF5 groups and nodes, with each node being stored via h5io. In a similar way we should define the conversion to formats like JSON, which have an even more restrictive representation. In h5io the different storage options are defined in the _triage_write() function.
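On the lazy loading point above, one way to make the question concrete: a sketch of a proxy that only touches the file when an attribute is actually accessed (h5py-style access is assumed here; this is a discussion aid, not an agreed design):

```python
import h5py


class LazyGroup:
    """Proxy for an HDF5 group: nothing is read at construction time; each
    attribute access opens the file, reads one node, and closes it again."""

    def __init__(self, filename, group="/"):
        self._filename = filename
        self._group = group

    def __getattr__(self, name):
        # note: re-opening the file on every access is exactly the cost the
        # datacontainer implementation ran into; caching would be the next step
        with h5py.File(self._filename, "r") as f:
            node = f[self._group][name]
            if isinstance(node, h5py.Group):
                return LazyGroup(self._filename, node.name)
            return node[()]  # only here is the dataset actually read into memory
```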

liamhuber merged commit 68ba4d3 into main on Oct 5, 2024