Specify storage usability #1

Merged (5 commits, Oct 5, 2024)

storage.md (50 additions, 19 deletions)

# Storage Interface
The storage interface is designed to store input and output data for long-term storage and archiving.

## Requirements and promises for users

Users should be able to store any Python object that can be pickled.

They can customize the storage process by overriding the standard routines used by `pickle`, i.e. `__reduce__`, `__getstate__`, `__setstate__`, `__getnewargs__`, `__getnewargs_ex__`, and/or `__reduce_ex__`.
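
For instance, a class can drop a derived attribute at save time and rebuild it at load time (a minimal sketch; the `Simulation` class and its attributes are purely illustrative):

```python
import pickle


class Simulation:
    def __init__(self, inputs: dict):
        self.inputs = inputs
        self._cache = self._expensive_setup()  # derived, not worth storing

    def _expensive_setup(self):
        return sum(self.inputs.values())

    def __getstate__(self):
        # Serialize everything except the derived cache
        state = self.__dict__.copy()
        del state["_cache"]
        return state

    def __setstate__(self, state):
        # Restore the stored attributes, then rebuild what was dropped
        self.__dict__.update(state)
        self._cache = self._expensive_setup()


restored = pickle.loads(pickle.dumps(Simulation({"n_steps": 3})))
assert restored._cache == 3
```

Any back-end that honors these hooks sees the same reduced state that `pickle` does.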

In addition to whatever other benefits a particular back-end brings, it should (de)serialize Python objects via these methods in a way that is consistent with the behavior of `pickle`.

> `cloudpickle` idea: Users always get successful storage, even for un-pickleable objects -- but at the cost that they might lose the benefits of their back-end, as we simply `cloudpickle` their object and then store the byte stream

pyiron is designed to provide default serialization to formats suitable for long-term storage, such as HDF5 or JSON, without requiring the user to write or modify storage routines such as `__getstate__`. The fallback to (cloud)pickle should be avoided whenever possible. The following types are supported:

- basic Python types (`int`, `float`, `str`, etc.), which work out of the box
- dataclasses consisting of fields that are serializable by default (JSON, HDF5, etc.); see the sketch after this list
  - such dataclasses are saved/loaded via their import path (this works for dataclass definitions that are file-based, i.e. that live in a Python module)
  - this works recursively, i.e. a dataclass field can itself be a dataclass
- Function nodes, which are stored by converting input and output into dictionaries (or dataclasses); the node's function is stored/loaded via its import path (equivalent to dataclasses)
- Types that cannot be serialized by default (callables/objects)
  - Workaround: store/load such data via a generator node, where the node only requires standard input.
  - Example: instead of storing an ASE structure instance, store the generator call, e.g. `atomistic.structure.built.bulk('Al')`; the generator's input, such as the species (`'Al'`), contains only standard types.
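
A minimal sketch of the import-path idea for dataclasses (the function and class names are hypothetical, and recursion into nested dataclasses is omitted for brevity):

```python
from dataclasses import dataclass, fields, is_dataclass
from importlib import import_module


@dataclass
class BulkInput:  # illustrative; any module-level dataclass works
    species: str
    cubic: bool


def to_storage_dict(obj) -> dict:
    """Flatten a dataclass into plain types plus the class's import path."""
    if not is_dataclass(obj):
        raise TypeError(f"{type(obj)} is not a dataclass")
    return {
        "import_path": f"{type(obj).__module__}.{type(obj).__qualname__}",
        "fields": {f.name: getattr(obj, f.name) for f in fields(obj)},
    }


def from_storage_dict(data: dict):
    """Re-import the class by path and rebuild it from its stored fields."""
    module, _, name = data["import_path"].rpartition(".")
    cls = getattr(import_module(module), name)
    return cls(**data["fields"])


assert from_storage_dict(to_storage_dict(BulkInput("Al", True))) == BulkInput("Al", True)
```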

## Requirements for the user-facing interface

The pyironic storage interface should offer at least `save(obj: typing.Any, filename: pathlib.Path | str, **kwargs)`, `load(filename: pathlib.Path | str, **kwargs)`, and `delete(filename: pathlib.Path | str, **kwargs)` functions for (de)serializing objects.
Interfaces to particular back-ends may accept additional kwargs (e.g. HDF5 accepting a path to use inside a given file).

> `cloudpickle` idea: `cloudpickle_fallback: bool` is a kwarg for all interfaces? When `True`, the back-end must fall back on `cloudpickle` to package the object and store the bytes, but when `False` an un-pickleable object raises an exception?

> `back-ends` idea: The user-facing interface could let the user see (e.g. by tab completion) which different back-ends are currently available and easily choose between them?
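
A sketch of how these pieces could fit together, including one possible placement of the fallback logic (the registry and back-end objects are hypothetical, not an existing pyiron API):

```python
import pathlib
import typing

import cloudpickle

# Hypothetical registry of back-end objects, e.g. {"hdf5": ..., "json": ...}
_BACKENDS: dict[str, typing.Any] = {}


def available_backends() -> list[str]:
    """Expose the registered back-ends, e.g. for tab completion."""
    return sorted(_BACKENDS)


def save(
    obj: typing.Any,
    filename: pathlib.Path | str,
    backend: str = "hdf5",
    cloudpickle_fallback: bool = True,
    **kwargs,
) -> None:
    """Dump obj with the chosen back-end; `load` and `delete` would mirror this."""
    backend_impl = _BACKENDS[backend]
    try:
        backend_impl.dump(obj, filename, **kwargs)
    except Exception:
        # A real facade would catch the back-end's specific serialization error
        if not cloudpickle_fallback:
            raise
        # Package the object as a cloudpickle byte stream and store that instead
        backend_impl.dump(cloudpickle.dumps(obj), filename, **kwargs)
```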

## General requirements for back-ends

A pyironic storage back-end must offer at least `dump(obj: typing.Any, filename: pathlib.Path | str, **kwargs)` and `load(filename: pathlib.Path | str, **kwargs)`, and must be capable of storing any object that could be successfully pickled.
From its dumped file, it must be able to re-instantiate the Python object that was saved.

> `cloudpickle` idea: back-ends must be able to fall back on cloudpickling the object and storing the bytes if they would otherwise fail?

Back-ends must be able to secure the data they need for the dump/load cycle from the standard `pickle` methods of `__reduce__`, `__getstate__`, `__setstate__`, `__getnewargs__`, `__getnewargs_ex__`, and/or `__reduce_ex__`.
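
Concretely, the reduction protocol already hands a back-end everything it needs for that cycle; a sketch of harvesting and replaying it (custom `__setstate__` handling is omitted here):

```python
class Point:
    """A stand-in user class with no custom storage routines."""

    def __init__(self, x, y):
        self.x, self.y = x, y


obj = Point(1, 2)

# __reduce_ex__ returns (callable, args) plus up to four optional items:
# state, list items, dict items, and a custom setstate callable.
factory, args, *rest = obj.__reduce_ex__(2)
state = rest[0] if rest else None

# A back-end would persist `factory` (via its import path), `args`, and `state`;
# replaying them reconstructs the object at load time.
reloaded = factory(*args)
if state is not None:
    reloaded.__dict__.update(state)

assert (reloaded.x, reloaded.y) == (1, 2)
```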

Examples of possible back-end formats:

- [HDF5](url_for_hdf)
- [JSON](url_for_json)
- [cloudpickle](url_for_cloudpickle)

### Our dream back-end

Our dream storage interface is designed to store input and output data for long-term storage and archiving.
It should allow loading partial objects, or saving to an existing file to partially update a saved object.
It should facilitate browsing/lazy loading -- i.e. we can see what is stored and metadata about the stored object without fully re-instantiating the (partial) object as a Python object.
It stores as much versioning information as possible (module version, git hash if module is in a git repo, maybe even a hash of the raw source code?), and gives users some freedom for how strictly they want to enforce versioning at load time (ranging from "just go for it", to "look at the metadata of what is about to be loaded -- does my current environment match that metadata? If not throw an exception!").
It is fast (save/load cycle comparable to `pickle`).
It is memory efficient (storage footprint comparable to `pickle`).
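
A best-effort sketch of gathering such versioning metadata at save time (the helper name is hypothetical, and every lookup is allowed to fail gracefully):

```python
import hashlib
import inspect
import pathlib
import subprocess


def versioning_metadata(obj) -> dict:
    """Collect provenance for the module that defines obj's class."""
    module = inspect.getmodule(type(obj))
    meta = {
        "module": module.__name__,
        # top-level packages usually expose __version__; submodules may not
        "version": getattr(module, "__version__", None),
        "source_sha256": None,
        "git_hash": None,
    }
    try:
        source = inspect.getsource(module)
        meta["source_sha256"] = hashlib.sha256(source.encode()).hexdigest()
    except (OSError, TypeError):
        pass  # e.g. built-in or frozen modules have no retrievable source
    try:
        meta["git_hash"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=pathlib.Path(module.__file__).parent,
            text=True,
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError, TypeError):
        pass  # module is not inside a git repository (or git is unavailable)
    return meta
```

At load time, the same metadata can be recomputed and compared against what was stored, with the strictness of that comparison left to the user.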


# Tinybase Interface