HDF5 dataset format: how to convert #317

Open
bertsky opened this issue Apr 27, 2022 · 4 comments
Labels
enhancement New feature or request performance Concerns the computational efficiency

Comments

@bertsky
Collaborator

bertsky commented Apr 27, 2022

I presume training on HDF5 will be more efficient than on any of the other formats. And at least compared with the line GT file pairs, filesystem performance might be much better, too.

So my question is: how do I convert existing datasets into HDF5 format?

@andbue
Member

andbue commented Apr 27, 2022

Hi Robert, at the moment there is no script that converts data from the command line. When running Cross-fold-train, the data is copied to HDF5 files before the training starts. Have a look here:

    # else load the data of each fold and write it to hd5 data files
    with ExitStack() as stack:
        folds = [
            stack.enter_context(Hdf5DatasetWriter(os.path.join(self.output_dir, "fold{}".format(i))))
            for i in range(self.n_folds)
        ]
        for i, sample in tqdm_wrapper(
            enumerate(data_generator.generate()),
            progress_bar=progress_bar,
            total=len(data_generator),
            desc="Creating hdf5 files",
        ):
            sample: Sample = sample
            folds[i % self.n_folds].write(sample.inputs, sample.targets)
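
To make the pattern in that excerpt easier to follow in isolation, here is a minimal sketch of the same round-robin distribution using only the standard library. `ListWriter` and `split_into_folds` are hypothetical stand-ins and not part of Calamari. The only thing assumed about the real `Hdf5DatasetWriter` is what the excerpt shows: it works as a context manager and offers `write(inputs, targets)`.

```python
from contextlib import ExitStack


class ListWriter:
    """Hypothetical stand-in for Hdf5DatasetWriter: collects samples in memory."""

    def __init__(self, name):
        self.name = name
        self.samples = []
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # A real writer would presumably flush remaining data to disk here.
        self.closed = True
        return False

    def write(self, inputs, targets):
        self.samples.append((inputs, targets))


def split_into_folds(samples, n_folds):
    """Distribute samples round-robin over n_folds writers."""
    with ExitStack() as stack:
        # ExitStack closes every writer on exit, even if a later one fails to open.
        folds = [stack.enter_context(ListWriter("fold{}".format(i))) for i in range(n_folds)]
        for i, (inputs, targets) in enumerate(samples):
            folds[i % n_folds].write(inputs, targets)
    return folds


demo = [("img{}".format(i), "text{}".format(i)) for i in range(7)]
folds = split_into_folds(demo, 3)
print([len(f.samples) for f in folds])  # [3, 2, 2]
```

With a single fold (`n_folds=1`), the same loop amounts to a plain conversion of all samples into one output.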

For my own training, I've hacked together some lines of code at https://github.com/andbue/nashi/blob/master/ocr/nashi_ocr/nashi_client.py to save preprocessed data in a single HDF5 file, so I can re-run training and prediction without the need for preprocessing the images again. If I had the time, it would be sensible to integrate some of that into Calamari, I guess.

@bertsky
Collaborator Author

bertsky commented Apr 27, 2022

Hi Andreas – thanks for your fast feedback!

I think I understood the writer part, but could you please fill me in on the reader side (for file pairs)? What's the minimal / best pattern to instantiate a data generator – scripts.dataset_viewer.DataWrapper perhaps?

@andbue
Member

andbue commented Apr 27, 2022

That's where I would have started as well. Maybe a copy of dataset_viewer.py, setting PipelineMode.EVALUATION, writing sample.inputs and sample.targets to the Hdf5DatasetWriter instead of showing them in pyplot. If I'm not totally mistaken, this should work with all kinds of datasets. Just in case you end up with something helpful for other users as well: feel free to put it in a PR!
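
As a rough illustration of the conversion loop described here, the following sketch pairs line images with their ground-truth files and feeds them to any object that has a `write(inputs, targets)` method. It deliberately does not use Calamari's data pipeline: `iter_file_pairs`, `convert`, `CollectingWriter`, and the `.png` / `.gt.txt` suffixes are assumptions for illustration, and raw file bytes stand in for the preprocessed image that the real pipeline would provide as `sample.inputs`.

```python
import tempfile
from pathlib import Path


def iter_file_pairs(directory, image_suffix=".png", gt_suffix=".gt.txt"):
    """Yield (image_path, text) for every image with a matching ground-truth file."""
    for image_path in sorted(Path(directory).glob("*" + image_suffix)):
        stem = image_path.name[: -len(image_suffix)]
        gt_path = image_path.with_name(stem + gt_suffix)
        if gt_path.exists():
            yield image_path, gt_path.read_text(encoding="utf-8").strip()


def convert(directory, writer):
    """Feed all pairs to any object with a write(inputs, targets) method."""
    count = 0
    for image_path, text in iter_file_pairs(directory):
        # Assumption: raw bytes stand in for the preprocessed image array.
        writer.write(image_path.read_bytes(), text)
        count += 1
    return count


class CollectingWriter:
    """Hypothetical writer that keeps samples in memory for the demo."""

    def __init__(self):
        self.items = []

    def write(self, inputs, targets):
        self.items.append((inputs, targets))


with tempfile.TemporaryDirectory() as tmp:
    for stem, text in [("line1", "Hello"), ("line2", "World")]:
        Path(tmp, stem + ".png").write_bytes(b"fake image data")
        Path(tmp, stem + ".gt.txt").write_text(text + "\n", encoding="utf-8")
    # An image without ground truth is skipped.
    Path(tmp, "orphan.png").write_bytes(b"no ground truth")
    writer = CollectingWriter()
    n_written = convert(tmp, writer)

print(n_written)  # 2
```

In an actual converter, the generator would be replaced by the samples coming out of the data pipeline, and `CollectingWriter` by the `Hdf5DatasetWriter` used as a context manager.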

@bertsky
Collaborator Author

bertsky commented Apr 27, 2022

Understood, thanks! I'll give it a try.

@bertsky bertsky added enhancement New feature or request performance Concerns the computational efficiency labels Oct 2, 2024