-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDF5 dataset format: how to convert #317
Comments
Hi Robert, at the moment there is no script that converts data from the command line. When running Cross-fold-train, the data is copied to hdf5 before the training starts, have a look here: calamari/calamari_ocr/ocr/training/cross_fold.py Lines 77 to 90 in 3b1969b
For my own training, I've hacked together some lines of code at https://github.com/andbue/nashi/blob/master/ocr/nashi_ocr/nashi_client.py to save preprocessed data in a single hdf5 file, so I can re-run training and prediction the need for preprocessing the images again. If I had the time, it would be sensible to integrate some of that into calamari, I guess. |
Hi Andreas – thanks for your fast feedback! I think I understood the writer part, but could you please fill me in on the reader side (for file pairs)? What's the minimal / best pattern to instantiate a data generator – |
That's where I would have started as well. Maybe a copy of dataset_viewer.py, setting PipelineMode.EVALUATION, writing sample.inputs and sample.targets to the Hdf5DatasetWriter instead of showing them in pyplot. If I'm not totally mistaken, this should work with all kinds of datasets. Just in case you end up with something helpful for other users as well: feel free to put it in a PR! |
Understood, thanks! I'll give it a try. |
I presume training on HDF5 will be more efficient than any of the other formats. And at least against the line GT file pairs, filesystem performance might be much better, too.
So my question is: how do I convert existing datasets into HDF5 format?
The text was updated successfully, but these errors were encountered: