HDF5 dataset format: how to convert #317

bertsky · 2022-04-27T13:09:32Z

I presume training on HDF5 will be more efficient than any of the other formats. And at least against the line GT file pairs, filesystem performance might be much better, too.

So my question is: how do I convert existing datasets into HDF5 format?

andbue · 2022-04-27T14:18:56Z

Hi Robert, at the moment there is no script that converts data from the command line. When running cross-fold training, the data is copied to HDF5 files before the training starts. Have a look here:

calamari/calamari_ocr/ocr/training/cross_fold.py

Lines 77 to 90 in 3b1969b

    # else load the data of each fold and write it to hd5 data files
    with ExitStack() as stack:
        folds = [
            stack.enter_context(Hdf5DatasetWriter(os.path.join(self.output_dir, "fold{}".format(i))))
            for i in range(self.n_folds)
        ]
        for i, sample in tqdm_wrapper(
            enumerate(data_generator.generate()),
            progress_bar=progress_bar,
            total=len(data_generator),
            desc="Creating hdf5 files",
        ):
            sample: Sample = sample
            folds[i % self.n_folds].write(sample.inputs, sample.targets)

For my own training, I've hacked together some lines of code at https://github.com/andbue/nashi/blob/master/ocr/nashi_ocr/nashi_client.py to save preprocessed data in a single HDF5 file, so I can re-run training and prediction without the need to preprocess the images again. If I had the time, it would be sensible to integrate some of that into calamari, I guess.
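
To make the fold-splitting pattern in the excerpt above easier to try in isolation, here is a minimal sketch using only the standard library. Note the assumptions: `JsonLinesWriter` is a hypothetical stand-in for `Hdf5DatasetWriter` (it writes JSON lines, not HDF5), and `split_into_folds` is an illustrative name, not a Calamari function. The only parts taken from the excerpt are the `ExitStack` handling of several writers and the `i % n_folds` round-robin assignment.

```python
import json
import os
from contextlib import ExitStack


class JsonLinesWriter:
    """Hypothetical stand-in for Hdf5DatasetWriter: one JSON record per line."""

    def __init__(self, path):
        self.path = path + ".jsonl"
        self._fh = None

    def __enter__(self):
        self._fh = open(self.path, "w", encoding="utf-8")
        return self

    def __exit__(self, exc_type, exc, tb):
        self._fh.close()
        return False  # do not swallow exceptions

    def write(self, inputs, targets):
        self._fh.write(json.dumps({"inputs": inputs, "targets": targets}) + "\n")


def split_into_folds(samples, output_dir, n_folds):
    """Distribute (inputs, targets) samples round-robin over n_folds writers.

    Returns the list of written file paths, one per fold.
    """
    with ExitStack() as stack:
        # ExitStack closes every writer on exit, even if a later one fails to open.
        folds = [
            stack.enter_context(JsonLinesWriter(os.path.join(output_dir, "fold{}".format(i))))
            for i in range(n_folds)
        ]
        for i, (inputs, targets) in enumerate(samples):
            folds[i % n_folds].write(inputs, targets)
        return [writer.path for writer in folds]
```

Swapping the stand-in for the real writer should only require that it is a context manager with a `write(inputs, targets)` method, which is how the excerpt uses it.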

bertsky · 2022-04-27T17:51:32Z

Hi Andreas – thanks for your fast feedback!

I think I understood the writer part, but could you please fill me in on the reader side (for file pairs)? What's the minimal / best pattern to instantiate a data generator – scripts.dataset_viewer.DataWrapper perhaps?

andbue · 2022-04-27T18:16:04Z

That's where I would have started as well. Maybe a copy of dataset_viewer.py, setting PipelineMode.EVALUATION, writing sample.inputs and sample.targets to the Hdf5DatasetWriter instead of showing them in pyplot. If I'm not totally mistaken, this should work with all kinds of datasets. Just in case you end up with something helpful for other users as well: feel free to put it in a PR!
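
For the file-pair side, a rough sketch of the loop could look like the following. It is not the Calamari data pipeline: it bypasses Calamari's preprocessing entirely and only shows pairing line images with their GT files and passing each pair to a writer. The names `collect_file_pairs`, `convert` and `load_image` are hypothetical, and the suffixes `.png` / `.gt.txt` as well as single-suffix image names (so `0001.png`, not `0001.bin.png`) are assumptions. The `writer` argument is anything with a `write(inputs, targets)` method, as in the cross-fold excerpt.

```python
from pathlib import Path


def collect_file_pairs(directory, image_suffix=".png", gt_suffix=".gt.txt"):
    """Return sorted (image_path, ground_truth_text) pairs.

    Images without a matching GT file are skipped.
    """
    pairs = []
    for image_path in sorted(Path(directory).glob("*" + image_suffix)):
        # Assumption: the GT file shares the image name minus its suffix.
        stem = image_path.name[: -len(image_suffix)]
        gt_path = image_path.with_name(stem + gt_suffix)
        if gt_path.is_file():
            text = gt_path.read_text(encoding="utf-8").strip()
            pairs.append((image_path, text))
    return pairs


def convert(directory, writer, load_image):
    """Write every pair via writer.write(inputs, targets); return the sample count.

    load_image is a caller-supplied (hypothetical) function mapping a path
    to whatever the writer expects as inputs, e.g. a decoded image array.
    """
    count = 0
    for image_path, text in collect_file_pairs(directory):
        writer.write(load_image(image_path), text)
        count += 1
    return count
```

Going through a copy of dataset_viewer.py as suggested above would presumably be preferable where the stored samples should already be preprocessed, since this sketch stores whatever `load_image` returns.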

bertsky · 2022-04-27T18:30:29Z

Understood, thanks! I'll give it a try.

bertsky added the enhancement (New feature or request) and performance (Concerns the computational efficiency) labels on Oct 2, 2024
