What files go into Training_Data folder, and Validation_Data? #2

Open
ciobania opened this issue Aug 14, 2022 · 1 comment

Comments

@ciobania

Hello,

Thanks for sharing this, and making it easier to work with.

What exactly goes into the .box files? Are those files generated via JTextEditor/QtTextEditor, per line, or per character?

Also, how come we don't need truth files, as the documentation says? Admittedly, the documentation has not changed in years, so it may be obsolete.

@RawthiL
Owner

RawthiL commented Aug 15, 2022

Hello,

The .box files contain whole words, split into individual letters, something like this:

H 237 2686 593 2743 0
E 237 2686 593 2743 0
L 237 2686 593 2743 0
L 237 2686 593 2743 0
O 237 2686 593 2743 0
	 237 2686 593 2743 0
T 242 2625 735 2676 0
H 242 2625 735 2676 0
E 242 2625 735 2676 0
R 242 2625 735 2676 0
E 242 2625 735 2676 0

Those are two separate words, each in its own bounding box (BBox). Each letter has its own entry, but every letter in a word shares that word's BBox.
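
To make the layout above concrete, here is a minimal Python sketch that reproduces the sample. The function names are made up for illustration, and it assumes the usual Tesseract field order (character, left, bottom, right, top, page). It also assumes a tab entry sits between consecutive words and reuses the preceding word's BBox, which is simply what the sample shows.

```python
def word_to_box_lines(word, bbox, page=0):
    """Return one box line per character; every line shares the word's BBox."""
    left, bottom, right, top = bbox
    return [f"{ch} {left} {bottom} {right} {top} {page}" for ch in word]


def words_to_box_lines(words, page=0):
    """Build box lines for a list of (word, bbox) pairs.

    Assumption taken from the sample: a tab entry separates consecutive
    words and carries the BBox of the word that precedes it.
    """
    lines = []
    for index, (word, bbox) in enumerate(words):
        lines.extend(word_to_box_lines(word, bbox, page))
        if index < len(words) - 1:
            lines.extend(word_to_box_lines("\t", bbox, page))
    return lines


sample = [
    ("HELLO", (237, 2686, 593, 2743)),
    ("THERE", (242, 2625, 735, 2676)),
]
box_lines = words_to_box_lines(sample)
print("\n".join(box_lines))
```

Writing `"\n".join(box_lines)` to a file named after the image (same base name, `.box` extension) would give a file in the shape shown above.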

I built these .box files from the annotation file produced by the VIA software. VIA creates a .json file, which I parsed to generate the .box files and the .tiff images.
I tried the JTextEditor/QtTextEditor software, but I needed some extra functionality that it did not have (nothing to do with the OCR training), so I have no experience using it.
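
For anyone attempting the same VIA-to-box conversion, the following is a rough sketch and not the parser used in this repository. It assumes VIA rectangle regions with `shape_attributes` holding `x`, `y`, `width`, and `height` measured from the top-left corner, and a region attribute named `text` containing the transcription (that attribute name is user-defined in VIA, so it is a guess here). Tesseract box coordinates are measured from the bottom-left corner, so the image height is needed to flip the y axis. The image height of 3000 in the example is hypothetical.

```python
def via_rect_to_box(shape, image_height):
    """Convert a VIA rect (top-left origin) to (left, bottom, right, top)
    in Tesseract's bottom-left-origin coordinates."""
    left = shape["x"]
    right = shape["x"] + shape["width"]
    top = image_height - shape["y"]
    bottom = image_height - (shape["y"] + shape["height"])
    return left, bottom, right, top


def via_regions_to_box_lines(regions, image_height, text_key="text", page=0):
    """Emit one box line per character for every rectangular region.

    `text_key` is an assumed attribute name; adjust it to your VIA project.
    Separator entries (such as the tab line in the sample) are not emitted.
    """
    lines = []
    for region in regions:
        shape = region["shape_attributes"]
        if shape.get("name") != "rect":
            continue  # this sketch only handles rectangles
        word = region["region_attributes"].get(text_key, "")
        left, bottom, right, top = via_rect_to_box(shape, image_height)
        lines.extend(f"{ch} {left} {bottom} {right} {top} {page}" for ch in word)
    return lines


# Hypothetical region that maps onto the first word of the sample.
regions = [
    {
        "shape_attributes": {"name": "rect", "x": 237, "y": 257,
                             "width": 356, "height": 57},
        "region_attributes": {"text": "HELLO"},
    }
]
via_lines = via_regions_to_box_lines(regions, image_height=3000)
print("\n".join(via_lines))
```

The region list would normally come from the VIA project file loaded with `json.load`, and the image height from the image itself.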

Regarding the truth files, I'm not sure which files you mean. I trained the Tesseract 5.x LSTM models, which only need the .box and .tiff files to create the .lstmf files used for training.
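
As a rough illustration of that step, the sketch below builds the Tesseract command that turns an image and its matching .box file into an .lstmf file. It assumes the `tesseract` executable is on the PATH and that the `lstm.train` config shipped with Tesseract is available. The page segmentation mode of 6 is an assumption and may need changing for your images, and the helper names are made up.

```python
import subprocess
from pathlib import Path


def lstmf_command(tiff_path, psm=6):
    """Build the command that writes <base>.lstmf next to the image.

    Tesseract expects a .box file with the same base name as the image.
    """
    tiff = Path(tiff_path)
    output_base = tiff.with_suffix("")  # drop the image extension
    return ["tesseract", str(tiff), str(output_base),
            "--psm", str(psm), "lstm.train"]


def make_lstmf(tiff_path, psm=6):
    """Run Tesseract; raises CalledProcessError if the command fails."""
    subprocess.run(lstmf_command(tiff_path, psm), check=True)


command = lstmf_command("page1.tiff")
print(" ".join(command))
```

Calling `make_lstmf("page1.tiff")` in a directory containing `page1.tiff` and `page1.box` should then produce `page1.lstmf`, if the assumptions above hold.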
