Source of scripts/Fraktur etc. #39

Closed
mikegerber opened this issue Jun 3, 2019 · 8 comments

Comments

@mikegerber

mikegerber commented Jun 3, 2019

While the files in the top directory seem to come from the sources in the langdata repository, the source for some of the files in scripts/ is unclear:

  • scripts/Fraktur.traineddata has no matching file in langdata,
  • scripts/Japanese.traineddata has none either, and so on.

The Data-Files wiki article does not mention scripts/Fraktur.

This adds to the confusion around the frk language (not actually Frankish, but Fraktur), the Fraktur script model, and the legacy model deu_frak in the tessdata repository.
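
The mismatch described above can be checked mechanically. The following is a minimal sketch, assuming two local checkouts and assuming that langdata holds one directory per language code; the function name `unmatched_models` and both directory arguments are hypothetical and not part of any Tesseract tooling.

```python
from __future__ import annotations

from pathlib import Path


def unmatched_models(tessdata_dir: Path, langdata_dir: Path) -> list[str]:
    """Return traineddata model names with no same-named directory in langdata.

    Assumption: langdata contains one sub-directory per language code
    (for example "deu" or "frk"). Both paths are local checkouts.
    """
    # Names of the per-language source directories.
    sources = {p.name for p in langdata_dir.iterdir() if p.is_dir()}
    # Search recursively so models in sub-directories are included too.
    models = sorted(tessdata_dir.rglob("*.traineddata"))
    return [
        # Report the path relative to the checkout, without the file suffix.
        m.relative_to(tessdata_dir).with_suffix("").as_posix()
        for m in models
        if m.stem not in sources
    ]
```

Calling `unmatched_models(Path("tessdata_best"), Path("langdata"))` on real checkouts (both paths are placeholders) would list every model whose name has no counterpart directory, which is the situation reported for the files in scripts/.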

@Shreeshrii
Contributor

Also see tesseract-ocr/tessdata#65

@mikegerber
Author

Is langdata obsolete as langdata_lstm exists?

@Shreeshrii
Contributor

langdata files are appropriate for Tesseract 3 or for the legacy/base models used with Tesseract 4. They can also be used for fine-tuning, which requires a smaller input training text.

@stweil
Member

stweil commented Jun 13, 2019

As @Shreeshrii already said, langdata_lstm is for LSTM models while langdata is for legacy models. Both kinds of models are still used.

The script models are mixtures of different languages. script/Fraktur, for example, combines enm+frm+frk+ita_old+spa_old.
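
To make the naming concrete, here is a small sketch that splits such a `+`-joined list into its component language codes and builds a `tesseract` command line for a given model name. The helper names `split_model_spec` and `build_tesseract_command` are hypothetical. It also assumes that the tessdata directory in use contains a `script/` sub-directory so that `-l script/Fraktur` resolves. Note that a script model is a single trained model, which is not the same as loading several language models at run time with `-l a+b`.

```python
from __future__ import annotations


def split_model_spec(spec: str) -> list[str]:
    """Split a '+'-joined list of language codes into its parts."""
    return [part for part in spec.split("+") if part]


def build_tesseract_command(
    image: str,
    output_base: str,
    model: str,
    tessdata_dir: str | None = None,
) -> list[str]:
    """Build (but do not run) a tesseract command line for one model name."""
    # Basic CLI shape: tesseract IMAGE OUTPUTBASE -l LANG
    cmd = ["tesseract", image, output_base, "-l", model]
    if tessdata_dir is not None:
        # Optional: point tesseract at a specific model directory.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Component languages named in the comment above.
components = split_model_spec("enm+frm+frk+ita_old+spa_old")

# Command that would use the combined script model (file names are placeholders).
command = build_tesseract_command("page.png", "page", "script/Fraktur")
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where Tesseract and the model file are installed.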

I fixed the description for 4.00 frk in the Wiki. The other Wiki issues are still open.

@Akossimon

Akossimon commented Oct 1, 2019

Fraktur OCR with Tesseract is what I am looking for. I installed VietOCR v5.5.2 and Tesseract 4.1.0 on my Mac, and now I am trying to find help on how to train it better, because there are too many OCR errors.

How would I go about training the software? Can anyone help?

I am a complete beginner, sadly, and I do not even know how I managed to install the two components so far. This training step is not explained anywhere.

Any pointer in the right direction would be greatly appreciated.

@stweil
Member

stweil commented Nov 11, 2019

In the meantime, newer Fraktur models have become available. The training process for those models is described in the Wiki.

As soon as the training is finished, I'll add the results to tessdata_contrib.

@stweil
Member

stweil commented Jan 24, 2020

@mikegerber, can we close this issue?

@stweil closed this as completed May 17, 2023