Best traineddata feedback - Fraktur #65

Open
stweil opened this issue Aug 1, 2017 · 4 comments

Comments

@stweil
Member

stweil commented Aug 1, 2017

From issue #62:

The new files include two files for German Fraktur: best/Fraktur.traineddata and best/frk.traineddata. According to my first tests, both are better than the old deu_frak.traineddata and much better than the old frk.traineddata. There is no clear winner between the two new files: in some cases -l Fraktur gives better results, in others -l frk is better. Even a 3.05-based Fraktur model is still better for some words, but in general the new LSTM-based models win.

Ray, it would be interesting to know how the training of the two new Fraktur traineddata files differed. Did they use different fonts / training material / dictionaries?

@stweil
Member Author

stweil commented Aug 1, 2017

The new best/Fraktur.traineddata contains a word list (dictionary) with 897964 entries. It can be extracted like this:

combine_tessdata -u /usr/local/share/tessdata/Fraktur.traineddata Fraktur.
dawg2wordlist Fraktur.lstm-unicharset Fraktur.lstm-word-dawg wordlist

A short (still incomplete) review of that list shows lots of issues:

  • At least the important section sign § (the German paragraph sign) is missing from that list, and maybe other characters, too.
  • The list contains lots of strange "words", for example °*°*° or ©.
  • Many words (but not all) occur twice, once in their normal case and once in upper case.
  • Many entries are domain names like youtube.com, Youtube.com, YouTube.com, YouTube.COM and YOUTUBE.COM. None of those entries is common in the historic texts that are typically set in Fraktur.
  • The list contains modern words like Internet, which typically don't occur in Fraktur texts.
  • Many words seem to be Dutch, French or from other languages, not German.
  • The list contains words which are definitely wrong, for example Abhiingigkeit, Abhngigkeit or Abh/ngigkeit instead of Abhängigkeit.
  • Many words are wrong because they confuse ß and B. Example: blaB (wrong) instead of blaß (correct). See also previous commits like this one for langdata.

2017-09-11

  • The wordlist includes words with ii instead of the correct ü (for example "fiir" instead of "für").
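Some of the issues listed above can be flagged mechanically. The following is a minimal Python sketch, not part of the original report: the function names, the issue labels and the short list of top-level domains are assumptions made for illustration. It takes the entries of the extracted word list and reports symbol-only entries, domain names, uppercase duplicates, a probable B for ß, and "ii" in place of an umlaut. The last check only fires when the correctly spelled word is also in the list. Modern or non-German words cannot be found this way.

```python
import re
import unicodedata

# Assumption: a small, hand-picked set of top-level domains; extend as needed.
DOMAIN_RE = re.compile(r"\.(com|org|net|de)$", re.IGNORECASE)
# A lowercase letter directly followed by "B" often means ß was read as B.
ESZETT_AS_B_RE = re.compile(r"[a-zäöü]B")


def has_letter(word):
    """True if at least one character is a Unicode letter."""
    return any(unicodedata.category(ch).startswith("L") for ch in word)


def audit(words):
    """Map each suspicious entry to a list of heuristic issue labels."""
    vocabulary = set(words)
    # Uppercase forms that can be derived from some other entry.
    derivable_upper = {w.upper() for w in vocabulary if w.upper() != w}
    report = {}
    for word in sorted(vocabulary):
        issues = []
        if not has_letter(word):
            issues.append("no-letters")
        if DOMAIN_RE.search(word):
            issues.append("domain")
        if word in derivable_upper:
            issues.append("uppercase-duplicate")
        if ESZETT_AS_B_RE.search(word):
            issues.append("eszett-as-B")
        # "ii" standing in for an umlaut, detectable only if the correct
        # spelling is also in the list.
        if "ii" in word and any(
            word.replace("ii", umlaut) in vocabulary for umlaut in "äöü"
        ):
            issues.append("ii-for-umlaut")
        if issues:
            report[word] = issues
    return report
```

To run it on the real data, read the file written by dawg2wordlist (named wordlist in the command above) as UTF-8, one entry per line, and pass the stripped lines to audit().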

@amitdo

amitdo commented Aug 1, 2017

Many words (but not all) occur twice, once in their normal case and once in upper case.

Same as in the old (and most likely new) eng.traineddata. Seems to be normal.

@stweil
Member Author

stweil commented Aug 1, 2017

In this case, "normal" has unwanted effects. Tesseract uses those entries when deciding on OCR results, and I see many of those uppercase words in my real OCR results. In most cases they are completely wrong (see for example these historic texts with COMPUTER).

If there is a need for uppercase words in some rare cases, I'd expect that those words could be generated programmatically from the normal form. I see no need to fill the word list with them.
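As a rough illustration of that idea, here is a Python sketch that removes every entry which is just the uppercase form of another entry, since such a form could be regenerated with upper() when needed. The function name is hypothetical, and this is not how Tesseract or the langdata tooling handles case. Entries that exist only in uppercase (for example abbreviations) are kept.

```python
def drop_uppercase_duplicates(words):
    """Remove entries that are the uppercase form of another entry.

    The relative order of the remaining entries is preserved.
    """
    vocabulary = set(words)
    # Uppercase forms that can be generated from a differently cased entry.
    derivable = {w.upper() for w in vocabulary if w.upper() != w}
    return [w for w in words if w not in derivable]
```

Note that Python maps ß to SS when uppercasing, so blaß yields BLASS; a list that spells the uppercase form differently would need extra handling.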

@stweil
Member Author

stweil commented Sep 21, 2017

Important characters missing from Fraktur.lstm-unicharset: section sign (paragraph) §, tilde ~.
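A check like this can be scripted. The Python sketch below rests on an assumption about the unicharset text format: the first line is the number of entries, and each following line starts with the character(s) of one entry, followed by whitespace and further properties. The function names are made up for this example.

```python
def read_unichars(lines):
    """Collect the first field of every entry line of a unicharset file.

    Assumption: the first line holds the entry count and every following
    line starts with the character(s) of one entry, followed by a space.
    """
    entries = set()
    for line in list(lines)[1:]:
        fields = line.split()
        if fields:
            entries.add(fields[0])
    return entries


def missing_characters(lines, required):
    """Return the required characters that have no unicharset entry."""
    present = read_unichars(lines)
    return [ch for ch in required if ch not in present]
```

For the file extracted above, open Fraktur.lstm-unicharset with encoding="utf-8" and call missing_characters(file_object, "§~"); any character it returns has no entry of its own in the unicharset.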
