Best traineddata feedback - Fraktur #65

stweil · 2017-08-01T13:25:28Z

From issue #62:

The new files include two files for German Fraktur: best/Fraktur.traineddata and best/frk.traineddata. According to my first tests, both are better than the old deu_frak.traineddata and much better than the old frk.traineddata. There is no clear winner between the two new files: in some cases -l Fraktur gives better results, in others -l frk is better. Even a 3.05-based Fraktur model is still better for some words, but in general the new LSTM-based models win.

Ray, it would be interesting to know the training differences of the two new Fraktur traineddata files. Did they use different fonts / training material / dictionaries?

stweil · 2017-08-01T13:43:28Z

The new best/Fraktur.traineddata contains a word list (dictionary) with 897964 entries. It can be extracted like this:

combine_tessdata -u /usr/local/share/tessdata/Fraktur.traineddata Fraktur.
dawg2wordlist Fraktur.lstm-unicharset Fraktur.lstm-word-dawg wordlist

A short (still incomplete) review of that list shows lots of issues:

At least the important paragraph character § (and maybe others, too) is missing from that list.
The list contains lots of strange "words", for example °*°*° or Â©.
Many words (but not all) occur twice, once in their normal case and once in upper case.
Many entries are root domains like youtube.com, Youtube.com, YouTube.com, YouTube.COM and YOUTUBE.COM. None of those entries is common in historic texts, which are where Fraktur is typically used.
The list contains modern words like Internet which typically don't occur in Fraktur texts.
Many words seem to be Dutch, French or from other languages, not German.
The list contains words which are definitely wrong, for example Abhiingigkeit, Abhngigkeit or Abh/ngigkeit instead of Abhängigkeit.
Many words are wrong because they confuse ß and B. Example: blaB (wrong) instead of blaß (correct). See also previous commits like this one for langdata.

2017-09-11

The wordlist includes words with ii instead of the correct ü (for example "fiir" instead of "für").
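
Several of the problems listed above can be spotted mechanically. The following is a minimal sketch, not part of Tesseract or its training tools: the patterns and the function names classify and review are illustrative assumptions, and the "ii" check only yields candidates, because legitimate words such as "variieren" also contain "ii".

```python
import re

# Heuristic patterns. All of them are illustrative assumptions,
# not rules used by Tesseract itself.
DOMAIN_RE = re.compile(r"\.(com|org|net|de)$", re.IGNORECASE)
# Two or more characters, none of them a digit or a German letter.
NO_GERMAN_LETTER_RE = re.compile(r"^[^0-9A-Za-zÄÖÜäöüß]{2,}$")
# A capital B right after a lowercase letter at the end of a word
# is often a misread ß, as in "blaB".
ESZETT_AS_B_RE = re.compile(r"[a-zäöü]B$")


def classify(word):
    """Return the reasons why a word list entry looks suspicious."""
    reasons = []
    if DOMAIN_RE.search(word):
        reasons.append("domain")
    if NO_GERMAN_LETTER_RE.match(word):
        reasons.append("no-german-letters")
    if ESZETT_AS_B_RE.search(word):
        reasons.append("B-for-ß")
    if "ii" in word:
        # Only a candidate: "ii" may stand for ü or ä, or be legitimate.
        reasons.append("ii-for-umlaut?")
    return reasons


def review(words):
    """Map each suspicious entry to its reasons; clean entries are omitted."""
    report = {}
    for word in words:
        reasons = classify(word)
        if reasons:
            report[word] = reasons
    return report
```

Assuming the extracted file is named wordlist as in the commands above, review(open("wordlist", encoding="utf-8").read().split()) would produce a report that still needs manual checking.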

amitdo · 2017-08-01T14:07:06Z

> Many words (but not all) occur twice, once in their normal case and once in upper case.

Same as in the old (and most likely new) eng.traineddata. Seems to be normal.

stweil · 2017-08-01T14:27:40Z

In this case "normal" leads to unwanted effects. Tesseract uses those entries when deciding on OCR results, and I see many of those uppercase words in my real OCR results. In most cases they are completely wrong (see for example these historic texts with COMPUTER).

If there is a need for uppercase words in some rare cases, I'd expect that those words could be generated programmatically from the normal form. I see no need to fill the word list with them.
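
As a rough illustration of that idea, the uppercase duplicates could be removed from an extracted word list before it is repacked, since the uppercase form can always be derived with upper(). This is only a sketch under that assumption; drop_uppercase_duplicates is a hypothetical helper, not an existing Tesseract tool.

```python
def drop_uppercase_duplicates(words):
    """Remove all-uppercase entries whose normal-case form is also listed.

    Entries that exist only in uppercase (for example abbreviations)
    and entries without any case (for example "§") are kept.
    """
    # Uppercase forms of every entry that is not already fully uppercase.
    derivable = {w.upper() for w in words if w != w.upper()}
    return [w for w in words if w != w.upper() or w not in derivable]
```

Note that Python uppercases ß to SS, so an entry like BLASS is treated as derivable from blaß.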

stweil · 2017-09-21T16:06:29Z

List of important missing characters in Fraktur.lstm-unicharset: paragraph §, tilde ~.
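
A check like this can be scripted. The sketch below assumes the usual text layout of an extracted unicharset file, where the first line holds the number of entries and every following line starts with the symbol, followed by a space and its properties. The function names are hypothetical.

```python
def unicharset_symbols(lines):
    """Collect the first field of every entry line, skipping the count line."""
    symbols = set()
    for line in lines[1:]:
        first_field = line.split(" ")[0]
        if first_field:
            symbols.add(first_field)
    return symbols


def missing_symbols(lines, required):
    """Return the required symbols that have no entry, in the given order."""
    present = unicharset_symbols(lines)
    return [symbol for symbol in required if symbol not in present]
```

For example, missing_symbols(open("Fraktur.lstm-unicharset", encoding="utf-8").read().splitlines(), ["§", "~"]) would list whichever of the two characters has no entry.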

stweil mentioned this issue Aug 2, 2017

Added best traineddatas for 4.00 alpha #62

Open

Shreeshrii mentioned this issue Jun 3, 2019

Source of scripts/Fraktur etc. tesseract-ocr/tessdata_best#39

Closed
