Best Traineddata Feedback - Persian #70

Open
Shreeshrii opened this issue Aug 5, 2017 · 33 comments
@Shreeshrii
Contributor

Ref: tesseract-ocr/langdata#76 (comment)

copied below

Hello
I'm a software engineering student and I use the Tesseract OCR engine in a university project. For Persian, the traineddata file produced by training Tesseract 4.00 with the LSTM method gives good results on Arial fonts, but it does not give good results on some Persian-specific fonts. So my questions are:
1- Did you use specific fonts such as B Nazanin, B Roya, etc. when training Tesseract 4.00 with LSTM?
2- If they were not used, how can we use these fonts to get better results?
I prepared a text in which every letter form is repeated 10, 15, or more times. I also ran the Tesseract 3.05 training process on this text, but the output did not improve.
To get good results for Persian with the Tesseract OCR engine, we need your experience and your help.
Thank you for your attention.
Sincerely.

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 5, 2017

@aidinkrmz

Are B Nazanin and B Roya Unicode fonts? Please try OCR with the latest BEST traineddata.

@Shreeshrii
Contributor Author

Also see
#3

tesseract-ocr/langdata#26

@Shreeshrii
Contributor Author

@theraysmith

Does language-specific.sh have the current list of fonts used by your BEST training?

@ebraminio

ebraminio commented Aug 5, 2017

Are B Nazanin and B Roya Unicode fonts? Please try OCR with the latest BEST traineddata.

Those names usually indicate font families. If you can train with a new set of fonts for Persian, download these from this page, which provides OFL-licensed fonts that are heavily used in the Persian TeX community: XB Zar, XB Roya (equivalent to B Roya), and XB Kayhan (roughly equivalent to B Nazanin).

Also, here is another FOSS-licensed Persian font bundle that contains both Roya and Nazanin (under the name Nazli). These have less glyph coverage but are somewhat more standards-compliant.

You can also use B Nazanin and B Roya themselves, but they are not released under a FOSS license, if that matters.

Please try OCR with the latest BEST traineddata.

Has LSTM-based Persian traineddata been released recently? How can we have a look?

@aidinkrmz

@Shreeshrii Of course they are Unicode fonts, and more than 90% of texts use this font family, e.g. B Nazanin, B Yaghoot, B Zar.

@aidinkrmz

@ebraminio Hello, it seems you are Iranian, so let us share our problem with you. The traineddata we currently have only works correctly with the Arial font. Texts written in Iranian fonts such as B Nazanin or B Zar do not give correct results. What do you think the remedy is? Is there a solution?

@reza1615

reza1615 commented Aug 5, 2017

[image: 03]
I tested version 4 on the attached image. It has these problems:
1- It doesn't recognize ZWNJ.
2- It doesn't recognize ●.
3- It has problems with ligatures such as لا.
4- The image's font is B Nazanin.
5- It doesn't recognize ، ؛ ؟ (\u060C \u061B \u061F).
6- Its dictionary is incomplete; for example, it doesn't recognize ساخت or ناشی. I suggest using the Persian Hunspell data, which is used by Chrome (for more information, look here).
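
The characters named in this report can be pinned down by code point, which makes it easier to search for them in training text or in a unicharset. The following is a minimal sketch using only Python's standard library; the `REPORTED` list and the `describe` helper are illustrative names, not part of Tesseract.

```python
import unicodedata

# Characters reported as unrecognized in the comment above:
# ZWNJ, the bullet, and the Arabic comma, semicolon and question mark.
REPORTED = ["\u200c", "\u25cf", "\u060c", "\u061b", "\u061f"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


for ch in REPORTED:
    print(describe(ch))
```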

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 5, 2017 via email

@aidinkrmz

@reza1615 Use VietOCR 5 alpha; you can get good results with it.

@reza1615

reza1615 commented Aug 5, 2017

@aidinkrmz I also tested VietOCR 5 alpha. It has the same problems (you can test the attached image). VietOCR is only an interface.

@amitdo

amitdo commented Aug 5, 2017

Everyone who tests the new best traineddata should also update Tesseract to the latest commit.

@reza1615

reza1615 commented Aug 5, 2017

Is this the latest version? tesseract-ocr-setup-4.0.0-alpha.20170804.exe. I used this version and downloaded the data from the installation wizard.

@amitdo

amitdo commented Aug 5, 2017

Yes, that is the latest version.

@aidinkrmz

@reza1615 But I tested your picture and got no errors!

@reza1615

reza1615 commented Aug 5, 2017

@aidinkrmz For me, it recognizes ساختمان‌ها as ساختمانها,
● as @,
and لا as ا, for example بالاتری as باالتری.
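
The first pair above looks almost identical on screen because the only difference is the invisible ZWNJ (U+200C). A small check like the following can confirm that an OCR result differs from the expected text only by missing or extra ZWNJ characters. It is a sketch using plain string operations; `differs_only_by_zwnj` is an illustrative helper name.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def differs_only_by_zwnj(expected: str, actual: str) -> bool:
    """True when the strings differ but become equal once ZWNJ is removed."""
    if expected == actual:
        return False
    return expected.replace(ZWNJ, "") == actual.replace(ZWNJ, "")


expected = "ساختمان\u200cها"  # ground truth, ZWNJ before the plural suffix
ocr_out = "ساختمانها"         # OCR output reported in the thread
print(differs_only_by_zwnj(expected, ocr_out))
```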

@aidinkrmz

@reza1615 Dear Mr. Mohammad Reza, a 2-3% error rate is acceptable; it cannot be expected to recognize everything perfectly, and the "ساختمان‌ها" cases also fall within that 2-3% of errors.

@reza1615

reza1615 commented Aug 5, 2017

@aidinkrmz You said it doesn't have any errors! Now you say it has 2-3% errors!
I didn't say tesseract-ocr has a fatal bug. I said it should handle these cases. If these bugs can be fixed, why shouldn't we fix them in the program? Also, it doesn't recognize ZWNJ and it isn't a minor bug.

@aidinkrmz

@reza1615 We are talking about the traineddata file, and the traineddata file doesn't have any problem.

@Shreeshrii
Contributor Author

@reza1615

it doesn't recognize ZWNJ and it isn't a minor bug.

Please elaborate on that with some examples and any suggestions you have for fixing. Thanks!

@reza1615

reza1615 commented Aug 5, 2017

[image: 05]
@Shreeshrii In this image:
Yellow: ZWNJ not recognized
Red: problem with the ligature لا
Green: problem with ●
Blue: . (dot) confused with ۰ (Persian 0)
Purple: incorrect word

I attached the output file for comparison.

@reza1615

reza1615 commented Aug 5, 2017

In my opinion, @theraysmith trained the data on texts that don't contain ZWNJ (U+200C). You can use fa.wikipedia Featured articles, which use correct Persian orthography; there are many texts here and here. You can collect the articles from here.

@ebraminio

@Shreeshrii: I find the LSTM results quite convincing here. Keep up the good work :)

@Shreeshrii
Contributor Author

Question from Ray in tesseract-ocr/langdata#72

Anyone know which digits are needed for the other Arabic languages?
kur_ara, pus, uig
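
For context on the digit question, Unicode has two separate decimal digit ranges in the Arabic block: Arabic-Indic (U+0660 to U+0669) and Extended Arabic-Indic (U+06F0 to U+06F9), the latter being the Persian forms. The sketch below only lists the two ranges and normalizes either to ASCII; it does not answer which set each of the listed languages needs. `to_ascii_digits` is an illustrative helper name.

```python
import unicodedata

ARABIC_INDIC = [chr(c) for c in range(0x0660, 0x066A)]           # ٠ .. ٩
EXTENDED_ARABIC_INDIC = [chr(c) for c in range(0x06F0, 0x06FA)]  # ۰ .. ۹


def to_ascii_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        out.append(ch if value is None else str(value))
    return "".join(out)


print("".join(ARABIC_INDIC), "".join(EXTENDED_ARABIC_INDIC))
print(to_ascii_digits("\u06f1\u06f2\u0663"))
```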

@khosrobeygizohre

Hi Shree, I tested the BEST fas.traineddata, but it has a problem: it can't recognize "لا". I fine-tuned for "لا" and still had this problem. As for xheight, I used Arabic.xheight for Persian, but after creating the unicharset, this file had been cleared.

@Shreeshrii
Contributor Author

@theraysmith Any update regarding the new training for RTL languages?

@AbdelsalamHaa

AbdelsalamHaa commented May 4, 2018

Hi guys. I'm trying to find traineddata for Arabic numbers only. Can anybody guide me to where I can find it? Thank you.

I'm using Tesseract 4 with Visual Studio 2017 (C++).
I tried the normal ara.traineddata, but the results don't look right at all.
[image]

The results:
الرقم /1 ١ ١5 .//ا1١1؟؟
@theraysmith
@Shreeshrii

@zeinabfarhoudi

zeinabfarhoudi commented May 14, 2018

Hi. I tested the BEST fas.traineddata, but it had some errors. For example, it couldn't recognize the character 'ی' in some fonts. I integrated some specific fonts such as "B Nazanin", "B Zar", and "B Lotus" by fine-tuning the pre-trained model. After testing, the new .traineddata recognized some fonts better than the BEST fas.traineddata, but it couldn't recognize ZWNJ, whereas the BEST fas.traineddata could. Now my questions are:
1- Which fonts did you use for training the BEST fas.traineddata?
2- How should I train tesseract-ocr 4.0 to recognize ZWNJ in Persian?

@Shreeshrii
Contributor Author

@AbdelsalamHaa

Did you try with script/Arabic.traineddata?

@Shreeshrii
Contributor Author

Shreeshrii commented May 14, 2018

@Farhodi What training text did you use for fine-tuning? Did it have any ZWNJ in it?

Take a look at the unicharset from your trained data and compare with the one in the repository.

Make sure you have all needed characters in your training text.
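
One way to check a training text for the characters discussed in this thread is to count them directly. The sketch below uses only the standard library; the `REQUIRED` table and the helper names are illustrative, and the file name in the comment is an assumption about where the training text lives.

```python
from collections import Counter

# Characters this thread reports as problematic (an illustrative selection).
REQUIRED = {
    "\u200c": "ZWNJ",
    "\u060c": "Arabic comma",
    "\u061b": "Arabic semicolon",
    "\u061f": "Arabic question mark",
}


def count_required(text: str) -> dict:
    """Map each required character's label to its number of occurrences."""
    counts = Counter(text)
    return {label: counts[ch] for ch, label in REQUIRED.items()}


def missing(text: str) -> list:
    """Labels of required characters that never occur in the text."""
    return [label for label, n in count_required(text).items() if n == 0]


# Assumed usage with a real file (the path is hypothetical):
#   text = open("fas.training_text", encoding="utf-8").read()
sample = "ساختمان\u200cها\u060c"
print(count_required(sample))
print(missing(sample))
```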

Regarding the font list and training done originally by Ray Smith, we are awaiting updates to langdata.

@reza6966

Hi, I tested the new version of Tesseract (4 beta) on Persian. The results are good, but there are some errors.
For example:

  1. Characters that share the same shape but differ in the position or number of dots are confused (e.g. بـ تـ ثـ یـ or ز ر ژ).
  2. In some cases, when the same word occurs several times in a document, the results for those occurrences differ (e.g. the word "نویسه").
    [image: 0006]
  3. I think dictionary correction is not applied at the end of the process. Is that true?
  4. How can we train more fonts?

Thanks
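
Reports like the dot-confusion case above are easier to act on when the substitutions are counted. A rough sketch, assuming a ground-truth string and the matching OCR output are available, is to align the two with Python's `difflib` and tally every non-matching region. `confusions` is an illustrative helper, and this is a simple character alignment rather than a proper OCR evaluation tool.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusions(truth: str, ocr: str) -> Counter:
    """Count (truth_span, ocr_span) pairs for every non-matching region."""
    pairs = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty ocr_span means the characters were dropped entirely.
            pairs[(truth[i1:i2], ocr[j1:j2])] += 1
    return pairs


# Illustrative strings only: ز read as ر, and a dropped ZWNJ.
for pair, n in confusions("زر\u200cب", "ررب").most_common():
    print(pair, n)
```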

@zeinabfarhoudi

@Shreeshrii I used the same training text from the langdata "fas" folder for fine-tuning and just added new fonts for training. Also, I couldn't find fas.unicharset in the langdata repository to compare with my .unicharset.
The .unicharset I used is as follows:

108
NULL 0 NULL 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
و 1 0,68,137,238,65,290,0,27,62,256 Arabic 3 13 3 و # و [648 ]x
ه 1 55,123,147,255,35,181,6,64,48,222 Arabic 4 13 4 ه # ه [647 ]x
ک 1 47,121,200,255,131,288,0,45,124,305 Arabic 5 13 5 ک # ک [6a9 ]x
ن 1 0,88,163,255,68,321,0,52,76,354 Arabic 6 13 6 ن # ن [646 ]x
ی 1 0,71,148,225,95,253,0,45,103,279 Arabic 7 13 7 ی # ی [6cc ]x
ا 1 26,117,200,255,11,181,7,82,33,222 Arabic 8 13 8 ا # ا [627 ]x
خ 1 0,66,172,255,92,262,2,37,84,290 Arabic 9 13 9 خ # خ [62e ]x
س 1 0,64,140,228,123,493,0,50,132,523 Arabic 10 13 10 س # س [633 ]x
ع 1 0,64,148,255,98,239,2,37,81,276 Arabic 11 13 11 ع # ع [639 ]x
ض 1 0,64,174,255,131,619,0,50,132,654 Arabic 12 13 12 ض # ض [636 ]x
م 1 0,64,134,241,51,272,0,46,56,313 Arabic 13 13 13 م # م [645 ]x
ل 1 0,96,200,255,62,328,0,50,71,332 Arabic 14 13 14 ل # ل [644 ]x
ف 1 44,125,202,255,113,339,0,47,123,378 Arabic 15 13 15 ف # ف [641 ]x
ر 1 0,63,137,224,45,297,0,22,59,244 Arabic 16 13 16 ر # ر [631 ]x
پ 1 0,42,142,217,113,258,2,50,123,288 Arabic 17 13 17 پ # پ [67e ]x
د 1 49,123,163,250,43,467,0,70,59,503 Arabic 18 13 18 د # د [62f ]x
ت 1 58,123,170,255,113,339,2,50,123,378 Arabic 19 13 19 ت # ت [62a ]x
. 10 12,108,64,140,18,52,9,77,52,193 Common 20 6 20 . # . [2e ]p
ج 1 0,64,133,255,92,262,2,37,84,290 Arabic 21 13 21 ج # ج [62c ]x
ق 1 0,79,179,255,84,310,0,52,88,345 Arabic 22 13 22 ق # ق [642 ]x
ش 1 0,64,196,255,123,493,0,50,132,523 Arabic 23 13 23 ش # ش [634 ]x
ز 1 0,63,167,255,45,298,0,22,59,242 Arabic 24 13 24 ز # ز [632 ]x
: 10 12,108,157,255,18,58,11,77,52,193 Common 25 6 25 : # : [3a ]p
ب 1 0,71,140,224,113,339,0,50,123,378 Arabic 26 13 26 ب # ب [628 ]x
آ 1 26,117,230,255,36,161,0,58,33,198 Arabic 27 13 27 آ # آ [622 ]x
ي 1 0,56,148,255,95,431,0,45,103,467 Arabic 28 13 28 ي # ي [64a ]x
گ 1 47,125,208,255,131,289,0,45,132,305 Arabic 29 13 29 گ # گ [6af ]x
, 10 0,72,69,140,21,62,8,65,39,193 Common 30 6 30 , # , [2c ]p
غ 1 0,64,196,255,98,239,2,37,81,276 Arabic 31 13 31 غ # غ [63a ]x
ح 1 0,64,133,255,92,262,2,37,84,290 Arabic 32 13 32 ح # ح [62d ]x
= 0 86,150,160,244,90,218,3,33,99,262 Common 33 10 33 = # = [3d ]
} 10 0,67,210,255,37,125,4,46,54,193 Common 34 10 86 } # } [7d ]p
ك 1 49,123,203,255,91,451,0,50,103,483 Arabic 35 13 35 ك # ك [643 ]x
/ 10 12,102,224,255,43,166,0,29,54,193 Common 36 6 36 / # / [2f ]p
٧ 8 58,125,181,255,70,211,0,65,87,270 Common 37 5 37 ٧ # ٧ [667 ]0
٨ 8 58,123,179,255,70,235,0,65,88,270 Common 38 5 38 ٨ # ٨ [668 ]0
٣ 8 55,121,184,255,71,235,0,65,88,338 Common 39 5 39 ٣ # ٣ [663 ]0
١ 8 55,121,184,255,13,134,0,110,57,270 Common 40 5 40 ١ # ١ [661 ]0
٤ 8 60,121,183,255,46,238,0,62,58,270 Common 41 5 41 ٤ # ٤ [664 ]0
ژ 1 0,63,192,255,71,190,0,22,59,193 Arabic 42 13 42 ژ # ژ [698 ]x
چ 1 0,29,133,213,92,192,4,37,84,213 Arabic 43 13 43 چ # چ [686 ]x
ۀ 1 59,121,206,255,40,134,0,20,48,165 Arabic 44 13 44 ۀ # ۀ [6c0 ]x
- 10 76,186,109,216,42,121,4,55,51,193 Common 45 3 45 - # - [2d ]p
ظ 1 40,123,200,255,110,574,0,19,113,610 Arabic 46 13 46 ظ # ظ [638 ]x
ؤ 1 0,68,190,255,70,290,0,27,62,266 Arabic 47 13 47 ؤ # ؤ [624 ]x
، 10 30,121,105,221,20,67,5,72,43,193 Common 48 6 48 ، # ، [60c ]p
ص 1 0,64,143,249,131,619,0,50,132,654 Arabic 49 13 49 ص # ص [635 ]x
۸ 8 60,121,179,255,70,176,0,30,88,193 Arabic 50 2 50 ۸ # ۸ [6f8 ]0
% 10 14,102,204,255,88,377,0,22,132,397 Common 51 4 51 % # % [25 ]p
0 8 14,100,204,255,65,212,6,29,78,249 Common 52 2 52 0 # 0 [30 ]0
ط 1 40,123,200,255,110,574,0,20,113,610 Arabic 53 13 53 ط # ط [637 ]x
۱ 8 64,121,184,255,22,76,24,75,88,193 Arabic 54 2 54 ۱ # ۱ [6f1 ]0
۹ 8 64,121,181,255,56,152,15,46,88,193 Arabic 55 2 55 ۹ # ۹ [6f9 ]0
۲ 8 64,121,183,255,50,150,17,42,88,193 Arabic 56 2 56 ۲ # ۲ [6f2 ]0
۰ 8 84,155,139,215,24,68,32,77,88,193 Arabic 57 2 57 ۰ # ۰ [6f0 ]0
۷ 8 58,121,181,255,70,176,0,37,88,193 Arabic 58 2 58 ۷ # ۷ [6f7 ]0
( 10 0,85,197,255,31,108,8,74,63,193 Common 59 10 66 ( # ( [28 ]p
] 10 0,73,207,255,31,112,0,53,61,193 Common 60 10 62 ] # ] [5d ]p
2 8 17,108,204,255,70,215,3,24,78,249 Common 61 2 61 2 # 2 [32 ]0
[ 10 0,73,209,255,31,112,13,72,61,193 Common 62 10 60 [ # [ [5b ]p
7 8 17,108,201,255,67,215,4,37,78,249 Common 63 2 63 7 # 7 [37 ]0
ذ 1 49,123,197,255,43,467,0,70,59,503 Arabic 64 13 64 ذ # ذ [630 ]x
؟ 10 0,123,188,255,35,181,5,48,69,222 Common 65 13 65 ؟ # ؟ [61f ]p
) 10 0,85,197,255,31,108,0,55,63,193 Common 66 10 59 ) # ) [29 ]p
1 8 17,108,204,255,42,136,18,43,78,249 Common 67 2 67 1 # 1 [31 ]0
3 8 14,100,204,255,59,215,4,30,78,249 Common 68 2 68 3 # 3 [33 ]0
8 8 14,100,204,255,59,218,6,29,78,249 Common 69 2 69 8 # 8 [38 ]0
4 8 17,108,204,255,71,225,1,22,78,249 Common 70 2 70 4 # 4 [34 ]0
ة 1 55,123,190,255,40,181,0,60,48,222 Arabic 71 13 71 ة # ة [629 ]x
ث 1 58,121,192,255,113,339,2,50,123,378 Arabic 72 13 72 ث # ث [62b ]x
۵ 8 60,126,177,255,61,161,4,32,88,193 Arabic 73 2 73 ۵ # ۵ [6f5 ]0
٠ 8 56,165,126,227,24,99,8,98,56,270 Common 74 5 74 ٠ # ٠ [660 ]0
» 10 17,140,166,255,70,158,4,43,78,249 Common 75 10 80 » # » [bb ]p
۳ 8 64,121,184,255,71,166,8,27,88,193 Arabic 76 2 76 ۳ # ۳ [6f3 ]0
ء 1 32,112,129,220,43,163,3,67,62,199 Arabic 77 13 77 ء # ء [621 ]x
ى 1 0,100,148,255,95,431,0,45,103,467 Arabic 78 13 78 ى # ى [649 ]x
أ 1 26,117,248,255,29,148,0,67,33,193 Arabic 79 13 79 أ # أ [623 ]x
« 10 17,140,166,255,68,161,4,43,78,249 Common 80 10 75 « # « [ab ]p
٢ 8 55,121,183,255,50,214,0,62,77,270 Common 81 5 81 ٢ # ٢ [662 ]0
5 8 14,100,201,255,62,218,6,37,78,249 Common 82 2 82 5 # 5 [35 ]0
ئ 1 0,100,185,255,95,431,0,45,103,467 Arabic 83 13 83 ئ # ئ [626 ]x
* 10 79,248,163,255,68,143,0,41,78,193 Common 84 10 84 * # * [2a ]p
۴ 8 64,121,190,255,71,157,7,38,88,193 Arabic 85 2 85 ۴ # ۴ [6f4 ]0
{ 10 0,67,210,255,37,128,2,46,54,193 Common 86 10 34 { # { [7b ]p
! 10 12,108,193,255,19,61,15,82,49,193 Common 87 10 87 ! # ! [21 ]p
9 8 14,100,204,255,65,212,6,29,78,249 Common 88 2 88 9 # 9 [39 ]0
٦ 8 55,124,181,255,57,232,0,50,81,270 Common 89 5 89 ٦ # ٦ [666 ]0
6 8 14,100,204,255,65,215,6,29,78,249 Common 90 2 90 6 # 6 [36 ]0
؛ 10 60,119,140,255,20,67,2,72,43,193 Common 91 13 91 ؛ # ؛ [61b ]p
۶ 8 60,121,183,255,52,128,7,41,88,193 Arabic 92 2 92 ۶ # ۶ [6f6 ]0
٥ 8 58,154,163,255,49,241,0,65,67,270 Common 93 5 93 ٥ # ٥ [665 ]0
+ 0 39,111,184,255,76,218,3,38,99,262 Common 94 3 94 + # + [2b ]
# 10 17,102,204,255,74,242,0,23,78,249 Common 95 4 95 # # # [23 ]p
… 10 12,102,64,124,114,273,8,37,132,333 Common 96 10 96 ... # … [2026 ]p
٬ 10 62,236,164,255,21,53,9,77,52,193 Arabic 97 5 97 ٬ # ٬ [66c ]p
\ 10 12,102,207,255,42,154,0,57,43,193 Common 98 10 98 \ # \ [5c ]p
" 10 139,254,204,255,42,128,9,55,64,193 Common 99 10 99 " # " [22 ]p
& 10 12,100,192,255,83,266,4,27,121,299 Common 100 10 100 & # & [26 ]p
٫ 10 15,98,97,167,33,103,6,61,52,193 Arabic 101 5 101 ٫ # ٫ [66b ]p
? 10 12,108,204,255,56,195,4,43,70,249 Common 102 10 102 ? # ? [3f ]p
< 0 47,109,188,255,49,218,0,40,78,262 Common 103 10 107 < # < [3c ]
_ 10 0,84,0,102,76,259,0,12,74,249 Common 104 10 104 _ # _ [5f ]p
| 0 0,88,207,255,6,64,12,82,31,193 Common 105 10 105 | # | [7c ]
٪ 10 33,105,213,255,79,205,0,41,101,294 Arabic 106 4 106 ٪ # ٪ [66a ]p
> 0 47,109,188,255,49,222,3,33,78,262 Common 107 10 103 > # > [3e ]

Thanks for your reply

@Shreeshrii
Contributor Author

combine_tessdata -u tessdata_best/fas.traineddata fas.

This will unpack the traineddata file.

Look at fas.lstm-unicharset

That probably has the ZWNJ in it.

You can add a few additional lines to the training text in langdata which have ZWNJ
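
Once the traineddata is unpacked, the comparison suggested above can be scripted. The following is a minimal sketch that assumes the unicharset layout shown earlier in the thread (a count on the first line, then one entry per line whose first space-separated field is the character). The helper names and the fine-tuned file name in the usage comment are hypothetical.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def read_unichars(unicharset_text: str) -> list:
    """Return the first field of every entry, skipping the count line."""
    # Split on "\n" only, so unusual characters inside entries are kept.
    lines = [ln for ln in unicharset_text.split("\n") if ln.strip()]
    return [ln.split(" ", 1)[0] for ln in lines[1:]]


def has_zwnj(unichars: list) -> bool:
    """True if any entry is, or contains, U+200C."""
    return any(ZWNJ in u for u in unichars)


def only_in_first(a: list, b: list) -> list:
    """Entries present in a but absent from b."""
    return sorted(set(a) - set(b))


# Assumed usage (my_fas.lstm-unicharset is a hypothetical file name):
#   best = read_unichars(open("fas.lstm-unicharset", encoding="utf-8").read())
#   mine = read_unichars(open("my_fas.lstm-unicharset", encoding="utf-8").read())
#   print(has_zwnj(best), has_zwnj(mine), only_in_first(best, mine))
```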

@NightMachinery

How do I install the fas traineddata on macOS? Can someone provide the necessary commands to run?

@amitdo

amitdo commented Aug 16, 2022

@NightMachinery,

Please use our forum for asking questions.
