Best Traineddata Feedback - Persian #70
Comments
Are B Nazanin and B Roya Unicode fonts? Please try OCR with the latest BEST traineddata. |
Also see |
Does language-specific.sh have the current list of fonts used by your BEST training? |
Those names usually indicate a family of fonts. If you can train on a new set of fonts for Persian, this page provides OFL-licensed fonts that are used heavily in the Persian TeX community; download these from it: XB Zar, XB Roya (equivalent to B Roya), and XB Kayhan (roughly equivalent to B Nazanin). Here is also another FOSS-licensed Persian font bundle that contains both Roya and Nazanin (under the name Nazli). These have less glyph coverage but are somewhat more standards-compliant. You can use B Nazanin and B Roya themselves as well, but they are not released under a FOSS license, if that matters.
Has LSTM-based Persian traineddata been released recently? How can we have a look? |
@Shreeshrii Of course they are Unicode fonts, and more than 90% of texts use this font family, for example B Nazanin, B Yaghoot, and B Zar. |
@ebraminio Hello, apparently you are Iranian, so let us share our problem with you. The traineddata we currently have works correctly only with the Arial font; texts written in Iranian fonts such as B Nazanin or B Zar do not give correct results and do not work. What do you think can be done? Is there a solution? |
|
Please see https://github.com/tesseract-ocr/tessdata/blob/master/best/fas.traineddata for the 4.0alpha best model for Persian, uploaded by @theraysmith just a few days back.
Your feedback will help him improve the next version of training for the beta release.
ShreeDevi
On Sat, Aug 5, 2017 at 3:27 PM, mohammad reza ***@***.***> wrote:
I tested version 4 on the attached image. It has these problems:
1. It doesn't recognize ZWNJ <https://en.wikipedia.org/wiki/Zero-width_non-joiner>
2. It doesn't recognize ●
3. It has problems with ligatures like لا <https://en.wikipedia.org/wiki/Arabic_alphabet#Ligatures>
|
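When checking the ligature problem reported above, it can help to rule out a pure encoding difference first. Unicode has a single-codepoint presentation form for lam-alef (U+FEFB) as well as the plain two-letter sequence (U+0644 U+0627). Whether Tesseract ever emits the presentation form is an assumption here, not something stated in this thread. A minimal sketch that folds both spellings to the same sequence before comparing OCR output with ground truth (the function names are hypothetical):

```python
import unicodedata

LAM_ALEF_LIGATURE = "\ufefb"    # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
LAM_ALEF_PAIR = "\u0644\u0627"  # LAM followed by ALEF

def fold_presentation_forms(text: str) -> str:
    """Apply NFKC, which maps Arabic presentation forms to base letters.

    NFKC leaves ZWNJ (U+200C) in place, so ZWNJ errors stay visible.
    """
    return unicodedata.normalize("NFKC", text)

def same_after_folding(truth: str, ocr: str) -> bool:
    """True if the two strings differ only by presentation forms."""
    return fold_presentation_forms(truth) == fold_presentation_forms(ocr)
```

If `same_after_folding` returns True for a line that looks wrong in a diff, the mismatch is an encoding variant rather than a recognition error.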
@reza1615 Use VietOCR 5 alpha; you can get good results with it. |
@aidinkrmz I also tested VietOCR 5 alpha. It has the same problem (you can test the attached image). VietOCR is only an interface. |
Everyone who tests the new best traineddata should also update Tesseract to the latest commit. |
Is it the latest version? tesseract-ocr-setup-4.0.0-alpha.20170804.exe. I used this version and downloaded the data from the installation wizard. |
Yes, that version is the latest version. |
@reza1615 But I tested your picture without any errors! |
@aidinkrmz For me it recognizes ساختمان‌ها (with ZWNJ, U+200C) as ساختمانها (without it). |
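The two spellings in the comment above can be hard to tell apart on screen, because ZWNJ is invisible. A small sketch for inspecting OCR output codepoint by codepoint (the helper names are hypothetical, and the sample word is built from escapes so the ZWNJ cannot be lost in copy and paste):

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# "sakhteman" + ZWNJ + "ha" versus the same word without the ZWNJ.
STEM = "\u0633\u0627\u062e\u062a\u0645\u0627\u0646"
SUFFIX = "\u0647\u0627"
with_zwnj = STEM + ZWNJ + SUFFIX
without_zwnj = STEM + SUFFIX

def codepoints(text):
    """List each character as 'U+XXXX NAME' so invisible ones show up."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]

def first_difference(a, b):
    """Index of the first differing character, or None if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))
```

Calling `codepoints(with_zwnj)` on the expected text and on the OCR output side by side shows directly whether U+200C was dropped.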
@reza1615 Dear Mohammad Reza, a 2-3% error rate is manageable; it can't be expected to recognize everything perfectly. That ساختمانها case is also part of the 2-3% error. |
@aidinkrmz You said it doesn't have any errors! Now you say it has 2-3% errors! |
@reza1615 We are talking about the traineddata file, and the traineddata file doesn't have any problem. |
Please elaborate on that with some examples and any suggestions you have for fixing. Thanks! |
I attached the output file to compare them |
In my opinion, @theraysmith trained the data on texts that don't contain ZWNJ (U+200C). You can use fa.wikipedia Featured articles, which use correct Persian orthography; there are many texts here and here. You can collect the articles from here. |
@Shreeshrii: I find the LSTM results very convincing here. Keep up the good work :) |
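The suggestion that the training text lacks ZWNJ can be checked directly. A minimal sketch that counts how often U+200C occurs in a set of lines; the function name is hypothetical, and the file path in the usage note is an assumption about where the training text lives:

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def zwnj_coverage(lines):
    """Count non-blank lines, lines containing ZWNJ, and total ZWNJ occurrences."""
    total_lines = 0
    lines_with_zwnj = 0
    zwnj_count = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        total_lines += 1
        n = line.count(ZWNJ)
        if n:
            lines_with_zwnj += 1
            zwnj_count += n
    return {
        "lines": total_lines,
        "lines_with_zwnj": lines_with_zwnj,
        "zwnj_count": zwnj_count,
    }
```

For example, `zwnj_coverage(open("fas.training_text", encoding="utf-8"))` would report the numbers for a local copy of the training text. A result of zero lines with ZWNJ would support the explanation given in the comment above.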
Question from Ray in tesseract-ocr/langdata#72
|
Hi Shree, I tested BEST fas.traineddata, but it has a problem: it can't recognize "لا". I fine-tuned for "لا" and still had this problem. As for xheight, I used Arabic.xheight for Persian, but after creating the unicharset, this file was cleared. |
@theraysmith Any update regarding the new training for RTL languages? |
Hi guys. I'm trying to find the traineddata for Arabic numbers only. Can anybody guide me to where to find it? Thank you. I'm using Tesseract 4 with Visual Studio 2017 C++. The results |
Hi. I tested BEST fas.traineddata, but it had some errors. For example, it couldn't recognize the 'ی' character in some fonts. I integrated some specific fonts such as "B Nazanin", "B Zar", and "B Lotus" by fine-tuning the pre-trained model. After testing, the new .traineddata could recognize some fonts better than the BEST fas.traineddata, but it couldn't recognize ZWNJ, whereas with BEST fas.traineddata I could recognize ZWNJ. Now my questions are: |
Did you try with script/Arabic.traineddata? |
@Farhodi What training text did you use for fine-tuning? Did it have any ZWNJ in it? Take a look at the unicharset from your trained data and compare it with the one in the repository. Make sure you have all needed characters in your training text. Regarding the font list and the training done originally by Ray Smith, we are awaiting updates to langdata. |
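The comparison suggested above can be sketched as a set difference between the characters in the training text and those listed in a unicharset. This assumes a simplified reading of the unicharset format: the first line is the entry count, each later line starts with the unichar followed by a space, and the special entries `NULL`, `Joined`, and `|Broken|0|1` are not ordinary characters. The function names are hypothetical.

```python
SPECIAL_ENTRIES = {"Joined", "|Broken|0|1"}  # assumed non-character entries

def unicharset_codepoints(unicharset_text):
    """Collect every codepoint that appears in some unichar entry."""
    chars = set()
    for line in unicharset_text.split("\n")[1:]:  # skip the count line
        if not line.strip():
            continue
        unichar = line.split(" ", 1)[0]
        if unichar == "NULL":  # assumed to stand for the space character
            chars.add(" ")
        elif unichar not in SPECIAL_ENTRIES:
            chars.update(unichar)  # a unichar may hold several codepoints
    return chars

def missing_from_unicharset(training_text, unicharset_text):
    """Non-whitespace training-text characters absent from the unicharset."""
    known = unicharset_codepoints(unicharset_text)
    used = {ch for ch in training_text if not ch.isspace()}
    return sorted(used - known)
```

Running it in the other direction (unicharset characters that never occur in the training text) is the same difference with the operands swapped, and would show whether ZWNJ is listed but never trained on.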
@Shreeshrii I used the same training text from the langdata "fas" folder for fine-tuning and just added new fonts for training. Also, I couldn't find fas.unicharset in the langdata repository to compare with my .unicharset.
108
# 10 17,102,204,255,74,242,0,23,78,249 Common 95 4 95 # # # [23 ]p
… 10 12,102,64,124,114,273,8,37,132,333 Common 96 10 96 ... # … [2026 ]p
Thanks for your reply |
combine_tessdata -u tessdata_best/fas.traineddata fas.
This will unpack the traineddata file. Look at fas.lstm-unicharset; that probably has the ZWNJ in it. You can add a few additional lines containing ZWNJ to the training text in langdata. |
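The unpack-and-inspect step above could be scripted roughly as follows. The `combine_tessdata -u` invocation and the `fas.lstm-unicharset` output name come from the comment; the helper names are hypothetical, and the parsing assumes each unicharset entry starts with the unichar followed by a space.

```python
import subprocess
from pathlib import Path

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def unpack_command(traineddata, prefix):
    """Argument list for unpacking a traineddata file into prefix* files."""
    return ["combine_tessdata", "-u", traineddata, prefix]

def entries_containing(unicharset_text, codepoint):
    """Unicharset lines whose unichar (first field) contains the codepoint."""
    return [
        line
        for line in unicharset_text.split("\n")[1:]  # skip the count line
        if line.strip() and codepoint in line.split(" ", 1)[0]
    ]

def unpack_and_check(traineddata="tessdata_best/fas.traineddata", prefix="fas."):
    """Unpack, then report whether the LSTM unicharset lists ZWNJ."""
    subprocess.run(unpack_command(traineddata, prefix), check=True)
    text = Path(prefix + "lstm-unicharset").read_text(encoding="utf-8")
    return bool(entries_containing(text, ZWNJ))
```

`unpack_and_check()` needs `combine_tessdata` on the PATH and the traineddata file at the given location; `entries_containing` can be used on its own with any unicharset text already on disk.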
How do I install the |
Please use our forum for asking questions. |
Ref: tesseract-ocr/langdata#76 (comment)
copied below