the fonts fas traineddata #10

Open
TAQBIBT opened this issue Jul 21, 2019 · 4 comments

@TAQBIBT

TAQBIBT commented Jul 21, 2019

Thanks for uploading this trained model - could you possibly provide some info about the training data?

Specifically the fonts used and the text used for fas-script-float

Thanks!

@Shreeshrii
Owner

It has been a while since I ran that training and I don't have the files saved.

Going by the commits in the git repo, i.e.

67b9593

c50e3a3

4e706d1

I think it was based on fine-tuning (for impact) the tessdata_best/script/Arabic model. I added the Arabic comma and other punctuation to the training_text and did not include the English letters [a-zA-Z] in the unicharset. The font used was most probably Arial Unicode MS.
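
For readers who want to prepare a training text along the same lines, here is a minimal Python sketch. It assumes the unicharset is derived from the characters present in the training text, so stripping [a-zA-Z] from the text keeps those letters out. The helper names `strip_latin` and `charset` are hypothetical, and this is not necessarily how the original training was prepared.

```python
import re

# ASCII letters that should stay out of the character set.
LATIN_LETTERS = re.compile(r"[a-zA-Z]")

# Arabic comma, Arabic semicolon, Arabic question mark.
ARABIC_PUNCTUATION = "\u060C\u061B\u061F"


def strip_latin(line: str) -> str:
    """Remove ASCII letters and collapse the whitespace they leave behind."""
    return " ".join(LATIN_LETTERS.sub("", line).split())


def charset(lines) -> set:
    """Collect every non-whitespace character that occurs in the lines."""
    chars = set()
    for line in lines:
        chars.update(ch for ch in line if not ch.isspace())
    return chars


if __name__ == "__main__":
    # Hypothetical sample line: two Persian words with an English word between.
    raw = ["\u0633\u0644\u0627\u0645\u060C hello \u062F\u0646\u06CC\u0627"]
    cleaned = [strip_latin(line) for line in raw]
    chars = charset(cleaned)
    missing = [p for p in ARABIC_PUNCTUATION if p not in chars]
    print("characters kept:", len(chars))
    print("Arabic punctuation still missing:", [hex(ord(p)) for p in missing])
```

The report of missing punctuation is only a reminder to add such lines to the training text by hand; the sketch does not generate any text itself.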

@Shreeshrii
Owner

Please see tesseract-ocr/tessdata#70

Possibly I used the fonts recommended on that page: Roya, Nazanin, etc.

@anergui

anergui commented Jul 22, 2019

Thanks, Shreeshrii.
I cannot train the Arabic language with the ocrd-train fork that you proposed at this link: https://github.com/Shreeshrii/ocrd-train
Are the tiff and gt.txt files prepared the same way as for LTR languages, or not?
Can I start from one of the traineddata files you have proposed, for example fas-script-float?

Sorry for the inconvenience

@Shreeshrii
Owner

fas-script-float is for Persian/Farsi. The numerals for Farsi and Arabic are different.
But it is a float model, similar to those in tessdata_best, and can be used as a base for further training.
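
To make the numeral difference concrete: Unicode encodes the Arabic-Indic digits at U+0660 to U+0669 and the Extended Arabic-Indic digits, used for Persian, at U+06F0 to U+06F9. The Python sketch below maps between the two sets and reports which one a ground-truth line uses. The function name `digit_family` and the idea of normalising ground truth this way are illustrative assumptions, not part of the fas-script-float training.

```python
# Arabic-Indic digits (U+0660..U+0669) and Extended Arabic-Indic
# digits used for Persian (U+06F0..U+06F9), both in order 0..9.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
PERSIAN = "".join(chr(0x06F0 + i) for i in range(10))

# Translation tables for str.translate, one per direction.
ARABIC_TO_PERSIAN = str.maketrans(ARABIC_INDIC, PERSIAN)
PERSIAN_TO_ARABIC = str.maketrans(PERSIAN, ARABIC_INDIC)


def digit_family(text: str) -> str:
    """Report which digit set a line uses: 'arabic-indic', 'persian', 'mixed' or 'none'."""
    has_arabic = any(ch in ARABIC_INDIC for ch in text)
    has_persian = any(ch in PERSIAN for ch in text)
    if has_arabic and has_persian:
        return "mixed"
    if has_arabic:
        return "arabic-indic"
    if has_persian:
        return "persian"
    return "none"


if __name__ == "__main__":
    line = "\u0664\u0665\u0666"  # Arabic-Indic 4, 5, 6
    print(digit_family(line))
    print(digit_family(line.translate(ARABIC_TO_PERSIAN)))
```

A check like this can flag ground-truth lines whose digits do not match the set the base model was trained on.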

Regarding ocrd-train, I only have a fork of the project, with a suggested change to the Makefile to use the 'wordstrbox' option for creating box files for complex scripts.
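
As a rough illustration of what a WordStr box file for a single-line image might contain, here is a Python sketch. The layout is an assumption based on my reading of the WordStr format: one `WordStr` line carrying a box and the whole line text after `#`, followed by a tab-prefixed line marking the end of the line. Using the full image as the box is also an assumption. The function name `wordstr_box` is hypothetical, and the output should be checked against the Tesseract training documentation before use.

```python
def wordstr_box(text: str, width: int, height: int, page: int = 0) -> str:
    """Build WordStr box content for one single-line image.

    ASSUMPTION: the box covers the whole image (0 0 width height) and the
    tab-prefixed second line is the end-of-line marker.
    """
    # Collapse whitespace so a stray newline cannot split the entry.
    line = " ".join(text.split())
    if not line:
        raise ValueError("ground-truth line is empty")
    return (
        f"WordStr 0 0 {width} {height} {page} #{line}\n"
        f"\t 0 0 {width} {height} {page}\n"
    )


if __name__ == "__main__":
    # Hypothetical line image of 400x48 pixels with a two-word transcription.
    print(wordstr_box("\u0633\u0644\u0627\u0645 \u062F\u0646\u06CC\u0627", 400, 48), end="")
```

The text is written in logical (typing) order here; whether an RTL script needs any reordering at this step is not settled in this thread.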

However, I have not personally tried it for Arabic. I do not know the language/script, so it is difficult for me to verify that it is working correctly.
