Require Feedback Regarding "DV" or Dhivehi language. (Low translation accuracy also needs proper font) #43
Comments
div.traineddata was added to the repo |
@Xayaan Have you tried the 'best' version? Any feedback? |
@Shreeshrii Yes, I have tried it, and the results are pretty bad. It needs to be trained intensively. |
Do not close the issue then, you can change title to say feedback regarding Dhivehi and then add some notes regarding what is wrong so that it can be improved. Also see tesseract-ocr/langdata#52 |
Done, thank you! 👍 |
Are these fonts suitable for Dhivehi ? |
Yes, but I'd trust these : https://dhivehi.mv/fonts/ |
https://dv.wikipedia.org/wiki/%DE%89%DE%A6%DE%87%DE%A8_%DE%9E%DE%A6%DE%8A%DE%B0%DE%99%DE%A7 @theraysmith |
There is also Thaana traineddata |
@Xayaan please check with the Thaana traineddata also. If possible, provide an image and its corresponding ground truth file for testing. |
Yes, it is. It is similar to Sanskrit and Arabic. It is an RTL language. I checked with the Thaana traineddata; its accuracy is currently low. |
any pointers to the training data used for Thaana? |
The langdata repo has not been updated for 4.0x
https://github.com/tesseract-ocr/langdata/tree/master/tha
has the old training files
|
It looks like that is for Thai. How about for Thaana (div)? |
Sorry about that.
Looks like https://github.com/tesseract-ocr/langdata/tree/master/div
does not have all the required files.
If it is similar to Arabic, you can copy langdata files from there and
modify for Thaana.
http://crubadan.org/languages/dv
could be a source for wordlists, training text.
|
Link to Thaana (div) monogram and bigram file:
https://github.com/Sofwath/thaanaOCR/tree/master/data
This is a Thaana text corpus:
https://www.dropbox.com/s/04ox44rfuqm5xhw/dv_MV_1.txt?dl=0
Anything else that we need to have for a basic training? |
You can download
https://github.com/tesseract-ocr/tessdata_best/blob/master/div.traineddata
and
https://github.com/tesseract-ocr/tessdata_best/blob/master/Thaana.traineddata
Then unpack the traineddata to get the files.
root@All-in-1-Touch:/mnt/c/Users/User/shree/tessdata_best# combine_tessdata -u div.traineddata div.
Extracting tessdata components from div.traineddata
Wrote div.lstm
Wrote div.lstm-punc-dawg
Wrote div.lstm-word-dawg
Wrote div.lstm-number-dawg
Wrote div.lstm-unicharset
Wrote div.lstm-recoder
Wrote div.version
Version string:4.00.00alpha:div:synth20170629
17:lstm:size=3218139, offset=192
18:lstm-punc-dawg:size=4506, offset=3218331
19:lstm-word-dawg:size=1342450, offset=3222837
20:lstm-number-dawg:size=426, offset=4565287
21:lstm-unicharset:size=7276, offset=4565713
22:lstm-recoder:size=1093, offset=4572989
23:version:size=30, offset=4574082
root@All-in-1-Touch:/mnt/c/Users/User/shree/tessdata_best# combine_tessdata -u Thaana.traineddata Thaana.
Extracting tessdata components from Thaana.traineddata
Wrote Thaana.lstm
Wrote Thaana.lstm-punc-dawg
Wrote Thaana.lstm-word-dawg
Wrote Thaana.lstm-number-dawg
Wrote Thaana.lstm-unicharset
Wrote Thaana.lstm-recoder
Wrote Thaana.version
Version string:4.00.00alpha:Thaana:synth20170629
17:lstm:size=7723707, offset=192
18:lstm-punc-dawg:size=5674, offset=7723899
19:lstm-word-dawg:size=5036906, offset=7729573
20:lstm-number-dawg:size=4762, offset=12766479
21:lstm-unicharset:size=10741, offset=12771241
22:lstm-recoder:size=1633, offset=12781982
23:version:size=33, offset=12783615
root@All-in-1-Touch:/mnt/c/Users/User/shree/tessdata_best#
You can further get the original wordlists by using dawg2wordlist, but the actual training_text will not be there.
|
Great. Thanks. Will work on that. |
dawg2wordlist syntax will be as follows
$ dawg2wordlist Thaana.lstm-unicharset Thaana.lstm-word-dawg Thaana.wordlist
Loading word list from Thaana.lstm-word-dawg
Reading squished dawg
Word list loaded.
similarly for punc and numbers.
You can review these files for accuracy.
I don't think tesseract uses unigrams and bigrams for training, though they
may be used internally at Google to generate a representative training text.
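The dawg2wordlist call above can be repeated for the punctuation and number dawgs. A minimal sketch, assuming the component names that combine_tessdata wrote earlier in this thread; the output suffixes (`.punc`, `.numbers`) and the helper name `wordlist_cmds` are illustrative assumptions. It only prints the commands, so they can be checked before anything is run.

```shell
#!/bin/sh
# Print one dawg2wordlist command per LSTM dawg component of a language.
# Argument order follows the example above: unicharset, dawg, output file.
wordlist_cmds() {
  lang="$1"
  # "component:output-suffix" pairs; the output suffixes are an assumption.
  for pair in word:wordlist punc:punc number:numbers; do
    kind="${pair%%:*}"   # part before the colon, e.g. "word"
    out="${pair##*:}"    # part after the colon, e.g. "wordlist"
    printf 'dawg2wordlist %s.lstm-unicharset %s.lstm-%s-dawg %s.%s\n' \
      "$lang" "$lang" "$kind" "$lang" "$out"
  done
}

# Dry run: show the three commands for the Thaana files.
wordlist_cmds Thaana
```

Piping the output to `sh` (for example `wordlist_cmds div | sh`) would run the commands, provided dawg2wordlist is on the PATH and the unpacked files are in the current directory.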
|
FYI Thaana files will have both English and Divehi. div files will have
only Divehi.
|
question: do i still need to create the box files even if we are using the lstm method? |
You have to use tesstrain.sh script file, also see tesstrain_utils.sh and
language_specific.sh in training directory.
These create the box/tiff files from the training text and specified fonts.
They are used for creating the lstmf files and are kept only in the tmp
directory.
Try the training tutorial for english and look at the log file and tmp
directory.
|
Any help? I am getting this error:
sudo /Users/sofwath/tesseract/training/tesstrain.sh --fonts_dir /Users/sofwath/dev/MLAI/tesseract/font/ --lang div --linedata_only --noextract_font_properties --langdata_dir langdata --tessdata_dir tessdata/ --output_dir divtrain/
=== Starting training for language 'div'
=== Phase I: Generating training images === |
You are getting this error:
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
mktemp [-d] [-q] [-u] -t prefix
------------
See tesstrain_utils.sh, lines 29 and 172.
Training uses the /tmp directory for creating files.
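The `illegal option -- -` message suggests that the mktemp on this machine (macOS) does not accept GNU-style long options such as `--tmpdir`. A minimal sketch of a call form that both GNU and BSD mktemp accept, passing a full template path instead of an option; the helper name `make_tmp_dir` is made up for illustration and this is not the code in tesstrain_utils.sh.

```shell
#!/bin/sh
# Create a private temporary directory without GNU-only long options.
# A full template path ending in X's works with both GNU and BSD mktemp.
make_tmp_dir() {
  # $1 is a prefix such as "font_tmp"; fall back to /tmp if TMPDIR is unset.
  mktemp -d "${TMPDIR:-/tmp}/$1.XXXXXXXXXX"
}

font_cache=$(make_tmp_dir font_tmp)
echo "created: ${font_cache}"
```

The same form could stand in wherever the script calls `mktemp -d --tmpdir ...`, though other GNU-specific calls in the training scripts may still need attention on macOS.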
On Wed, Oct 4, 2017 at 11:20 AM, Sofwath ***@***.***> wrote:
Any help? I am getting this error
sudo /Users/sofwath/tesseract/training/tesstrain.sh --fonts_dir /Users/sofwath/dev/MLAI/tesseract/font/ --lang div --linedata_only --noextract_font_properties --langdata_dir langdata --tessdata_dir tessdata/ --output_dir divtrain/
=== Starting training for language 'div'
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
mktemp [-d] [-q] [-u] -t prefix
[Wed Oct 4 10:49:19 +05 2017] /usr/local/bin/text2image --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --font=MV Typewriter --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=
=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Wed Oct 4 10:49:20 +05 2017] /usr/local/bin/text2image --fontconfig_tmpdir= --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=langdata/div/div.training_text
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0.box does not exist or is not readable
|
changed === Starting training for language 'div' === Phase I: Generating training images === |
You have to look at your paths
'--text' option is missing!
On Wed, Oct 4, 2017 at 2:37 PM, Sofwath ***@***.***> wrote:
changed
export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
to
export FONT_CONFIG_CACHE=$(mktemp -d -tmpdir font_tmp.XXXXXXXXXX)
in tesstrain_utils.sh and the first error was fixed but still get
=== Starting training for language 'div'
/Users/sofwath/tesseract/training/tesstrain_utils.sh: line 197: ${sample_path}: ambiguous redirect
[Wed Oct 4 14:05:34 +05 2017] /usr/local/bin/text2image --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --font=MV Typewriter --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx/sample_text.txt --text=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx/sample_text.txt --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx
'--text' option is missing!
=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Wed Oct 4 14:05:34 +05 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=langdata/div/div.training_text
'--text' option is missing!
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0.box does not exist or is not readable
|
I followed this example:
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only |
for tesstrain.sh there is no --text command line option |
/Users/sofwath/tesseract/training/tesstrain_utils.sh: line 197: ${sample_path}: ambiguous redirect
If you are changing the bash script, you have to make sure it is done
correctly. Please look at the error messages you get.
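One likely reading of the `ambiguous redirect`: with `-tmpdir`, BSD mktemp appears to take `-t mpdir` as a prefix and `font_tmp.XXXXXXXXXX` as a second template, so it prints two names (visible in the log as `mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx`). bash reports this error when an unquoted redirect target expands to more than one word. A small sketch, with a deliberately artificial directory name containing a space, showing that a quoted target is always treated as a single word; the variable names only mirror the log.

```shell
#!/bin/sh
# Temporary directory whose name contains a space, to imitate a value
# that would split into several words if expanded without quotes.
work_dir=$(mktemp -d "${TMPDIR:-/tmp}/font tmp.XXXXXXXXXX")
sample_path="${work_dir}/sample_text.txt"

# Quoted redirect target: one word, regardless of whitespace in the path.
printf 'Text\n' > "${sample_path}"
cat "${sample_path}"
```

Quoting alone would probably not repair the script in this case, since the variable would still hold two names; the mktemp call has to produce a single directory first.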
|
I was able to run the program. But there are errors. See attached log file. |
I've been trying on macOS, but the bash scripts give too many errors. I'll try to run the process on Linux. |
@Sofwath Is the Dhivehi training data usable now, or is it abandoned? I have not seen any updates regarding this since 2017. |
Tessdata had the Dhivehi language, but it's missing now.
Edit: I've tested it; thanks to @amitdo, I got links to the docs I needed. However, the accuracy of the translation is far too low, as I've read online here. I'm currently looking for helpers who can join me in translating the language up to 100%, and I will also be getting in touch with the Dhivehi Academy regarding this.