Require Feedback Regarding "DV" or Dhivehi language. (Low translation accuracy also needs proper font) #43

Open
Xayaan opened this issue Dec 28, 2016 · 33 comments

Comments

@Xayaan

Xayaan commented Dec 28, 2016

Tessdata used to include the Dhivehi language, but it's missing now.
Edit: I've tested it; thanks to @amitdo, I got links to the docs I needed. However, the accuracy of the translation is far too low, as I've read online here. I'm currently looking for helpers who can join me in getting the language up to 100%, and I will also be getting in touch with the Dhivehi Academy regarding this.

@amitdo

amitdo commented Dec 28, 2016

https://github.com/tesseract-ocr/tesseract/blob/9c7e99b041/training/language-specific.sh#L32

@amitdo

amitdo commented Aug 1, 2017

div.traineddata was added to the repo
https://github.com/tesseract-ocr/tessdata/blob/master/best/div.traineddata

@Shreeshrii
Contributor

@Xayaan Have you tried the 'best' version? Any feedback?

@Xayaan
Author

Xayaan commented Aug 18, 2017

@Shreeshrii Yes, I have tried it, and the results are pretty bad. It needs to be trained intensively.

@Xayaan closed this as completed Aug 18, 2017
@Shreeshrii
Contributor

Shreeshrii commented Aug 18, 2017

Do not close the issue, then. You can change the title to say "feedback regarding Dhivehi" and add some notes on what is wrong so that it can be improved.

Also see tesseract-ocr/langdata#52

@Xayaan reopened this Aug 18, 2017
@Xayaan changed the title from 'Missing "Dhivehi" Language.' to 'Require Feedback Regarding "DV" or Dhivehi language. (Low translation accuracy also needs proper font)' Aug 18, 2017
@Xayaan
Author

Xayaan commented Aug 18, 2017

Done, thank you! 👍

@Shreeshrii
Contributor

Shreeshrii commented Aug 18, 2017

Are these fonts suitable for Dhivehi?

http://www.hassanhameed.com/?page_id=152

http://www.wazu.jp/gallery/Fonts_Thaana.html

@Xayaan
Author

Xayaan commented Aug 18, 2017

Yes, but I'd trust these: https://dhivehi.mv/fonts/

@Shreeshrii
Contributor

https://dv.wikipedia.org/wiki/%DE%89%DE%A6%DE%87%DE%A8_%DE%9E%DE%A6%DE%8A%DE%B0%DE%99%DE%A7

@theraysmith
This script looks similar to Arabic with accents. Have you had success in adding the accented version for the next training?

@amitdo

amitdo commented Aug 18, 2017

There is also Thaana traineddata

@Shreeshrii
Contributor

@Xayaan Please check with the Thaana traineddata also.

If possible, provide an image and its corresponding ground truth file for testing.

@Xayaan
Author

Xayaan commented Aug 24, 2017

Yes, it is. It is similar to Sanskrit and Arabic, and it is an RTL language.

I checked with the Thaana traineddata; it is not very accurate at the moment.

@Sofwath

Sofwath commented Oct 2, 2017

Any pointers to the training data used for Thaana?

@Shreeshrii
Contributor

Shreeshrii commented Oct 2, 2017 via email

@Sofwath

Sofwath commented Oct 2, 2017

It looks like that is for Thai. How about for Thaana (div)?

@Shreeshrii
Contributor

Shreeshrii commented Oct 2, 2017 via email

@Sofwath

Sofwath commented Oct 2, 2017

Link to the Thaana (div) monogram and bigram files:

https://github.com/Sofwath/thaanaOCR/tree/master/data

This is a Thaana text corpus:

https://www.dropbox.com/s/04ox44rfuqm5xhw/dv_MV_1.txt?dl=0

Is there anything else that we need for a basic training?

@Shreeshrii
Contributor

Shreeshrii commented Oct 2, 2017 via email

@Sofwath

Sofwath commented Oct 2, 2017

Great. Thanks. Will work on that.

@Shreeshrii
Contributor

Shreeshrii commented Oct 2, 2017 via email

@Shreeshrii
Contributor

Shreeshrii commented Oct 2, 2017 via email

@Sofwath

Sofwath commented Oct 3, 2017

Question: do I still need to create the box files even if we are using the LSTM method?

@Shreeshrii
Contributor

Shreeshrii commented Oct 3, 2017 via email

@Sofwath

Sofwath commented Oct 4, 2017

Any help? I am getting this error:

sudo /Users/sofwath/tesseract/training/tesstrain.sh --fonts_dir /Users/sofwath/dev/MLAI/tesseract/font/ --lang div --linedata_only --noextract_font_properties --langdata_dir langdata --tessdata_dir tessdata/ --output_dir divtrain/

=== Starting training for language 'div'
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
mktemp [-d] [-q] [-u] -t prefix
[Wed Oct 4 10:49:19 +05 2017] /usr/local/bin/text2image --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --font=MV Typewriter --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=

=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Wed Oct 4 10:49:20 +05 2017] /usr/local/bin/text2image --fontconfig_tmpdir= --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=langdata/div/div.training_text
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.GrmD2E1v/div/div.MV_Typewriter.exp0.box does not exist or is not readable

@Shreeshrii
Contributor

Shreeshrii commented Oct 4, 2017 via email

@Sofwath

Sofwath commented Oct 4, 2017

I changed
export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
to
export FONT_CONFIG_CACHE=$(mktemp -d -tmpdir font_tmp.XXXXXXXXXX)
in tesstrain_utils.sh, and the first error was fixed, but I still get:

=== Starting training for language 'div'
/Users/sofwath/tesseract/training/tesstrain_utils.sh: line 197: ${sample_path}: ambiguous redirect
[Wed Oct 4 14:05:34 +05 2017] /usr/local/bin/text2image --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --font=MV Typewriter --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx/sample_text.txt --text=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx/sample_text.txt --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx
'--text' option is missing!

=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Wed Oct 4 14:05:34 +05 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx --fonts_dir=/Users/sofwath/dev/MLAI/tesseract/font/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=langdata/div/div.training_text
'--text' option is missing!
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.OaGHUsk9/div/div.MV_Typewriter.exp0.box does not exist or is not readable
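
A note on the `mktemp` change above: `--tmpdir` is a GNU `mktemp` option that the BSD `mktemp` on macOS does not have. With `-tmpdir`, BSD `mktemp` appears to read the flag as `-t mpdir` and then treats `font_tmp.XXXXXXXXXX` as a second template, which would explain the two space-separated names (`mpdir.8n1I8qtq font_tmp.f7jJ5RoYKx`) in the log, the `ambiguous redirect`, and probably the `'--text' option is missing!` message as well. A minimal sketch of a form that should work with both implementations, assuming nothing else in `tesstrain_utils.sh` depends on the GNU flag (`make_font_cache_dir` is a hypothetical helper name, not something from the script):

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of tesstrain_utils.sh): create the fontconfig
# cache directory without the GNU-only --tmpdir flag.
make_font_cache_dir() {
  # Fall back to /tmp when TMPDIR is unset.
  local base="${TMPDIR:-/tmp}"
  # macOS sets TMPDIR with a trailing slash; strip it to avoid "//".
  # An explicit template path is accepted by both GNU and BSD mktemp.
  mktemp -d "${base%/}/font_tmp.XXXXXXXXXX"
}

FONT_CONFIG_CACHE="$(make_font_cache_dir)"
export FONT_CONFIG_CACHE
echo "fontconfig cache: ${FONT_CONFIG_CACHE}"
```

The command prints exactly one path, so later redirects such as the one at line 197 get a single target.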

@Shreeshrii
Contributor

Shreeshrii commented Oct 4, 2017 via email

@Sofwath

Sofwath commented Oct 4, 2017

I followed this example:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

@Sofwath

Sofwath commented Oct 4, 2017

For tesstrain.sh, there is no --text command-line option.

@Shreeshrii
Contributor

Shreeshrii commented Oct 4, 2017 via email

@Shreeshrii
Contributor


=== Starting training for language 'div'
[Thu Oct 5 19:39:30 DST 2017] /usr/local/bin/text2image --fonts_dir=/mnt/c/Windows/Fonts --font=MV Typewriter --outputbase=/tmp/font_tmp.v2PwMI2E8F/sample_text.txt --text=/tmp/font_tmp.v2PwMI2E8F/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.v2PwMI2E8F
Rendered page 0 to file /tmp/font_tmp.v2PwMI2E8F/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using MV Typewriter
[Thu Oct 5 19:40:32 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.v2PwMI2E8F --fonts_dir=/mnt/c/Windows/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0 --max_pages=3 --font=MV Typewriter --text=../langdata/div/div.training_text
Rendered page 0 to file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.tif
Rendered page 1 to file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.tif
Rendered page 2 to file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Thu Oct 5 19:40:42 DST 2017] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.jFPtcB8yoM/div/div.unicharset --norm_mode 2 /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.box
Extracting unicharset from box file /tmp/tmp.jFPtcB8yoM/div/div.MV_Typewriter.exp0.box
Word started with a combiner:0x7b0
Normalization failed for string 'ްނ'
Word started with a combiner:0x7aa
Normalization failed for string 'ުށ'
Word started with a combiner:0x7ac
Normalization failed for string 'ެފ'
Word started with a combiner:0x7b0
Normalization failed for string 'ްށ'
Word started with a combiner:0x7ae
Normalization failed for string 'ޮކ'
Word started with a combiner:0x7b0

I was able to run the program. But there are errors. See attached log file.

tesstrain.log.txt
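
The code points in the `Word started with a combiner` lines above (0x7aa, 0x7ac, 0x7ae, 0x7b0) all fall in the Thaana vowel-sign range U+07A6 to U+07B0. To see which marks trigger the warning most often in a full log, something like the following could help. It is only a sketch: `count_combiner_warnings` is a hypothetical helper, and it assumes the log uses exactly the message format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "Word started with a combiner" warnings per
# code point. Reads a tesstrain log on stdin and prints "count code_point"
# lines, highest count first.
count_combiner_warnings() {
  grep -o 'combiner:0x[0-9a-fA-F]*' \
    | sed 's/^combiner://' \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ print $1, $2 }'
}

# Demo with the six warnings quoted in the log excerpt above.
# For a real run: count_combiner_warnings < tesstrain.log.txt
printf '%s\n' \
  'Word started with a combiner:0x7b0' \
  'Word started with a combiner:0x7aa' \
  'Word started with a combiner:0x7ac' \
  'Word started with a combiner:0x7b0' \
  'Word started with a combiner:0x7ae' \
  'Word started with a combiner:0x7b0' \
  | count_combiner_warnings
```

For the excerpt above, 0x7b0 accounts for three of the six warnings and 0x7aa, 0x7ac and 0x7ae for one each.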

@Sofwath

Sofwath commented Oct 6, 2017

I've been trying on macOS, but the bash scripts give too many errors. I'll try to run the process on Linux.

@nashrafeeg

@Sofwath Is the Dhivehi training data usable now, or is it abandoned? I have not seen any updates regarding this since 2017.

5 participants