diff --git a/README.md b/README.md new file mode 100644 index 0000000..cbd873b --- /dev/null +++ b/README.md @@ -0,0 +1,16 @@ +# langdata +Source training data for Tesseract for lots of languages + +Want to re-train tesseract for a specific language, by modifying/augmenting the original training data? +Then you have come to the right place! + +If you want to find a language data set to run Tesseract, then look at our +[tessdata repository](https://github.com/tesseract-ocr/tessdata) instead. + +To re-create the training of a single language, _lang,_ you need the following: +* All the data in the _lang_ directory. +* The corresponding unicharset/xheights files for the script(s) used by _lang._ +* All the remaining non-lang-specific files in the top-level directory, such as `font_properties.` +* You also need to obtain the fonts needed to train the language. +Some languages were trained with commercially available fonts, so you will need to buy them in order to +reproduce the training exactly, or use substitutes.