Add support for encoding detection when default encoding is not correct #956
Comments
@potaninmt thanks for raising the issue. Can you share a sample PDF? Thanks!
@BobLd Thanks, I'm attaching the file!
@potaninmt unfortunately, I believe this is not a PdfPig issue. Firefox, Edge and Acrobat Reader are not able to copy the text properly either, which indicates that the PDF document is not built correctly. In Firefox, copying the following gives:
|
Your best chance of getting correct text might be OCR. Or did you manage to get the correct text after extraction by playing around with windows-1251 / windows-1252?
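For reference, a minimal sketch of what "playing around" with the two code pages could look like after extraction. It is written in Python rather than against PdfPig, and it assumes the extracted string was decoded as windows-1252 while the underlying bytes were really windows-1251 (swap the arguments for the opposite case). The helper name `redecode` is hypothetical and not part of any library.

```python
def redecode(text: str, wrong: str = "cp1252", right: str = "cp1251") -> str:
    """Undo an assumed wrong decode: recover the raw bytes, then decode them as `right`."""
    try:
        # Step 1: map the mis-decoded characters back to the original bytes.
        # Step 2: decode those bytes with the code page we believe is correct.
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text cannot have come from the assumed mis-decode; leave it unchanged.
        return text


if __name__ == "__main__":
    print(redecode("Ïðèâåò"))  # Привет
```

This only helps when the wrong and the right code page are both known, and it cannot repair a document whose font mapping is broken in some other way.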
@BobLd |
Hi @potaninmt, thanks a lot for pointing me to this library. I wasn't aware it existed, and it's extremely interesting. I had a quick look: the Ude library is based on the Mozilla Universal Charset Detector (MUCD). All the MUCD implementations I could find are under MPL/GPL2/AGPL2 licenses, which are not really compatible with the PdfPig license. One option would be to release a separate NuGet package under the same license as those implementations, leaving the PdfPig license untouched. It would also be possible to write our own implementation (trickier). I'll have a look at the first option, hopefully in the short term. Happy for you to give any implementation advice. Some references (mainly for me) I found on the topic:
|
@Charltsing Thank you!
@BobLd Useful links: |
I noticed a bug when extracting text from PDFs that contain several different encodings. For example, for one of the documents I had to convert the text from windows-1251 to windows-1252 to make it readable. Would it be possible to implement extraction so that the text comes out correctly even when a document mixes many encodings? It is even possible for each token in a PDF to have its own encoding.
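To illustrate the per-token idea, here is a rough Python sketch (not PdfPig code). It assumes only two candidate code pages, windows-1252 and windows-1251, and that the expected text is Cyrillic. The scoring rule and the helper names `cyrillic_ratio`, `fix_token` and `fix_text` are hypothetical. This is a crude heuristic, not a real charset detector such as Ude.

```python
def cyrillic_ratio(s: str) -> float:
    """Fraction of the alphabetic characters in `s` that are in the Cyrillic block."""
    letters = [c for c in s if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= c <= "\u04ff" for c in letters) / len(letters)


def fix_token(token: str, wrong: str = "cp1252", right: str = "cp1251") -> str:
    """Re-decode one token if the result looks mostly Cyrillic; otherwise keep it."""
    try:
        candidate = token.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return token  # token is not a plausible mis-decode of `right` bytes
    # Assumed threshold: accept only when more than half of the letters become Cyrillic.
    return candidate if cyrillic_ratio(candidate) > 0.5 else token


def fix_text(text: str) -> str:
    """Apply the heuristic independently to each space-separated token."""
    return " ".join(fix_token(t) for t in text.split(" "))


if __name__ == "__main__":
    print(fix_text("PDF Ïðèâåò ìèð café"))  # PDF Привет мир café
```

Deciding per token keeps correctly decoded words (ASCII, or Latin text with an occasional accent) untouched, but short tokens give the heuristic very little evidence, so a real detector would need more context than this.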