Choose OCR Software for Chinese PDFs and Images
Thread poster: Kirill Loktionov
Kirill Loktionov
Hungary
Local time: 01:23
English to Russian
Jun 29, 2023

Hello!
I've seen some old topics where people asked about OCR software, but what about the current market? Which application works best for recognizing scanned Chinese documents (incl. drawings) or documents saved as images? We use ABBYY FineReader 15 for most languages, but its Chinese OCR quality is really bad. It often misrecognizes characters if they are blurred or simply expanded/condensed. Are there any alternatives? I tried ReadIris, and it has far fewer options for setting areas on a page. Online resources like 2ocr work great, though the result is just plain text, so it can only serve as a backup for the parts where FineReader fails.

[Edited at 2023-06-29 06:41 GMT]

[Edited at 2023-06-29 11:25 GMT]


 
Sakshi Garg  Identity Verified
India
Local time: 04:53
Member
Tesseract Jun 30, 2023

Hi,

I hope you are doing well! There are many OCR programs on the market nowadays with extensive feature sets. For Chinese, I personally prefer Tesseract.

Tesseract is an open-source OCR engine that supports numerous languages, including Chinese. It can be a bit technical to set up and use, but it is known for its high accuracy.

You could try it to see how accurately it recognizes the characters.

I hope this helps!

Thank you.

Regards
S


 
Milan Condak  Identity Verified
Local time: 01:23
English to Czech
PDF24 Jun 30, 2023

Sakshi Garg wrote:

Tesseract is an open-source OCR engine that supports numerous languages, including Chinese.


Tesseract is part of several software packages that have user interfaces. One of them is the PDF24 suite of programs. Look for its OCR tool.

https://www.pdf24.org/zh/

https://www.pdf24.org/en/

https://www.pdf24.org/cs/

Milan


 
Kirill Loktionov
Hungary
Local time: 01:23
English to Russian
TOPIC STARTER
Tesseract Settings Jul 1, 2023

Sakshi Garg wrote:

Hi,

I hope you are doing well! There are many OCR programs on the market nowadays with extensive feature sets. For Chinese, I personally prefer Tesseract.

Tesseract is an open-source OCR engine that supports numerous languages, including Chinese. It can be a bit technical to set up and use, but it is known for its high accuracy.

You could try it to see how accurately it recognizes the characters.

I hope this helps!

Thank you.

Regards
S


Hi Sakshi,

Thank you for the tip! But how can I use a GUI with Tesseract? Unfortunately, the machine still cannot work out by itself which areas to recognize. Is there documentation for such a setting?

Kind regards,
Kirill


 
Mr. Satan (X)
English to Indonesian
Choices Jul 2, 2023

Kirill Loktionov wrote:
But how can I use a GUI with Tesseract?


You have several choices:
https://tesseract-ocr.github.io/tessdoc/User-Projects-–-3rdParty.html

That being said, I don't work with the Chinese language in any capacity, so I don't know if it is any good for Hanzi.

Is there documentation for such a setting?


The man page for Tesseract:
https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
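
As an illustration of one setting from that man page: the `--psm` option (page segmentation mode) tells Tesseract how to split a page into text areas. The sketch below runs one scan through three modes so the results can be compared. It is a minimal sketch, not an official recipe: the function names `run` and `try_psm_modes`, the `DRY_RUN` switch and the file name `drawing.png` are made up for the example, and which mode suits Chinese drawings best is something to test rather than assume.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
# (run and DRY_RUN are hypothetical helper names, not part of Tesseract.)
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# OCR one image with several page segmentation modes.
# The body is a subshell ( ... ), so its variables stay local.
try_psm_modes() (
  image="$1"            # input scan, e.g. drawing.png
  lang="${2:-chi_sim}"  # language code, simplified Chinese by default
  # 3  = fully automatic page segmentation (Tesseract's default)
  # 6  = assume a single uniform block of text
  # 11 = sparse text: find as much text as possible in no particular order
  for psm in 3 6 11; do
    run tesseract "$image" "${image%.*}_psm$psm" -l "$lang" --psm "$psm"
  done
)
```

For example, `try_psm_modes drawing.png chi_sim` should write `drawing_psm3.txt`, `drawing_psm6.txt` and `drawing_psm11.txt`; with `DRY_RUN=1` set, it only prints the three commands.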

HTH, FWIW.

Milan Condak wrote:
Tesseract is part of several software packages that have user interfaces.


I don't think this description is particularly accurate. The other programs you are referring to are simply graphical front-ends for the Tesseract program itself. It's a similar situation to ffmpeg or espeak. You can also use these programs from the command-line interface, which I usually prefer.

[Edited at 2023-07-02 01:55 GMT]


 
Kirill Loktionov
Hungary
Local time: 01:23
English to Russian
TOPIC STARTER
Interesting Jul 4, 2023

I am a bit flabbergasted that I am writing the following words… Thank you, Mr. Satan!
I have tried several of the products mentioned here: https://tesseract-ocr.github.io/tessdoc/User-Projects-–-3rdParty.html
In particular:
Rescribe (unfortunately, I could not even find a way to connect to a server to open a document, only local folders),
normcap (I did not understand how to launch it; is it for Windows PCs?),
Free-Ocr-Windows-Desktop (it looks like no languages other than En/De/Es are available, and I found no settings for that; alas, it is a plain-text OCR and its quality in English is quite low, e.g. it read 'MACHINE MAINTENANCE INSTRUCTIONS' as 'uacanwc zxmn-ru\'An'cc msnaucnorvs').
I guess there is nothing as flexible as FineReader (except that it is proprietary software). Perhaps there are some Chinese competitors that handle logograms better. Time will tell.


 
Mr. Satan (X)
English to Indonesian
Using Tesseract from the Command-line Interface Jul 5, 2023

This is why I prefer using Tesseract from the command-line interface. It worked quite nicely for me when I had to deal with scanned English documents. Here are the commands to use it without a GUI. Please note that I adapted them to your specific use case by adding the language parameter for Chinese, in both the traditional (chi_tra) and simplified (chi_sim) variants. The language parameter is not required if the source document is in English, since Tesseract defaults to English. Feel free to pick the one that suits your needs.

tesseract INPUT_FILENAME OUTPUT_FILENAME -l chi_tra
tesseract INPUT_FILENAME OUTPUT_FILENAME -l chi_sim
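
The same commands can be wrapped in a small shell function to process several images in one go. The sketch below is only one way to do it, and `run`, `ocr_images` and `DRY_RUN` are hypothetical names. It relies on two features described in the Tesseract documentation: languages can be combined with `+` (for example `chi_sim+eng` for pages that mix Chinese and English), and listing the `txt` and `pdf` output configs after the options should produce a text file plus a searchable PDF instead of plain text only.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
# (run and DRY_RUN are hypothetical helper names, not part of Tesseract.)
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Usage: ocr_images LANG IMAGE...
# The body is a subshell ( ... ), so its variables stay local.
ocr_images() (
  lang="$1"   # e.g. chi_sim, chi_tra or chi_sim+eng
  shift
  for image in "$@"; do
    base="${image%.*}"   # output base name: the image name without extension
    # "txt pdf" asks for both a text file and a searchable PDF per image
    run tesseract "$image" "$base" -l "$lang" txt pdf
  done
)
```

For example, `ocr_images chi_sim+eng scans/*.png` should leave a `.txt` and a `.pdf` next to each image; prefixing the call with `DRY_RUN=1` prints the commands without running them.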


You will need the Chinese language packages installed. The same is true for any other language you work with. I'm using Linux, so they are easy to get from the official repository. My apologies, but I can't help you if you're using Windows.
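
To check which language packs are present, Tesseract can list them with `tesseract --list-langs`. The helper below is a hedged sketch (the name `missing_langs` is made up) that reads that list on standard input and reports which of the requested language codes are absent.

```shell
# Usage: tesseract --list-langs 2>&1 | missing_langs chi_sim chi_tra
# Reads the installed-language list from stdin, prints every requested
# code that is not in it, and fails (status 1) if any is missing.
# The body is a subshell ( ... ), so "exit" only ends the function.
missing_langs() (
  available="$(cat)"   # full output of --list-langs, one code per line
  status=0
  for lang in "$@"; do
    # -x matches whole lines only, so chi_sim does not match chi_sim_vert;
    # -F treats the code as a fixed string, not a pattern
    if ! printf '%s\n' "$available" | grep -qxF "$lang"; then
      printf 'missing: %s\n' "$lang"
      status=1
    fi
  done
  exit "$status"
)
```

On Debian or Ubuntu the packs should be installable as `tesseract-ocr-chi-sim` and `tesseract-ocr-chi-tra`. On other systems, including Windows, the usual route is to copy the matching `.traineddata` files from the tesseract-ocr/tessdata repository into Tesseract's tessdata folder. Treat both as pointers to verify, not as tested instructions.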

I should mention that Tesseract by itself doesn't support PDF as an input file format. For this, I'd use GIMP with the Export Layers plugin to convert the PDF document into separate image files. Then I'd extract the text with Tesseract using the commands above.

https://www.gimp.org/
https://github.com/kamilburda/gimp-export-layers
https://www.linuxuprising.com/2019/03/how-to-convert-pdf-to-image-png-jpeg.html
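
For a command-line route from PDF to text, one option is `pdftoppm` from poppler-utils in place of GIMP. This is an assumption swapped in for the GIMP step, not what the post describes. The sketch renders each page to a 300 dpi PNG and then runs the same Tesseract command on every page image; `run`, `pdf_ocr`, `DRY_RUN` and the `_page` suffix are hypothetical names.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
# (run and DRY_RUN are hypothetical helper names, not part of either tool.)
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Usage: pdf_ocr FILE.pdf [LANG]
# The body is a subshell ( ... ), so "exit" only ends the function.
pdf_ocr() (
  pdf="$1"
  lang="${2:-chi_sim}"
  prefix="${pdf%.pdf}_page"   # page images become PREFIX-1.png, PREFIX-2.png, ...
  # Step 1: render every PDF page to a PNG at 300 dpi
  run pdftoppm -r 300 -png "$pdf" "$prefix" || exit 1
  # Step 2: OCR each page image into a text file with the same base name
  for image in "$prefix"-*.png; do
    [ -e "$image" ] || continue   # skip the unexpanded pattern if no pages exist
    run tesseract "$image" "${image%.png}" -l "$lang"
  done
)
```

For example, `pdf_ocr manual.pdf chi_tra` should leave one `manual_page-N.txt` per page. Note that with `DRY_RUN=1` the Tesseract lines only appear for page images that already exist, because nothing is rendered in that mode.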

[Edited at 2023-07-05 01:04 GMT]


 

