OCR to Translate PDF Images into Machine-Readable Text

September 9, 2021
FOXITBLOG

This article will tell you how to OCR PDFs and PDF portfolios with Foxit PDF Editor.

How Does It Work?

Optical Character Recognition, or OCR, is a software process that enables images or printed text to be translated into machine-readable text. OCR is most commonly used when scanning paper documents to create electronic copies, but can also be performed on existing electronic documents (e.g. PDF or PDF portfolio).

When OCR is performed on an image, the software analyzes the image and looks for patterns that represent characters (e.g. letters, numbers, punctuation marks). It then compares these patterns to a library of known character shapes and attempts to match them. Once the characters have been identified, the OCR software converts them into machine-readable text.
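To make the matching step concrete, here is a minimal sketch of comparing a scanned glyph against a library of known shapes. It is a toy illustration only: the 3x5 bitmaps, the `TEMPLATES` library, and the pixel-agreement score are assumptions made for this example, and real OCR engines (including the one in Foxit PDF Editor) use far more sophisticated methods.

```python
# Toy template matcher: each glyph is a 3x5 bitmap written as rows of '#' and '.'.
# The tiny TEMPLATES library is a made-up illustration, not a real OCR model.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(glyph, template):
    """Return the fraction of pixels on which two same-sized bitmaps agree."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    )
    return same / total


def recognize(glyph):
    """Return (best_character, score) for the closest template."""
    return max(
        ((char, similarity(glyph, tmpl)) for char, tmpl in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )


# A noisy "T": one stray pixel has been added in the middle row.
noisy_t = ["###", ".#.", "##.", ".#.", ".#."]
char, score = recognize(noisy_t)
print(char, round(score, 2))  # prints: T 0.93
```

Even with one wrong pixel, the glyph still matches "T" more closely than any other template, which is why scanned text with minor noise can usually be recognized.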

Steps to Recognize Text

Foxit PDF Editor can detect whether a PDF file is scanned or image-based and make corresponding suggestions to initiate OCR when opening a scanned or image-based PDF. You can also run OCR anytime to recognize the image-based text in a PDF.

To recognize image-based or scanned text in a PDF file, perform the following steps:

1. Click Convert > Recognize Text > Current File. In the Recognize Text dialog box, specify the page range you need.

2. Choose the language used in your document. You can select multiple languages as well.

3. In the output type, check Searchable Text Image to make the image text selectable and searchable (or check Editable Text to enable the image text to be edited with Foxit PDF Editor). Then click OK to recognize the text.

Searchable Text Image:

During the OCR process, Foxit PDF Editor analyzes the image text and substitutes words/characters that closely approximate it. The substitute words/characters are placed on an invisible layer of text in the PDF, which makes the image text selectable and searchable. If a substitution is uncertain, the text is marked as an OCR suspect, which needs to be corrected manually.
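The idea of marking uncertain substitutions as suspects can be sketched as follows. This is an assumption-laden illustration, not how Foxit PDF Editor works internally: the `RecognizedWord` type, the per-word confidence score, and the `SUSPECT_THRESHOLD` value are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str          # best-guess text for this region of the image
    confidence: float  # hypothetical score: 0.0 (pure guess) to 1.0 (certain)


# Hypothetical cut-off; a real engine chooses its own criteria.
SUSPECT_THRESHOLD = 0.80


def find_suspects(words, threshold=SUSPECT_THRESHOLD):
    """Return the words whose confidence falls below the threshold."""
    return [w for w in words if w.confidence < threshold]


page = [
    RecognizedWord("Invoice", 0.98),
    RecognizedWord("t0tal", 0.55),
    RecognizedWord("due", 0.91),
]

for word in find_suspects(page):
    print(f"Review manually: {word.text!r} (confidence {word.confidence:.2f})")
```

In this sketch only "t0tal" falls below the threshold, so it is the one word a reviewer would be asked to confirm or correct.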

Image text can be extracted from a PDF and saved as a .txt file.

To OCR PDFs:

1. Open Foxit PDF Editor and go to File > Open. Find the PDF you want to OCR, select it, and click Open.
2. Go to Tools > Image Recognition OCR. The Image Recognition OCR dialog box appears.
3. In the Language drop-down list, select the language of the text in your PDF.
4. In the Output Options section, you can choose to output the recognized text to a new file, or to have it overlay the original image text in the PDF.
5. Click Start. Foxit PDF Editor will begin the OCR process. A progress bar will appear, showing you the status of the OCR process.

When the OCR process is complete, you can close the Image Recognition OCR dialog box. The text in your PDF should now be selectable and searchable. If there are any errors in the OCR process, they will be marked as OCR suspects. You can correct these errors by going to Tools > Correct Suspects.

Editable Text:

During the OCR process, Foxit PDF Editor compares the shape of the image text to the fonts installed on your system, picks the closest match, and turns the image text into editable text.

Note: If you are prompted to download the OCR component after clicking OK, click Yes to download and install it. Alternatively, download it later from the link provided, then click Foxit Plug-Ins in the Help tab and click Install Plugin in the About Foxit Plug-Ins dialog box that pops up. To get the full version of Foxit PDF Editor, please contact us.

Follow these simple steps:

1. (Optional) If you check Find All Suspect (Show all OCR results that may need to be changed.), the OCR Suspects dialog box pops up for you to check and correct OCR suspects right after the recognition completes. To learn how to correct OCR suspects, please refer to the instructions on “Find and Correct OCR Suspects”.
If you choose Editable Text in the output type with the Find All Suspect (Show all OCR results that may need to be changed.) option selected, the OCRed text that Foxit PDF Editor is not certain about will be marked as OCR suspects, and the original image text will be kept until you manually handle all the OCR suspects. You can also deselect this option to turn the image text into editable text with no OCR suspects after recognition. You can then modify the text directly using the commands in the Edit tab.
2. (Optional) If you select Editable Text in Step 3, the Recognize the line segments as path objects in the PDF option is available. If the image text in your document contains tables, selecting this option helps better recognize the line segments, but it may take longer to complete recognition.
3. A text recognition progress bar will pop up to show the progress.
4. Use the search function to confirm the result: the text in your image or scanned document is now searchable.

Tip: Foxit PDF Editor provides the Quick Recognition command under the Home/Convert tab to recognize all pages of a scanned or image-based PDF with the default or previous settings in one click.

To recognize text in multiple files:

1. Click Convert > Recognize Text > Multiple Files.
2. In the Recognize Text dialog box, click Add Files to add files, folders, or currently opened files. Use Move up, Move down, and Remove to adjust the order of the files.
3. Click Output Options…. In the Output Options dialog box, select the destination folder, choose how to name the new file and whether to overwrite an existing one, and then click OK.
4. Click OK. After recognition, a message box will pop up to notify you that the recognition is finished.

Note:

1. When you are using the CJK OCR engine for the first time, the system will remind you to download and install the engine from the Foxit server.
2. If there is any unsupported file added, a “Remove unsupported file(s)” button will appear in the Recognize Text dialog box. Click the button to remove the unsupported file(s) and then continue. While recognizing a PDF portfolio, Foxit PDF Editor will only extract and recognize PDF files in the portfolio.
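The multi-file workflow above boils down to three decisions: which queued files can be processed, what each output file is named, and whether an existing file is overwritten. The sketch below illustrates that logic in isolation. It does not call Foxit PDF Editor; the `SUPPORTED_EXTENSIONS` set, the `_ocr` suffix, and both function names are assumptions made for this example.

```python
from pathlib import Path

# Assumption for this sketch: only PDFs count as supported input.
SUPPORTED_EXTENSIONS = {".pdf"}


def split_supported(paths):
    """Separate supported files from unsupported ones, preserving order."""
    supported, unsupported = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


def output_path(source, dest_dir, suffix="_ocr", existing=(), overwrite=False):
    """Build the output file name, avoiding collisions unless overwrite is set."""
    dest_dir = Path(dest_dir)
    candidate = dest_dir / f"{source.stem}{suffix}{source.suffix}"
    taken = {Path(p) for p in existing}
    counter = 1
    while not overwrite and candidate in taken:
        candidate = dest_dir / f"{source.stem}{suffix}_{counter}{source.suffix}"
        counter += 1
    return candidate


queue = ["scan1.pdf", "notes.docx", "Scan2.PDF"]
supported, unsupported = split_supported(queue)
print("Will process:", [p.name for p in supported])
print("Unsupported:", [p.name for p in unsupported])
print(output_path(supported[0], "out", existing=["out/scan1_ocr.pdf"]).name)
```

Here `notes.docx` plays the role of a file that the "Remove unsupported file(s)" button would drop, and the last line shows a numbered name being chosen because overwriting is turned off.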

Getting Started

PDF portfolios offer many advantages over traditional PDFs, and OCR PDFs are an essential part of making PDF portfolios machine-readable. PDF portfolios can be easily shared, emailed, and uploaded, making them ideal for sharing large files or multiple files at once.

In addition, PDF portfolios can be password protected, ensuring that only authorized users can access the files. OCR PDFs are also smaller in size than traditional PDFs, making them easier to store and manage.

As a result, PDF portfolios offer a convenient and secure way to share information.


How to Use OCR on PDFs and PDF Portfolios