Unfortunately, the documentation for tesseract isn't very clear, so it's difficult for beginners to learn what needs to be tweaked, or how to do it. This page explains some basic ways to improve its performance.
Also, make sure you have enough resolution to record all the details of the printed words on the page. Normally, average-sized print in books and journals can be scanned at 300 dpi; but unusually small print, or a typeface with very thin lines, will produce better results at 400 or 600 dpi. A rule of thumb is that the thinnest lines in the letter glyphs should be 2 pixels wide; this is usually possible if the x-height of the font is between 20 and 30 pixels.
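The rule of thumb above can be checked with a little arithmetic before you scan. The sketch below is only an illustration: the function names are made up for this page, and the figure of 45% for the ratio of x-height to nominal type size is an assumed typical value, not a measurement of any particular typeface.

```python
import math

POINTS_PER_INCH = 72.0  # printer's (DTP) points per inch

def x_height_pixels(x_height_pt, dpi):
    """Height in pixels of a feature x_height_pt points tall, scanned at dpi."""
    return x_height_pt * dpi / POINTS_PER_INCH

def min_dpi_for_x_height(x_height_pt, target_px=20.0):
    """Smallest resolution that gives at least target_px pixels of x-height."""
    return math.ceil(target_px * POINTS_PER_INCH / x_height_pt)

# Assumed example: 10-point type whose x-height is 45% of the type size.
x_height_pt = 0.45 * 10.0
for dpi in (300, 400, 600):
    print(dpi, "dpi ->", round(x_height_pixels(x_height_pt, dpi), 1), "px")
print("minimum dpi for a 20 px x-height:", min_dpi_for_x_height(x_height_pt))
```

If the result for your typeface falls below about 20 pixels at 300 dpi, that suggests stepping up to 400 or 600 dpi.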
And be careful to record page images using lossless compression. Often page images are compressed as JPEG images with some default quality (like 70 or 80) that is perfectly legible to the eye, and looks reasonable on casual inspection, but contains lots of compression artifacts that will make OCR detection of text very inaccurate. (The Internet is full of such images.) If you can't avoid JPEG compression, at least use the highest quality setting available.
If all you have is pages scanned at some distant library, you may discover that they're contaminated with noise, particularly if the pages are old, or if they were printed on rag paper that contained lots of little colored fibers. This typically shows up when tesseract complains that there are “lots of diacritics” when it searches for letters. In extreme cases, it may be useful to turn on the textord_heavy_nr setting, which is normally zero (i.e., off). But that is so heavy-handed that it usually makes an appreciable fraction of the text unreadable. [The ocrmypdf front end has its own -c (or --clean) option, which is much milder.]
Often, you can tell whether the noise-removal process has also removed important information by checking the little dots in glyphs like periods, commas, colons, and semicolons. If numerical values tend to lose their decimal points, the cleaning process should be toned down (even if tesseract continues to complain about diacritical marks).
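One rough way to automate that check is to compare how many of the numbers in the extracted text still contain a decimal point, before and after cleaning. This is only a heuristic sketch; the function name and the sample strings are invented for illustration.

```python
import re

# A run of digits, optionally followed by a decimal point and more digits.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def decimal_fraction(text):
    """Fraction of the numeric tokens in text that contain a decimal point."""
    numbers = NUMBER.findall(text)
    if not numbers:
        return 0.0
    with_point = sum(1 for n in numbers if "." in n)
    return with_point / len(numbers)

# Invented samples: the same table row after mild and after heavy cleaning.
mild = "1.25 3.75 10.5"
heavy = "125 3.75 105"
print(decimal_fraction(mild), decimal_fraction(heavy))
```

A marked drop in this fraction after cleaning suggests that the noise removal is eating the decimal points.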
If you tell tesseract to include the cleaned images in the final product (by using the -i option of ocrmypdf), you can see what kind of noise has caused errors in the OCR text. This may suggest what other parameters might be tweaked to reduce noise further.
Most of my own use of tesseract has been to make PDFs of old books searchable. However, I've also tried to extract numerical tables from scans of technical references; a particular irritation has been the omission of tables and figures from the scans available at Google Books. (Google was evidently interested only in extracting text from them.)
For simple tasks, the ocrmypdf script is very handy. You can get some idea of what ocrmypdf is doing by turning on its verbose (i.e., -v) option, which produces a surprisingly large amount of output. But for more complicated problems, like extracting numerical values from tables, you need finer control over tesseract than ocrmypdf alone can provide.
Whatever you want to do, you need a better understanding of how both these commands operate, and how they can be controlled, than the regular documentation provides.
On the other hand, the -d (or --deskew) and -c (or --clean) options to ocrmypdf almost always improve the accuracy of the extracted text. So those options should be used routinely for most tasks. But notice that many PDF files available from Google Books and other libraries have already been properly cleaned, and don't need the -c option.
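If you process many files, it can be convenient to assemble the ocrmypdf command in a small script. The sketch below only builds the argument list and prints it; the function name and the file names are placeholders, and the flags are the ones described above.

```python
import shlex

def ocrmypdf_command(input_pdf, output_pdf, deskew=True, clean=True, verbose=False):
    """Build (but do not run) an ocrmypdf command line as a list of arguments."""
    cmd = ["ocrmypdf"]
    if deskew:
        cmd.append("--deskew")  # same as -d
    if clean:
        cmd.append("--clean")   # same as -c; relies on the unpaper program
    if verbose:
        cmd.append("-v")
    cmd += [input_pdf, output_pdf]
    return cmd

# Placeholder file names, for illustration only.
print(shlex.join(ocrmypdf_command("scan.pdf", "scan_ocr.pdf")))
# A file that has already been cleaned can skip the cleaning step.
print(shlex.join(ocrmypdf_command("library.pdf", "library_ocr.pdf", clean=False)))
```

To actually run the command, the list can be handed to subprocess.run(cmd, check=True).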
This local wordlist will be different for every document, so it makes sense to provide a different list for each one. Reading the man page can mislead you to think that such a supplemental wordlist must be named with the suffix specified by the user_words_suffix parameter, and/or that it has to be placed in the tessdata directory, which would allow only one such file for every language. Actually, it's possible to have a single supplemental wordlist if it's in the tessdata directory and has the specified suffix; but that doesn't prevent you from having a local list with an arbitrary filename. You just have to specify the path to that file on the command line, either with tesseract's --user-words option, or with its -c option followed by a user_words_file= argument.
In recent versions of ocrmypdf, the local wordlist file can be named with a --user-words option on the ocrmypdf command line. If you are using version 3 of tesseract, you have to point to any local wordlist(s) in a local config file, which in turn can be named in a --tesseract-config option to ocrmypdf. Either way, a supplemental dictionary can be provided. (Note that you can have only one local wordlist file, and only one local config file. If you have more than one wordlist file, they should all be concatenated before being named in a command line or the local configuration file.) The wordlists do not need to be sorted.
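Since only one local wordlist can be named, several lists have to be merged first. A minimal sketch of that step follows; the function names are invented here, and dropping duplicate entries is my own tidying, not something tesseract requires.

```python
from pathlib import Path

def merge_words(*wordlists):
    """Merge several iterables of words, dropping blanks and duplicates."""
    seen = set()
    merged = []
    for words in wordlists:
        for word in words:
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    return merged

def merge_wordlist_files(paths, output_path):
    """Read one word per line from each file and write a single merged list."""
    lists = [Path(p).read_text(encoding="utf-8").splitlines() for p in paths]
    merged = merge_words(*lists)
    Path(output_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return merged

# Invented sample lists, for illustration only.
print(merge_words(["azimuth", "zenith"], ["zenith", "", "refraction "]))
```

The merged file is then the one to name with --user-words.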
Bear in mind that these dictionary words are only hints to tesseract; it isn't a spell-corrector like aspell. However, you can “load the dice” by changing the relative weights assigned to dictionary and non-dictionary words. These are the language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word variables, which are only 0.1 and 0.15 by default. Increasing these values should put more weight on the dictionaries. CAUTION: putting too much weight on the dictionaries will make the engine turn noise, or real words not in the dictionary, into dictionary words; so be careful.
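These two variables can be set in a local config file, which holds one variable name and its value per line. The sketch below just generates that text; the function name is made up, and the raised values of 0.2 and 0.3 are arbitrary examples rather than recommendations.

```python
def penalty_config(non_freq_dict_word=0.1, non_dict_word=0.15):
    """Return config-file text that sets the two dictionary penalties.

    The defaults here are the default values quoted in the text above.
    """
    lines = [
        f"language_model_penalty_non_freq_dict_word {non_freq_dict_word}",
        f"language_model_penalty_non_dict_word {non_dict_word}",
    ]
    return "\n".join(lines) + "\n"

# Arbitrary example values, modestly larger than the defaults.
print(penalty_config(0.2, 0.3), end="")
```

Save the output to a file and name that file in the --tesseract-config option of ocrmypdf, or as the config file argument at the end of a tesseract command line. It is probably wise to raise the values in small steps and check the results each time.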
The problem is that tesseract stores several different internal images for every letter, because a document might contain the glyph in several different font sizes and styles. So if you tell it to pay more attention to the bad glyphs, that shifts its mapping of shapes on the page to characters in the text encoding. Putting words that were missed into the dictionary shifts the detection criteria away from well-formed glyphs and favors bad ones.
Evidently, we need to tell it to pay a little more attention to the shapes it misinterpreted initially, while continuing to pay attention to the things it got right (rather than ignoring them). So all the correct words need to be included in the wordlist file; but we also need to include the words we can read but tesseract couldn't.
In short: we really need to make it look for just the words that really exist in the file, and ignore all other possibilities. Ideally, the dictionary should contain all the real words, and no others.
Then, in principle, we could force tesseract to accept only the words in this perfect dictionary by raising the penalties on non-dictionary words.
That still won't guarantee perfect character recognition, for two reasons. First, we won't have a list of all the real words in the PDF without manually checking every word in it; spell-checkers always seem to miss a few uncommon words that occur in real texts. And second, even if the new dictionary list were perfect, there still might be indistinct glyphs in the page images that tesseract could mistake for other characters that form a correctly spelled (but wrong) word. Human readers can usually fix such errors by understanding the context of the ambiguous word, but machines don't understand anything.
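A first approximation to such a dictionary can be assembled from the first-pass OCR text: keep the words that a reference list accepts, and add by hand the words you can read but tesseract couldn't. The sketch below assumes you already have the reference words in memory; the function name and the sample data are invented.

```python
import re

# Runs of letters, allowing internal apostrophes and hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def build_wordlist(ocr_text, known_words, extra_words=()):
    """Words from the OCR text that the reference list accepts, plus extras."""
    known = {w.lower() for w in known_words}
    found = {w for w in WORD.findall(ocr_text) if w.lower() in known}
    return sorted(found | set(extra_words))

# Invented sample: "horizcn" is a misread that the reference list rejects,
# so the correct word is supplied by hand.
first_pass = "The refraction of starlight near the horizcn"
reference = ["the", "refraction", "of", "starlight", "near", "horizon"]
print(build_wordlist(first_pass, reference, extra_words=["horizon"]))
```

The misread words drop out automatically, but the list of extras still has to come from a human reader, which is the first limitation mentioned above.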
Note: using a local dictionary tuned to the right context on the first pass could also help appreciably.
And there is another front end to tesseract called pdfsandwich that can split PDF pages into two columns. The problem with using it is that it does not pass as many options to tesseract itself as ocrmypdf does. So you might want to first use pdfsandwich to split the pages vertically, and then invoke tesseract on the separated columns to get the desired results.
Another problem with tables is that they have lots of whitespace between columns. You can help handle this by setting the parameter preserve_interword_spaces to 1 — but it does not preserve space at the left end of a line; instead, the lines are all left-justified (because there is no interword space to the left of the first word on a line). However, if you extract the text by using pdftotext with its -layout option, some whitespace at the left margin appears in the text.
A related problem is that tesseract tends to break up the rows of tables, often because the lines of type were not perfectly level. (That can be cured by using the -d option to ocrmypdf, which de-skews the tilted text.) You can keep all the parts of a table row together by setting the page segmentation mode (the -psm option in version 3 of tesseract, spelled --psm in later versions) to 4 or 6, instead of the default 3. Mode 4 keeps rows together even if they contain a variety of fonts; if you can rely on one font being used across an entire row, use psm mode 6.
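Putting the table settings together, a direct tesseract command might be built as in the sketch below. The function name and file names are placeholders, and the command is only printed, not run.

```python
import shlex

def tesseract_table_command(image, output_base, psm=6, keep_spaces=True):
    """Build (but do not run) a tesseract command suited to table pages."""
    # --psm is the spelling in tesseract 4 and later; version 3 used -psm.
    cmd = ["tesseract", image, output_base, "--psm", str(psm)]
    if keep_spaces:
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd

# Placeholder file names: page7.png in, page7.txt out.
print(shlex.join(tesseract_table_command("page7.png", "page7")))
# Mode 4 for rows that mix several fonts.
print(shlex.join(tesseract_table_command("page8.png", "page8", psm=4)))
```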
There is a parameter called textord_tablefind_recognize_tables, which is normally turned off, but can be turned on by setting it to 1 (i.e., True). Similarly, there is another called textord_show_tables. It appears that these only find tables with psm set to 1 to 4. These parameters seem to be used only in the layout analysis, without affecting the actual OCR of tables.
To extract the OCRed text from a PDF that has been processed by tesseract, you can use
pdftotext -layout
to show the text from the OCRed file. This will separate fields with <TAB> characters (or their equivalent: 8 spaces), which make it fairly unwieldy.
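The unwieldy spacing can be tamed by splitting each line of that text into fields wherever there is a tab or a run of several spaces. This is a rough sketch: the function name is invented, and treating two or more spaces as a column gap is an assumption that will fail for tables with tightly set columns.

```python
import re

def split_row(line, min_gap=2):
    """Split one line of layout text into fields at tabs or runs of spaces."""
    pattern = r"\t+| {%d,}" % min_gap
    return [field for field in re.split(pattern, line.strip()) if field]

# Invented sample row with mixed spacing.
row = "  Mean value     12.5        3.75\t8"
print(split_row(row))
```

Single spaces inside a field are kept, so multi-word labels survive as one field.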
Another way to copy table data from an OCRed image is to display the image with a browser, and copy the (invisible) OCR text to the clipboard by scanning the cursor along the rows on the displayed page. Then you can re-copy the text to a file from the clipboard. This involves a lot of mouse work, but it's still better than trying to copy a table manually.
A similar but less restrictive way to focus tesseract's attention on numbers is to set the parameter classify_bin_numeric_mode to 1 in your supplemental config file or command-line option. This, plus careful blacklisting of inappropriate characters, can produce fairly good OCR of numerical tables.
Notice that there are two “blacklisting” parameters: tessedit_char_blacklist and tessedit_char_unblacklist, which are often misunderstood. Blacklisting a character only prevents it from being produced in tesseract's output; but some other character will be produced in its place. Sometimes the minus signs in numerical tables get OCR'd as em-dashes or underscores; then blacklisting those will prevent them from appearing where minus signs should be in the extracted text.
Any character that is normally killed in a blacklist can be revived by putting it in a tessedit_char_unblacklist string. This can make the ocr engine look for some special glyph, like a case fraction or a special Unicode symbol.
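On the command line, each of these parameters is set with its own -c option. The sketch below assembles the options for a numerical table; the function name and file names are placeholders, and blacklisting the em-dash and underscore is just the example mentioned above.

```python
import shlex

def numeric_table_options(blacklist="", unblacklist=""):
    """Return tesseract -c options for numeric mode and character lists."""
    opts = ["-c", "classify_bin_numeric_mode=1"]
    if blacklist:
        opts += ["-c", f"tessedit_char_blacklist={blacklist}"]
    if unblacklist:
        opts += ["-c", f"tessedit_char_unblacklist={unblacklist}"]
    return opts

# Blacklist the em-dash (U+2014) and the underscore, which sometimes
# turn up where minus signs should be.
cmd = ["tesseract", "table.png", "table"] + numeric_table_options(blacklist="\u2014_")
print(shlex.join(cmd))
```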
One way to work around this problem is to put the words, or word fragments, used in the column headings into a supplemental wordlist, as described above.
For example, a table column that contains only hours and minutes must have only 1- or 2-digit numbers smaller than 13 in the hours column, and less than 60 in the minutes. You can put those small sets of numbers into your supplemental “user-words” dictionary, as long as the OCR engine is looking for numbers as well as dictionary words. Even if there are other columns in the table with less limited numerical values, this might still help prevent a mis-OCRed value of 84 appearing in a “minutes” column.
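Such a list is easy to generate rather than type. In the sketch below, the function name is invented, and including both the plain and the zero-padded forms (7 and 07) is an assumption about how the table is printed.

```python
def clock_words():
    """Strings that can legitimately appear in hours or minutes columns."""
    # Hours: 1 to 12, with and without a leading zero.
    hours = {str(h) for h in range(1, 13)} | {f"{h:02d}" for h in range(1, 13)}
    # Minutes: 0 to 59, with and without a leading zero.
    minutes = {str(m) for m in range(60)} | {f"{m:02d}" for m in range(60)}
    # Shorter strings first, then in ordinary string order.
    return sorted(hours | minutes, key=lambda s: (len(s), s))

words = clock_words()
print(len(words), "entries;", "84" in words)
```

Because a wordlist knows nothing about columns, the combined list can only exclude values of 60 or more; it cannot stop a valid minutes value like 45 from appearing in an hours column.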
Copyright © 2023 – Andrew T. Young