tesseract
Web: https://tesseract-ocr.github.io/
Note: the Chrome/Edge/Firefox extension OCR Image Reader calls tesseract.js.
Installation
module load ceuadmin/leptonica/1.85.0
module load ceuadmin/libgcrypt
module load ceuadmin/autoconf
wget -qO- https://github.com/tesseract-ocr/tesseract/archive/refs/tags/5.5.0.tar.gz | \
tar xvfz -
cd tesseract-5.5.0/
./autogen.sh
./configure --prefix=$CEUADMIN/tesseract/5.5.0 CXXFLAGS="-std=c++17" LDFLAGS="-lstdc++fs"
make && make install
# Languages
wget -qO- https://github.com/tesseract-ocr/tessdata_best/archive/refs/tags/4.1.0.tar.gz | \
tar xvfz - -C /usr/local/Cluster-Apps/ceuadmin/tesseract/5.5.0/share
export TESSDATA_PREFIX="/usr/local/Cluster-Apps/ceuadmin/tesseract/5.5.0/share/tessdata_best-4.1.0"
cd $TESSDATA_PREFIX
ln -s /usr/local/Cluster-Apps/ceuadmin/tesseract/5.5.0/share/tessdata/configs
# Modern Greek
wget https://github.com/tesseract-ocr/tessdata_best/raw/main/ell.traineddata
# Ancient Greek
wget https://github.com/tesseract-ocr/tessdata_best/raw/main/grc.traineddata
# Equation detection
wget https://github.com/tesseract-ocr/tessdata_best/raw/main/equ.traineddata
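Before moving on to testing, it can help to confirm that the extra models actually landed in the tessdata directory. The following is a minimal sketch, not part of the original installation: the function name missing_traineddata is hypothetical, and it assumes Tesseract looks for files named <lang>.traineddata directly under $TESSDATA_PREFIX.

```shell
# Hypothetical helper: list languages whose model file is missing.
# Usage: missing_traineddata DIR LANG...
missing_traineddata() {
    dir=$1
    shift
    for lang in "$@"; do
        # Assumption: models are stored as DIR/<lang>.traineddata
        [ -f "$dir/$lang.traineddata" ] || printf '%s\n' "$lang"
    done
}

# Check the three models fetched above; prints nothing if all are present.
missing_traineddata "${TESSDATA_PREFIX:-.}" ell grc equ
```

Any language printed by the call would need to be downloaded again before it can be passed to -l.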
Testing
The installed languages can be listed, and OCR run on an image, as follows:
module load ceuadmin/tesseract
tesseract --list-langs
# ==> ucam.txt
tesseract ucam.jpeg ucam -l eng
# ==> ucam.hocr
tesseract ucam.png ucam -l eng --psm 3 -c tessedit_create_hocr=1
# ==> ucam.pdf
tesseract ucam.png ucam -l eng --psm 3 -c tessedit_create_pdf=1
module load ceuadmin/libiconv ceuadmin/poppler/0.84.0
pdffonts ucam.pdf
The last command gives:
$ pdffonts ucam.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
GlyphLessFont CID TrueType Identity-H yes no yes 3 0
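The emb column of the listing shows whether each font is embedded in the PDF. As a rough sketch, the check can be automated by counting fonts whose emb value is not "yes". The function name count_unembedded is hypothetical, and the parsing assumes the column layout shown above, where the last five fields are emb, sub, uni and the two-number object ID.

```shell
# Hypothetical helper: count fonts that pdffonts reports as not embedded.
# Reads pdffonts output on stdin and skips the two header lines.
# Assumption: the trailing fields are emb, sub, uni, object number, generation,
# so emb is field NF-4 even when the type column contains a space.
count_unembedded() {
    awk 'NR > 2 && NF >= 6 && $(NF-4) != "yes" { n++ } END { print n + 0 }'
}

# Example using the listing shown above.
count_unembedded <<'EOF'
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
GlyphLessFont                        CID TrueType      Identity-H       yes no  yes      3  0
EOF
```

With poppler loaded, the same function can be fed directly, e.g. pdffonts ucam.pdf | count_unembedded.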
Tesseract + OCRmyPDF + ghostscript / img2pdf
Several experiments are conducted below:
source ~/rds/software/py3.11/bin/activate
pip install ocrmypdf
# remove the alpha channel from the PNG
convert ucam.png -alpha remove -alpha off ucam_noalpha.png
convert ucam_noalpha.png ucam_noalpha.pdf
ocrmypdf --tesseract-config hocr -l eng+ell ucam_noalpha.pdf ucam_ocr.pdf
# ghostscript, img2pdf
module load ceuadmin/ghostscript/9.56.1
module load ceuadmin/jbig2enc/0.30
module load ceuadmin/pngquant/3.0.3
## 1st attempt
ocrmypdf -j 5 --force-ocr --optimize 3 --tesseract-timeout 300 -l eng+ell \
Formulas\ and\ Theorems\ for\ the\ Special\ Functions\ of\ Mathematical\ Physics\,\ 3e.pdf temp_ocr.pdf
gs -o out.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress temp_ocr.pdf
pdffonts temp_ocr.pdf
## 2nd attempt
pdftoppm -r 450 Formulas\ and\ Theorems\ for\ the\ Special\ Functions\ of\ Mathematical\ Physics\,\ 3e.pdf page -png
img2pdf page-*.png -o image_only.pdf
ocrmypdf -j 5 --force-ocr --optimize 3 --tesseract-timeout 300 -l eng+ell image_only.pdf out2.pdf
where jbig2enc gives smarter compression of text (bilevel) images and pngquant compresses color images. We see that:
## 1st attempt
$ ocrmypdf -j 5 --force-ocr --optimize 3 --tesseract-timeout 300 -l eng+ell \
Formulas\ and\ Theorems\ for\ the\ Special\ Functions\ of\ Mathematical\ Physics\,\ 3e.pdf temp_ocr.pdf
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 516/516 0:00:00
Start processing 5 pages concurrently ocr.py:96
...
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 516/516 0:00:00
Postprocessing... ocr.py:144
PDF/A conversion ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 516/516 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP _metadata.py:63
metadata.
Linearizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
PNGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 514/514 0:00:00
Image optimization ratio: 1.19 savings: 16.1% _pipeline.py:1002
Total file size ratio: 2.06 savings: 51.4% _pipeline.py:1005
Output file is a PDF/A-2B (as expected) _common.py:474
$ gs -o out.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress temp_ocr.pdf
GPL Ghostscript 9.56.1 (2022-04-04)
Copyright (C) 2022 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 516.
$ pdffonts temp_ocr.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
NVDUUB+GlyphLessFont CID TrueType Identity-H yes yes yes 2034 0
ZJFOCW+GlyphLessFont CID TrueType Identity-H yes yes yes 1631 0
PHSYEO+GlyphLessFont CID TrueType Identity-H yes yes yes 1734 0
QXKYDM+GlyphLessFont CID TrueType Identity-H yes yes yes 1838 0
EHRHOZ+GlyphLessFont CID TrueType Identity-H yes yes yes 1943 0
ORTKYM+GlyphLessFont CID TrueType Identity-H yes yes yes 2133 0
## 2nd attempt
$ ocrmypdf -j 5 --force-ocr --optimize 3 --tesseract-timeout 300 -l eng+ell image_only.pdf out2.pdf
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 526/526 0:00:00
Start processing 5 pages concurrently ocr.py:96
    3 [tesseract] read_params_file: Can't open txt tesseract.py:257
...
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 526/526 0:00:00
Postprocessing... ocr.py:144
PDF/A conversion ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 526/526 0:00:00
Linearizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4/4 0:00:00
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4/4 0:00:00
PNGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Image optimization ratio: 1.01 savings: 0.7% _pipeline.py:1002
Total file size ratio: 0.98 savings: -1.6% _pipeline.py:1005
Output file is a PDF/A-2B (as expected) _common.py:474
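The ratio and savings figures in the logs above appear to be related as ratio = input size / output size and savings = 1 - output size / input size, so a ratio below 1 (as in the 2nd attempt) means the output grew. The sketch below reproduces that arithmetic for any two sizes in bytes; the function name size_stats is hypothetical and the formula is an assumption inferred from the logged numbers, not taken from the OCRmyPDF source.

```shell
# Hypothetical helper: size ratio and percentage savings for two byte counts.
# Usage: size_stats IN_BYTES OUT_BYTES
# Assumption: ratio = in/out and savings = (1 - out/in) * 100.
size_stats() {
    awk -v in_b="$1" -v out_b="$2" 'BEGIN {
        printf "ratio: %.2f savings: %.1f%%\n", in_b / out_b, (1 - out_b / in_b) * 100
    }'
}

# A file halved in size.
size_stats 2000 1000   # prints: ratio: 2.00 savings: 50.0%
```

For real files the byte counts can be taken from wc -c, e.g. size_stats "$(wc -c < image_only.pdf)" "$(wc -c < out2.pdf)".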
out.pdf keeps all the bookmarks and is smaller, whereas out2.pdf has no bookmarks and is much larger. In both cases the PDF can be checked, e.g.,
pdffonts out.pdf
pdftotext out.pdf - | less
Nevertheless, judging from the output of pdffonts temp_ocr.pdf, one might as well not use ghostscript at all.
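The pdftotext check above can also be turned into a simple pass/fail test of whether a PDF carries a usable text layer. This is only a sketch: the function name has_text and the word-count threshold are hypothetical choices, and a word count says nothing about OCR accuracy.

```shell
# Hypothetical helper: succeed if stdin contains at least MIN words (default 1).
# Usage: pdftotext FILE.pdf - | has_text [MIN]
has_text() {
    min=${1:-1}
    # Strip padding that some wc implementations add around the count.
    words=$(wc -w | tr -d ' ')
    [ "$words" -ge "$min" ]
}

# Example on a short string standing in for extracted text.
printf 'Bessel functions\n' | has_text 2 && echo "text found"
```

With poppler loaded, a call such as pdftotext out.pdf - | has_text 100 would flag an output whose text layer is missing or nearly empty.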