
You might also find the PDF toolkit of use. A full list of PDF software is available on Wikipedia.

Edit: Since you do need OCR capabilities, I think you'll have to try a different tack (i.e., I couldn't find a Linux pdf2text converter that does OCR).

gs: The command below should convert a multipage PDF to individual TIFF files.

    gs -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=letter -sOutputFile=filename_%04d.tif -dNOPAUSE -dBATCH -- filename

ImageMagick utilities: There are other questions on the SuperUser site about using ImageMagick that you might use to help you do the conversion.
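Before committing to OCR, a quick check can tell whether a given PDF is image-only. This is a minimal sketch, assuming pdftotext (from poppler-utils) is installed; the function names and the 20-byte threshold are made up for illustration, not taken from any tool.

```shell
# Count non-whitespace bytes arriving on stdin.
text_byte_count() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Succeeds when pdftotext finds (almost) no text in the PDF.
# $1: PDF path; $2: optional threshold (assumed default: 20 bytes).
# Note: a missing pdftotext or an unreadable file also counts as "no text".
needs_ocr() {
    # "-" as the output file makes pdftotext write to stdout
    count="$(pdftotext "$1" - 2>/dev/null | text_byte_count)"
    [ "${count:-0}" -lt "${2:-20}" ]
}
```

Something like `if needs_ocr scan.pdf; then echo "image-only, run OCR"; fi` would then decide which route to take.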
pdftotext handles PDFs that already contain a text layer. If it's not on your machine, you'll have to install the poppler-utils package:

    sudo apt-get install poppler-utils

Yes, you are wrong in believing that POI will do that. Apache POI works with Microsoft Office file formats, which PDF isn't. You'll either want to use Apache PDFBox directly, or use Apache Tika, which will do both Microsoft Office and PDF file formats (amongst many others).

I have used hocr2pdf to recreate PDFs out of the original image-only PDFs and OCR results. This way you can create "searchable" PDFs from which you can copy text. Sadly, hocr2pdf does not appear to support creating multi-page PDFs, so you might have to create a script to handle them:

    #!/bin/bash
    # Run OCR on a multi-page PDF file and create a new, searchable PDF.
    input="$1"    # assumed: source PDF is the first argument
    output="$2"   # assumed: result PDF is the second argument
    tmpdir="$(mktemp -d)"

    # extract images of the pages (note: resolution hard-coded)
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"

    # OCR each page individually and convert into PDF
    for page in "$tmpdir"/page-*.tiff; do
        base="${page%.tiff}"
        cuneiform -f hocr -o "$base.html" "$page"
        hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
    done

    # combine the per-page PDFs into one file
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf
    rm -rf -- "$tmpdir"

Please note that the above script is very rudimentary. For example, it does not retain any PDF metadata.
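Since the OCR pipeline depends on three external programs (gs, cuneiform, hocr2pdf), a preflight check can save a half-finished run. A small sketch; `missing_tools` is a hypothetical helper name, and it only checks that each command is on the PATH, not that the version is suitable.

```shell
# Print the name of every given command that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Possible use at the top of an OCR script:
#   missing="$(missing_tools gs cuneiform hocr2pdf)"
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```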

I have had success with the BSD-licensed Linux port of the Cuneiform OCR system. No binary packages seem to be available, so you need to build it from source. Be sure to have the ImageMagick C++ libraries installed to have support for essentially any input image format (otherwise it will only accept BMP). While it appears to be essentially undocumented apart from a brief README file, I've found the OCR results quite good. The nice thing about it is that it can output position information for the OCR text in hOCR format, so that it becomes possible to put the text back in the correct position in a hidden layer of a PDF file.
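To see what that position information looks like: hOCR is HTML in which elements carry a `title` attribute containing `bbox x0 y0 x1 y1` in pixel coordinates. The sketch below pulls those boxes out of an hOCR file. `hocr_bboxes` is a made-up helper name, and exactly which elements Cuneiform annotates is an assumption I have not verified.

```shell
# Read hOCR on stdin and print one "x0 y0 x1 y1" bounding box per line.
hocr_bboxes() {
    grep -oE 'bbox [0-9]+ [0-9]+ [0-9]+ [0-9]+' | cut -d' ' -f2-
}

# Possible use, after running cuneiform on a single page image:
#   cuneiform -f hocr -o page.html page.bmp
#   hocr_bboxes < page.html
```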
