I have been doing a lot of OCR, as I study more than 100 old aircraft manuals to see how aviation procedures evolved. I have them all in a database, and it’s useful to search the DB for key terms like V1 and density altitude. In the end, no single OCR program did everything, and I have ended up with 3. (OCR = Optical Character Recognition = takes scanned documents and makes them searchable, copyable, etc.) Here are some notes on my experience, with the goal of saving time for others in the future.
The cheapest is PDFpen, which is an inexpensive (and easier to use) replacement for Adobe Acrobat. The second is Adobe Acrobat itself. It turns out to have an annoying bug: it refuses to process documents that include even a single page which has already been converted to text. So I dropped another $100 on Abbyy Finereader, which is a single-purpose OCR program that is the most sophisticated and diligent about OCR.
Some of the original files I’m dealing with were photographed with exquisite care and resolution, and as a result are more than 500 MB. Although I can work with files that size, the total database is now over 50 GB, and is stretching my solid-state drive. (My machine = MacBook Pro with Retina display, 4 cores, 8 GB of main memory, 500 GB solid-state drive.) I certainly don’t need this kind of resolution, so I have been trying to shrink the biggest ones. It turns out that 2 of these 3 OCR programs do a great job of shrinking these files, with no visible loss of resolution, and almost no added effort on my end.
- Initial file size: 76 MB, 60 pages, with some color mainly due to aging.
- PDFpen (Pro version): Final size = 78 MB. No compression; the file actually grew slightly.
- Adobe Acrobat: Final size = 10.0 MB
- Finereader: Final size = 11.0 MB
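To put those numbers in perspective, here is a quick back-of-the-envelope sketch in Python. The sizes are the ones listed above; the function names are my own, made up for illustration, and none of this comes from the OCR programs themselves.

```python
# Sizes in MB, copied from the test-file results above.
ORIGINAL_MB = 76.0
RESULTS_MB = {
    "PDFpen Pro": 78.0,
    "Adobe Acrobat": 10.0,
    "Finereader": 11.0,
}


def compression_ratio(original_mb, final_mb):
    """How many times smaller the output is than the input."""
    return original_mb / final_mb


def percent_saved(original_mb, final_mb):
    """Percentage of the original size saved (negative if the file grew)."""
    return 100.0 * (original_mb - final_mb) / original_mb


def summarize(original_mb, results_mb):
    """Return {program: (ratio, percent saved)}, each rounded to one decimal."""
    return {
        name: (
            round(compression_ratio(original_mb, size), 1),
            round(percent_saved(original_mb, size), 1),
        )
        for name, size in results_mb.items()
    }


if __name__ == "__main__":
    for name, (ratio, saved) in summarize(ORIGINAL_MB, RESULTS_MB).items():
        print(f"{name}: {ratio}x smaller, {saved}% saved")
```

For this one file, Acrobat and Finereader both come out at roughly a 7x reduction, while PDFpen's output is slightly larger than the input. This is a single 60-page sample, so I would not assume the same ratios hold for every scan.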
In terms of speed, PDFpen is the fastest. Finereader can be very slow, especially when it runs into figures that it “thinks” might actually be text. A single page of a very detailed graph can take 2 minutes, and I’ve needed to run some 700-page documents overnight. Acrobat is in the middle.
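If you want a rough feel for how long a big job might take, a sketch like the following gives an upper bound. The 2 minutes per page is the worst case I saw on detailed graphs, not an average, and the 8-hour "night" is just an assumption for illustration.

```python
def worst_case_hours(pages, minutes_per_page):
    """Upper bound on OCR time if every page took the worst-case time."""
    return pages * minutes_per_page / 60.0


def pages_per_night(night_hours, minutes_per_page):
    """Whole pages that fit in one night at a fixed per-page time."""
    return int(night_hours * 60 // minutes_per_page)


if __name__ == "__main__":
    # 700 pages at the worst-case 2 minutes each.
    print(f"Worst case: {worst_case_hours(700, 2):.1f} hours")
    # Assumed 8-hour night, still at the worst-case rate.
    print(f"Pages per night: {pages_per_night(8, 2)}")
```

In practice most pages are plain text and go much faster, which is why a 700-page manual can finish overnight even though the all-worst-case bound is close to a full day.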
As a side note, Acrobat has a number of settings for compression. You can shrink documents by another factor of 2 if you settle for slightly blurred text. Or you can set it so there is no compression at all. Finereader may have an on/off switch for compression, but I’ve never investigated in detail. PDFpen treats OCR as an afterthought, and seems to have no controls except selecting your language.
I am not attempting a comprehensive review or comparison here. If you really want a dedicated OCR program, Finereader is probably the way to go. PDFpen Pro is at the other extreme: its OCR is basic, but it is good enough as a PDF editor that it can replace both Apple Preview and Adobe Acrobat.
Comments and your own experiences are welcome.