Adobe has always called the OCR function in Acrobat "Paper Capture," and in previous versions of Acrobat it was a weak function. For example, in Acrobat 5.0 you had to download a plug-in to do Paper Capture, and even then you could only OCR 50 pages at a time. And, if you did, you'd learn that the process was slow. I guess the idea was that if you wanted to do serious OCR you'd pay Adobe for the high-end Capture product that they sell.
Well, I'm here to report that Acrobat Standard 6.0 does a great job with OCR/Paper Capture. I just OCR'd a 100-page deposition transcript (which was very good quality text) and Acrobat did the entire conversion process in about 4 minutes. At one point I timed it converting about 20 pages per minute (ppm), which would put the full run closer to 5 minutes, so treat the 4-minute figure as approximate.
I have noticed that this new version of Acrobat also converts TIFF files to PDF much more quickly. So, despite my initial dissatisfaction with the radically revamped interface, I have to say this new version has some real power where it counts.
Update: I OCR'd a 600-page batch of documents, many of which were not very clean copies (i.e., the sort of documents that make OCR engines choke and sputter). I knew it would run slower than 20 ppm (and it did), so I set it to do the work as I was leaving for the day. When I returned it had hung up midway through, but it wasn't a big deal. I clicked the dialog box that basically said "ignore this error in the future and keep working no matter what." It hummed along quite briskly and finished the task in about a half hour. The resulting file was about 87 MB, but I ran a "Save As" and it compressed down to 26 MB. Very nice.
When using Acrobat 6.0 to OCR, rather than Capture, is there a way to correct OCR mistakes?
Do you have any idea of how Acrobat compares with Capture in terms of OCR accuracy?
Posted by: Gene Koo | November 06, 2003 at 01:03 PM
I need to understand this feature better. Is it in Reader or only Acrobat? I have a 200-page PDF file that was created when someone used an HP document reader and sent it to me via email. So, it is full of images of pages, not the text. I need to get the text. Any help will be greatly appreciated. Email me directly if you'd like. -- Clint (clint@robotic.com)
Posted by: Clint Laskowski | December 12, 2003 at 05:55 PM
I am using the factory-fitted OCR in an HP 5400c scanner to scan huge copies of legal decisions, and its editing is poor. Please advise.
Posted by: enobng | January 12, 2004 at 12:23 PM
Second on the request for a batch OCR process. Anyone know how to do it?
Posted by: DJ | July 17, 2004 at 04:08 PM