Optical Character Recognition (or 'OCR') is a great tool. As most of you know, a scanned file is basically just an image. Even though the image may be a document that contains words, the computer regards those words as nothing more than pixels to display. A word-processing file, by contrast, is an assemblage of characters that the computer can recognize as such, which is why you can word search a text-based document but not a scanned image. Unless, that is, you OCR the image file.
When you tell the computer to do OCR you are asking it to do something very sophisticated. The computer has to analyze each assemblage of pixels to determine what character that assemblage might be. The cleaner the pixels, the better the chance that the computer will guess the right character.
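To make that idea concrete, here is a toy sketch in Python. It is not how Acrobat or any real OCR engine works; it only illustrates the general notion of comparing a block of pixels against known character shapes and picking the closest one. The 3x3 templates and the function names are made up for the illustration.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# Each "character" is a hypothetical 3x3 grid of pixels ("1" = ink, "0" = blank).
TEMPLATES = {
    "I": ["010",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
}


def mismatches(glyph, template):
    """Count the pixels where the scanned glyph differs from a template."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def guess_character(glyph):
    """Return the best-matching character and how many pixels disagreed."""
    best = min(TEMPLATES, key=lambda ch: mismatches(glyph, TEMPLATES[ch]))
    return best, mismatches(glyph, TEMPLATES[best])


clean_scan = ["100", "100", "111"]   # a crisp "L"
noisy_scan = ["100", "110", "111"]   # the same "L" with one stray pixel

print(guess_character(clean_scan))
print(guess_character(noisy_scan))
```

The clean scan matches its template exactly, while the noisy one only wins by a margin. Add a few more stray pixels (think of a fax copy) and the guess can tip over to the wrong character, which is why clean scans matter.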
Adobe Acrobat has long had an OCR function, but in prior versions it was called "paper capture." Acrobat 6.0 was the first version that, to my mind, handled OCR reliably. Acrobat 7.0 does an even better job, although it introduces quirks in other areas that I'm not crazy about. In any event, the OCR/Paper Capture function is a great tool in Acrobat because it keeps the image file intact but identifies the characters in the image file so that you can search across a document set for keywords. Obviously, this is a nice tool for litigators who deal with document productions. So even though it takes a fair amount of time to OCR a document (approximately 15 seconds per page, more or less), it's often worthwhile. Which brings me to reader mail.
Today I got a great question from a reader about a problem he had when he ran the OCR function in Acrobat 6.0:
One question regarding the OCR function – have you come across the problem where part of a scanned page has “renderable text” but the remainder does not? Apparently Acrobat 6.0 decides that it cannot OCR the remainder of the page, a dialog box appears acknowledging the problem, and you either cancel the OCR or move on to the next page.
This seems to have happened to me in one production where the Bates ranges are text, but the rest of the page is scanned. I’d assume this is because the documents were scanned and then some program like Easy Bates or something similar applied a Bates range to the PDF.
Any thoughts?
I have indeed had problems OCR'ing documents after I've done something to them (like Bates-stamping them electronically). In other words, OCR works best if you run it right after you've scanned the documents. And of course, as I said before, it works best on clean copies (fax copies, for example, usually won't give you good results). So, if you intend to OCR your PDF documents, it's best to do that first and then apply the Bates stamp. Of course, if any of you readers out there have other observations, please share them in the comments section. Thanks.
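If it helps to see why the order matters, here is a small toy model in Python of the behavior the reader describes. To be clear, this is not Acrobat's API or its actual logic; the `Page` class and both functions are hypothetical, and the only assumption built in is that OCR skips any page that already carries renderable text.

```python
from dataclasses import dataclass


@dataclass
class Page:
    has_renderable_text: bool = False  # e.g. an electronically applied Bates number
    searchable: bool = False           # True once OCR has recognized the scanned image


def bates_stamp(pages):
    """Model an electronic stamp: it adds real (renderable) text to every page."""
    for page in pages:
        page.has_renderable_text = True


def run_ocr(pages):
    """Model the reported behavior: skip pages that already contain renderable text.

    Returns the number of pages that were skipped.
    """
    skipped = 0
    for page in pages:
        if page.has_renderable_text:
            skipped += 1
        else:
            page.searchable = True
    return skipped


# OCR first, then stamp: nothing is skipped.
ocr_first = [Page() for _ in range(3)]
print("skipped:", run_ocr(ocr_first))
bates_stamp(ocr_first)

# Stamp first, then OCR: every page is refused.
stamp_first = [Page() for _ in range(3)]
bates_stamp(stamp_first)
print("skipped:", run_ocr(stamp_first))
```

In the first run all three pages end up searchable and stamped. In the second, the stamp gets there first and the scanned portion of every page stays unsearchable.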
I have found this message, too, telling me that Acrobat refuses to OCR my page because it has "renderable text" in the form of a small page number at the bottom. Unfortunately, there is no button for "OCR the rest of the page and never mind the renderable text".
My solution: After using the program to stamp page numbers, I "save as" the document with a trailing "-p", and keep the original for later use. Thus I always have an alternative without the renderable text, as well as a pristine original in case I find that the pages need to be reordered and repaginated at a later date.
Posted by: M. Sean Fosmire | April 16, 2006 at 11:59 PM
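For anyone who wants to script the naming habit described in the comment above, here is a minimal Python sketch. The function name and the example file name are hypothetical; the only thing taken from the comment is the trailing "-p" on the stamped copy, with the original left untouched.

```python
from pathlib import Path


def stamped_copy_name(original: str, suffix: str = "-p") -> str:
    """Build the file name for the page-numbered copy, e.g. report.pdf -> report-p.pdf."""
    path = Path(original)
    return str(path.with_name(path.stem + suffix + path.suffix))


# Hypothetical example file name; the original file itself is never modified.
print(stamped_copy_name("smith_production.pdf"))
```

Keeping the suffix as a parameter makes it easy to use a different marker if "-p" already means something else in your file naming.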
Dear Sir,
Our team has five years' experience in this field. We have 10 systems and a staff of 25 working three shifts, with an output quality of 99.995%. Is there any outsourcing work for PDF-to-DOC conversion?
Best regards,
A.Shankar.
Aarks Solutions
Mob :- +919884332298.
Land Line :- +914443552298.
Posted by: A.Shankar | April 19, 2006 at 01:47 AM
Somebody always looking to make a buck.
Anyway, Readiris Pro 11 for OS X does a good job of OCR. It offers the "text beneath the scanned image" option for PDFs. I would recommend trying it out for 30 days to see if you are happy with the results. Acrobat is just too costly for me.
Posted by: Kurtois | April 21, 2006 at 12:32 AM
CutePDF Professional costs $49.95, and it does Bates stamping of PDF documents. www.cutepdf.com.
Posted by: Robert | May 13, 2006 at 09:38 PM
A recommended workflow is to scan to TIFF, stamp the number, convert to PDF, and then OCR. The number is part of the image, but it: 1) is clear and usually gives a clean, accurate OCR result; and 2) can easily be referenced and included as metadata (either a keyword or a custom document property) in the PDF.
Posted by: Noah Katz | June 01, 2006 at 11:24 AM
DMI is a document management company that specializes in high-volume PDF conversion from paper and microfilm / microfiche. We have had this and several other problems when trying to convert high volumes of images (tens or hundreds of thousands) into fully text-searchable PDF using Adobe products. Finally, out of frustration, we wrote our own conversion tools, which are extremely accurate, FAST and reliable. Although they are not currently retail ready due to licensing issues, we will perform PDF conversion from TIFF images using our software for an extremely reasonable rate. We have a server bank capable of converting about 30,000 images per day from TIFF to full-text PDF. Please EMAIL me or call (1-800-DMI-4210) if interested or if I can be of assistance in any other way. Thank you!
Posted by: William Swezey | June 01, 2006 at 12:19 PM
I have also run into this problem (trying to OCR a document that has some rendered text already on the target pages). This really is an annoying situation, and I wish there were some way to direct Acrobat to OCR user-designated page regions, the way some scan programs allow you to designate which part of the page to scan/OCR. I work in the energy industry and constantly access large filing documents posted to our regulator's site (www.ferc.gov). Since major filings are still made on paper, FERC scans some of these submittals to tif and then converts them to non-text-searchable pdf files. FERC places a (text-recognized) doc info strip on the document. When we try to OCR these pages we get the dreaded message referenced in this forum thread.
After lots of tinkering we have found a workaround for this problem, although it does come with some processing costs. The solution is to convert the entire file back to a completely non-text-searchable document and then run the OCR operations on that.
After opening one of these "mixed" files in Acrobat, select File>Print from the Acrobat menu and then choose Adobe printer as the target printer. At the bottom of the Print dialog box is a button labeled "Advanced." Selecting this opens another dialog box. In the upper portion of that dialog box is a checkbox labeled "Print As Image." Check this box, being mindful of the default resolution displayed (change it if you like). Close the Advanced box, return to the Print dialog box and then press Print. The resulting pdf file is now entirely NON-text-searchable. At this point simply run the OCR operations again to make the entire file text searchable.
All Acrobat users know that there is a rabbit hole's worth of switches that can be thrown at any given juncture in an Acrobat operation, but I believe the above suggestion should work for novice users with the default values. I'm a long-time Acrobat user with an abiding admiration for what Adobe has done with pdf, even though there are many features that drive me crazy. Hope this suggestion helps some of the users out there.
Thanks for the forum
Posted by: Erich Hunt | June 19, 2006 at 03:06 PM
New Acrobat 7.0.8 Professional user.
My frustration was (and is) trying to highlight portions of a PDF page that has come to me on a CD or as an email attachment. I also run into this "renderable text" problem, and although I can use the "Notes" OK, I am unable to OCR and highlight.
Interestingly enough, it seems that some CD files WILL allow me to OCR them; others will not. I am beginning to understand from the threads that this is most likely due to the way the file was saved/scanned/put on the CD in the first place.
I'll try some of the solutions above and see what happens.
Does this make sense to you experts? It seems to me as a "newbie" that Professional should allow highlighting and underlining without all this carry-on...
Many thanks for your help.
Posted by: David Reed | September 02, 2006 at 05:53 PM
A page with that renderable-text problem can be set right by working on it in Photoshop. Increase the levels, sharpen the text, and, if possible, increase the image resolution quite a bit. Then OCR the page.
Posted by: naga sugavanan | September 20, 2006 at 01:26 AM
Short and sweet. Acrobat is absolutely horrible for "OCR'ing" a document. Half of the characters it renders are wrong.
I recommend NOT using Acrobat and trying something else along the lines of ABBYY FineReader.
Posted by: Jon Minor | January 10, 2007 at 08:20 PM
I tried it, and it worked. Thank you.
Posted by: Fern McBee | November 28, 2007 at 04:02 PM
I see this talks about Acrobat 6.0, but I have Acrobat 9.0. Have you guys worked out an updated way to get around the Bates-stamping problem?
I need to convert a TIFF to PDF, then Bates-stamp it, then OCR the file. However, it won't OCR after I Bates-stamp it.
Posted by: Daniel | February 17, 2010 at 07:26 PM