
April 14, 2006

Listed below are links to weblogs that reference OCR, PDFs, and bates-numbered documents:

Comments


M. Sean Fosmire

I have found this message, too, telling me that Acrobat refuses to OCR my page because it has "renderable text" in the form of a small page number at the bottom. Unfortunately, there is no button for "OCR the rest of the page and never mind the renderable text".

My solution: After using the program to stamp page numbers, I "save as" the document with a trailing "-p", and keep the original for later use. Thus I always have an alternative without the renderable text, as well as a pristine original in case I find that the pages need to be reordered and repaginated at a later date.
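The naming habit described above can be scripted when many files are involved. This is only a minimal sketch: the function name `stamped_copy_path` and the example file names are hypothetical, and it only computes the new name; it does not stamp or save anything in Acrobat.

```python
from pathlib import Path


def stamped_copy_path(original, suffix="-p"):
    """Return the path for the page-numbered copy of a PDF.

    The original file is left alone; the stamped copy gets the same
    name with a trailing suffix before the extension ("-p" by default,
    as in the comment above).
    """
    original = Path(original)
    return original.with_name(f"{original.stem}{suffix}{original.suffix}")


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(stamped_copy_path("exhibits/deposition.pdf"))
```

Keeping the suffix as a parameter makes it easy to use a different marker if "-p" already means something else in your filing system.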

A.Shankar

Dear Sir,

Our team has 5 years' experience in this field. We have 10 systems and 25 staff working 3 shifts, and we deliver 99.995% output quality. Is there any outsourcing work for PDF-to-DOC conversion?

Best regards,


A.Shankar.
Aarks Solutions
Mob :- +919884332298.
Land Line :- +914443552298.

Kurtois

Somebody always looking to make a buck.

Anyway, Readiris Pro 11 for OS X does a good job of OCR. It offers the "text beneath the scanned image" option for PDFs. I would recommend trying it out for 30 days and seeing if you are happy with the results. Acrobat is just too costly for me.

Robert

CutePDF Professional is 49.95, and it does Bates stamping of PDF documents. www.cutepdf.com.

Noah Katz

A recommended workflow is to scan to TIFF, stamp the number, convert to PDF, and then OCR. The number becomes part of the image, but it 1) is clear and usually gives a clean, accurate OCR result; and 2) can easily be referenced and included as metadata (either a keyword or a custom document property) in the PDF.
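To illustrate the numbering and metadata part of that workflow, here is a rough sketch. Everything in it is an assumption for illustration: the function names, the "ABC" prefix, the six-digit width, and the TIFF file names. It only generates the numbers and a per-page record; the actual stamping, PDF conversion, and OCR still happen in whatever tools you use.

```python
def bates_numbers(prefix, start, count, width=6):
    """Generate consecutive Bates numbers such as ABC000001."""
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + count)]


def page_metadata(tiff_names, prefix, start, width=6):
    """Pair each scanned TIFF page with the Bates number to stamp on it.

    The resulting records could later be copied into a PDF keyword or
    custom document property.
    """
    numbers = bates_numbers(prefix, start, len(tiff_names), width)
    return [{"source": name, "bates": num}
            for name, num in zip(tiff_names, numbers)]


if __name__ == "__main__":
    # Hypothetical page files, for illustration only.
    pages = ["page_001.tif", "page_002.tif", "page_003.tif"]
    for record in page_metadata(pages, prefix="ABC", start=1):
        print(record["source"], "->", record["bates"])
```

Because the numbers are generated before stamping, the same list can drive both the stamp and the metadata, so the two cannot drift apart.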

William Swezey

DMI is a document management company that specializes in high-volume PDF conversion from paper and microfilm/microfiche. We have had this and several other problems when trying to convert high volumes of images (tens or hundreds of thousands) into fully text-searchable PDFs using Adobe products. Finally, out of frustration, we wrote our own conversion tools, which are extremely accurate, fast, and reliable. Although they are not currently retail ready due to licensing issues, we will perform PDF conversion from TIFF images using our software for an extremely reasonable rate. We have a server bank capable of converting about 30,000 images per day from TIFF to full-text PDF. Please email me or call (1-800-DMI-4210) if interested or if I can be of assistance in any other way. Thank you!

Erich Hunt

I have also run into this problem (trying to OCR a document that has some rendered text already on the target pages). This really is an annoying situation, and I wish there were some way to direct Acrobat to OCR user-designated page regions, the way some scan programs allow you to designate which part of the page to scan/OCR. I work in the energy industry and constantly access large filing documents posted to our regulator's site (www.ferc.gov). Since major filings are still made on paper, FERC scans some of these submittals to TIFF and then converts them to non-text-searchable PDF files. FERC places a (text-recognized) doc info strip on the document. When we try to OCR these pages we get the dreaded message referenced in this forum thread.

After lots of tinkering we have found a workaround for this problem, although it does come with some processing costs. The solution is to convert the entire file back to a completely non-text-searchable document and then run the OCR operations on that.

After opening one of these "mixed" files in Acrobat, select File>Print from the Acrobat menu and then choose the Adobe printer as the target printer. At the bottom of the Print dialog box is a button labeled "Advanced." Selecting this opens another dialog box. In the upper portion of that dialog box is a checkbox labeled "Print As Image." Activate this box, being mindful of the default resolution displayed (change it if you like). Close the Advanced box, return to the Print dialog box, and then press Print. The resulting PDF file is now entirely NON-text-searchable. At this point simply run the OCR operations again to make the entire file text searchable.
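For anyone handling a batch of files, the decision behind this workaround can be written down as a small rule. This is only a sketch of the logic, not anything Acrobat exposes: the `PageInfo` record, its two flags, and the action names are all hypothetical, and you would have to fill in the flags yourself (by inspection or with a PDF tool of your choice).

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    number: int
    has_text: bool    # page already carries renderable text (e.g. a stamp or info strip)
    is_scanned: bool  # page body is a scanned image that still needs OCR


def plan_document(pages):
    """Decide how to treat a whole document before OCR.

    Returns one of three hypothetical action names:
      "skip"             - nothing scanned, so no OCR is needed
      "ocr"              - scanned pages with no renderable text in the way
      "flatten_then_ocr" - a scanned page also has renderable text, so
                           print the file as an image first, then OCR
    """
    if not any(p.is_scanned for p in pages):
        return "skip"
    blocked = any(p.is_scanned and p.has_text for p in pages)
    return "flatten_then_ocr" if blocked else "ocr"


if __name__ == "__main__":
    # A scanned filing with a text info strip on page 1 (illustrative data).
    filing = [PageInfo(1, has_text=True, is_scanned=True),
              PageInfo(2, has_text=False, is_scanned=True)]
    print(plan_document(filing))
```

The rule treats the document as a unit because the Print As Image step described above flattens the whole file rather than individual pages.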

All Acrobat users know that there is a rabbit hole's worth of switches that can be thrown at any given juncture in an Acrobat operation, but I believe the above suggestion should work for novice users with the default values. I'm a longtime Acrobat user with an abiding admiration for what Adobe has done with PDF, even though there are many features that drive me crazy. I hope this suggestion helps some of the users out there.

Thanks for the forum

David Reed

New Acrobat 7.0.8 Professional user.

My frustration was (and is) trying to highlight portions of a PDF page that has come to me on a CD or as an email attachment. I also run into this "renderable text" problem, and although I can use the "Notes" OK, I am unable to OCR and highlight.

Interestingly enough, it seems that some CD files WILL allow me to OCR them; others will not. I am beginning to understand from the threads that this is most likely due to the way the file was saved, scanned, or put on the CD in the first place.

I'll try some of the solutions above and see what happens.

Does this make sense to you experts? It seems to me as a "newbie" that Professional should allow hilighting and underlining without all this carry-on...

Many thanks for your help.

naga sugavanan

That renderable-text page can be fixed by working on the page in Photoshop. Using the tools, increase the levels, sharpen the text, and if possible increase the image resolution quite a bit. Then run OCR on the page.

Jon Minor

Short and sweet. Acrobat is absolutely horrible for "OCR'ing" a document. Half of the characters it renders are wrong.

I recommend NOT using Acrobat and trying something else along the lines of ABBYY FineReader.

Fern McBee

I tried it, and it worked. Thank you.

Daniel

I see this talks about Adobe 6.0, but I have Adobe 9.0. Have you guys worked out an updated way to get around the Bates stamping problem?

I need to convert a TIFF to PDF, then Bates stamp it, then OCR the file. However, it won't OCR after I Bates stamp it.
