July 30, 2007

Scanning & OCR

Rick Borstein, who runs the Acrobat for Legal Professionals blog, has a great article on scanning and OCR with Acrobat 8.  The article appears at the equally wonderful LLRX.com site (run by Sabrina Pacifici).  Rick is extremely knowledgeable and his article is a must-read if you are interested in scanning.

12:14 PM in Acrobat 8.0, OCR/Paper Capture, Scanners, Workflow | Permalink | Comments (0) | TrackBack

June 06, 2007

OCR problems in Acrobat 8?

For some reason, I have run into a problem when doing OCR on my documents in Acrobat 8.  I don't think that it's an inherent problem with Acrobat 8 (Mac version) because I've been able to do OCR many times before without any problem.  I think somehow I've engaged a preference setting or changed something inadvertently.

Here's what happens.  When I run OCR everything proceeds normally until I look at the resulting file.  For some reason, the margins of the original document have been shrunk to the point that text along the bottom (e.g. the page numbering and the last line of text) has disappeared.  Obviously, this is not acceptable.

I plan to email the folks at Adobe to find out how to troubleshoot this problem, and when I have an answer to the problem I'll report back.   Meanwhile, if anyone else has experienced this problem, or if anyone has a solution please leave a comment.

12:33 PM in Acrobat 8.0, OCR/Paper Capture | Permalink | Comments (0) | TrackBack

April 14, 2006

OCR, PDFs, and bates-numbered documents

Optical Character Recognition (or 'OCR') is a great tool.  As most of you know, when you have a scanned file it's basically just an image.  Even though the image may be a document that contains words, the computer regards those words as pixels that it displays.  A word-processing file, by contrast, is an assemblage of characters that the computer can recognize as such, which is why you can word search a text-based document but not a scanned image.  Unless, that is, you OCR the image file.

When you tell the computer to do OCR you are asking it to do something very sophisticated.  The computer has to analyze each assemblage of pixels to determine what character that assemblage might be.  The cleaner the pixels the better chance the computer will guess right when it decides what character it is.
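For the curious, here is a toy sketch in Python of that guessing game. It is only an illustration and not how Acrobat or any real OCR engine works: the 3x5 "glyphs" and the function names are made up for this example, and the matching is a simple pixel-by-pixel comparison against known letter shapes.

```python
# Toy illustration only. Real OCR engines are far more sophisticated.
# Each hypothetical glyph is 5 rows of 3 pixels ("1" = ink, "0" = blank).
TEMPLATES = {
    "A": ("010", "101", "111", "101", "101"),
    "H": ("101", "101", "111", "101", "101"),
    "L": ("100", "100", "100", "100", "111"),
}


def match_score(glyph, template):
    """Count how many of the 15 pixels agree between a scan and a template."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def guess_character(glyph):
    """Return (best_matching_character, score) for a scanned glyph."""
    scored = ((char, match_score(glyph, tpl)) for char, tpl in TEMPLATES.items())
    return max(scored, key=lambda pair: pair[1])


clean_scan = ("010", "101", "111", "101", "101")  # a crisp "A"
noisy_scan = ("110", "101", "111", "101", "101")  # same "A" with one stray pixel
faxed_scan = ("100", "101", "111", "101", "101")  # same "A" with two bad pixels

print(guess_character(clean_scan))  # ('A', 15)
print(guess_character(noisy_scan))  # ('A', 14), and "H" is now only one pixel behind
print(guess_character(faxed_scan))  # ('H', 14), the guess has tipped over to the wrong letter
```

Two damaged pixels are enough to turn the "A" into an "H" in this toy, which is the same reason dirty copies give poor OCR results.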

Adobe Acrobat has long had an OCR function, but in prior versions it was called "paper capture."  Acrobat 6.0 was the first version that, to my mind, handled OCR reliably.  Acrobat 7.0 does an even better job, although it introduces other quirks in other areas that I'm not crazy about.  In any event, the OCR/Paper Capture function is a great tool in Acrobat because it keeps the image file intact but identifies the characters in the image file so that you can search across a document set for key words.  Obviously, this is a nice tool for litigators who deal with document productions.  And, so even though it takes a fair amount of time to OCR a document (approximately 15 seconds per page, more or less), it's often worthwhile.  Which brings me to reader mail.

Today I got a great question from a reader about a problem he had when he ran the OCR function in Acrobat 6.0:

One question regarding the OCR function – have you come across the problem where part of a scanned page had “renderable text” but the remainder does not?  Apparently Acrobat 6.0 decides that it cannot OCR the remainder of the page, a dialog box appears acknowledging the problem, and you either cancel the OCR or move on to the next page.

This seems to have happened to me in one production where the Bates ranges are text, but the rest of the page is scanned.  I’d assume this is because the documents were scanned and then some program like Easy Bates or something similar applied a Bates range to the PDF.

Any thoughts?

I have indeed had problems OCR'ing documents once I did something to them (like bates-stamping the documents electronically).  In other words, OCR works best if done right after you've scanned them.  And of course, as I said before, it works best on clean copies (i.e. fax copies are usually not going to give you good results).   So, if you intend to OCR your PDF documents then it's best to do that first, and then apply the bates-stamp.  Of course, if any of you readers out there have other observations please share them in the comments section.  Thanks.

11:32 AM in OCR/Paper Capture | Permalink | Comments (11) | TrackBack

June 28, 2005

The perfect scanner - more thoughts

At LegalTech there was a lot of information about great scanners.  One scanner that I was excited about was the Fujitsu ScanSnap, which is supposed to go for about $500 bundled with Acrobat Standard 7.0.  It's got a small footprint and supposedly scans at about 15 pages per minute.  I heard that getting drivers for the Mac is dicey, but the Fujitsu rep told me they would be out by September or so.

I was thinking the Fujitsu would be a good scanner to recommend (and it may well be) but then a knowledgeable tech consultant told me about the Xerox Documate 252.  Actually, he raved about it.  The Xerox 252 is a TWAIN-compliant sheet-fed scanner with a USB 2.0 connection.  It can handle 25 pages per minute, or 50 in duplex mode (i.e. scanning both sides of a double-sided document).  It's going for about $850 at Amazon.com right now, and there is a great review on the product page from someone who raves that this scanner "changed his life."  According to the reviewer the scanner works well with both Macs and PCs.

10:05 PM in OCR/Paper Capture, Scanners | Permalink | Comments (9)

June 07, 2005

Seeking the Perfect Scanner

Link: Seeking the Perfect Scanner.

Some of the most-asked questions here at PDF for Lawyers involve scanning paper documents. PDFzone has an interesting article on one fellow's search for a sheet-fed duplex (i.e., it scans two-sided documents) scanner that will capture images directly to PDF.  Also included in the article is a link to another article about software that scans directly to PDF.

~~ Dave

03:36 PM in OCR/Paper Capture, Scanners | Permalink | Comments (2)

August 20, 2004

Paper Capture quirk

I typically run 'paper capture' (i.e. OCR, or 'optical character recognition') on documents that I have scanned in as PDF files. The point of running the Paper Capture is to make the text in the document searchable and indexable. Lately, people I send my PDF files to have been complaining that they can't open them. I think I know what the problem is, and it's related to the Paper Capture function.

First of all, I have Acrobat 6.0 running on a Windows machine and that's what I use to do the Paper Capture on. I think that running the Paper Capture is making the file unreadable to people who only have the Adobe Reader program. The solution is to run a function called 'Reduce File Size' (Menu/File/Reduce File Size) and choose compatibility with Acrobat 4.0 and higher. This seems to render the file viewable by people using the simple Adobe Reader program.

Obviously, the whole point of using PDF files is that people should be able to view them easily. Maybe this problem is unique to my computer but somehow I think not. So, if you run 'Paper Capture' it is a good idea to run the 'Reduce File Size' operation immediately after you do that and save that file as your final version.

11:25 AM in OCR/Paper Capture | Permalink | Comments (2) | TrackBack

April 16, 2004

OCR Tutorial for Acrobat 4 and 5

The process for doing Optical Character Recognition using Acrobat 4 or 5 is similar to that outlined in the previous post on Acrobat 6. You open and convert an image file, and run the "Paper Capture" tool on it.

The most significant difference in the earlier versions is that, until Acrobat 6, you could only OCR 50 pages at a time. That limitation really hinders the usefulness of the earlier versions, and although you can work around it, it's a pain. Nevertheless, here are the steps. They are virtually identical for Mac and PC. I'll use the TIFF file example again.

1. In Acrobat, go to the File menu and choose Import. . .
2. In the dialog box, select your TIFF file and click OK. That will turn your TIFF image into a PDF image. You still need Acrobat to "read" it and convert the pictures of letters into actual text letters.
3. Go to Tools > Paper Capture > Capture Pages. This will give you a couple of choices. The PDF Output Style you want is "Original Image with Hidden Text." Click OK and you can select which pages to OCR (all, current, or a page range). Click OK again, and the OCR engine fires up. Go make tea. (Although, since you're doing a maximum of 50 pages, this won't take all that long.)

Don't forget to do File > Save. You now have a word-searchable document.

Here is a quick war story/workaround for the old 50 page limit. With the advent of Acrobat 6, I can't think of any reason why an attorney that has the need to OCR even a handful of scanned documents shouldn't upgrade. I actually used this method, but I wasn't billing anybody by the hour. . . and there weren't any alternatives ready to hand. I also did it on a Mac, so I managed to use AppleScript to automate some of it. (Which is the full extent of my programming ability.) I have no idea how you would do it on a PC. If you have the ability to automate this on a PC, get a quick programming moonlighting job and use the money to upgrade to Acrobat 6 or one of the other commercial programs...

If you have a big TIFF file you need to convert, get one of the excellent shareware image-handling programs that will allow you to select 50 pages/images at a time, and split the big file into smaller ones. Create as many 50-page files as it takes. Open, convert, and OCR all of those files (that's where AppleScript comes in handy). Name them something like File 1, File 2, etc., or you're going to get mightily confused. Run the Batch Process that creates thumbnails for all of the files. Now, open File 1 in Acrobat, open the thumbnail pane and pull it all the way across the screen so you only see thumbnails. Navigate to the last page. Use the various "add pages" or "append pages" commands to stick the next 50 pages into your PDF. Rinse and repeat as necessary. Save, and voila! You've got a great big OCR'd PDF.
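The bookkeeping for that split is easy to get wrong by hand, so here is a minimal Python sketch of just the planning step. It does not drive Acrobat, AppleScript, or any image program; the function name and the "File N" naming scheme are assumptions made for the illustration. It only works out which pages belong in which 50-page file, in the order they should be appended.

```python
def plan_batches(total_pages, limit=50):
    """Split pages 1..total_pages into consecutive batches of at most `limit`.

    Returns a list of (file_name, first_page, last_page) tuples, with page
    numbers 1-indexed and inclusive, in the order they should be appended.
    """
    if total_pages < 0:
        raise ValueError("total_pages cannot be negative")
    if limit < 1:
        raise ValueError("limit must be at least 1")

    batches = []
    for number, first in enumerate(range(1, total_pages + 1, limit), start=1):
        last = min(first + limit - 1, total_pages)
        batches.append((f"File {number}", first, last))
    return batches


# Example: a 230-page TIFF under the old 50-page Paper Capture limit.
for name, first, last in plan_batches(230):
    print(f"{name}: pages {first}-{last}")
# File 1: pages 1-50
# File 2: pages 51-100
# File 3: pages 101-150
# File 4: pages 151-200
# File 5: pages 201-230
```

Appending the files in the listed order after OCR puts the pages back in their original sequence.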

Now that I look at this kludge, it makes me want to weep . . . (however, one does what one can with the available tools). There is really no longer any good reason to go to those lengths because Acrobat 6 and other PDF creation and OCR tools are widely available.

Hope this helps those who are still using the older versions.

-- Dave

09:24 AM in OCR/Paper Capture | Permalink | Comments (1) | TrackBack

April 14, 2004

OCR Tutorial for Acrobat 6

In going back through the comments and questions, I see that one area that concerns many people is how to use the OCR (Optical Character Recognition) abilities of Acrobat. Here's an overview, and I'll try to deal with other OCR issues very soon.

When you get a document that has been scanned, rather than exported from the software that created it, such as MS Word, it's just an image, i.e., a picture. Remember, to a computer, a picture of the letter "A" is not the same as the text character "A," so when you try to text-search an image, you get no hits because there's no text to search. Typical scanned litigation documents are in the TIFF (image) format. (There are also many software and hardware packages that scan paper directly into PDF. For now, I'm not going to address using Acrobat or other tools as the scanning software. For our purposes today, let's just say "you've got those image files that you want to convert into something you can search.")

The unique thing about PDF is that you can have an exact image of the document, plus the text, plus all kinds of metadata ALL IN ONE FILE. This is a wonderful thing -- but I will expound on its wonderfulness later.... With the "Paper Capture" tools in Acrobat, the software reads the picture, and figures out what the text is. So while you still see the "image," the software can also read the underlying text. OCR is not perfect, and it works best on first generation, laser printed images (just like your eyes do). In the past decade, however, OCR technology has gotten surprisingly accurate.

A couple of key points here. First, this discussion applies only to Acrobat, not to Reader. Second, prior to Acrobat 6, Adobe allowed you to perform "paper capture" with Acrobat only up to 50 pages. If you have Acrobat 4 or 5, you've got a 50-page limit (although, of course, there are ways to work around it). I think that Adobe still offers the Capture Server product for large scale scanning and OCR work. It's meant for use in a high-volume production environment, such as a litigation support vendor. In my experience, in government at least, people were leery of using it because you paid by the page. That is, you could buy a 100,000 page license and then you have to fill 'er up again for the next 100,000. Acrobat 6 Professional allows you to "capture" or OCR large documents without buying the separate server, but is still not truly a substitute for industrial strength tools in a production environment. It is, however, capable of a surprising level of automation, and as far as I can tell, it's not dumbed down in its character recognition capabilities.

So here you are with a big old TIFF file. Or, if you are like me and occasionally have opposing counsel that just wants to jerk your chain, a PDF file that was produced in "image only" format from MS Word and contains no text.

In Acrobat 6, go to File > Create PDF > From File and select the TIFF file that you want to convert. That brings your image into the PDF format, but still doesn't make it word-searchable. [Note that you can also choose “From Multiple Files” if you want to do a batch. I’ll do a blurb on batch processing OCR in a later post.]

Now, go to Document > Paper Capture > Start Capture. The dialog that comes up gives you some choices. You can do a page, all pages, or a range (which might be a good choice if you have, say, a few pages of text followed by lots of charts). Be sure to click the “Edit” button to see the other things you can do, like select English as the recognized language. The PDF Output Style choice you probably want is “Searchable Image (Exact).” As a rule, I wouldn’t do any downsampling of the image, although this might reduce the size of the resulting file.

Click OK, and the OCR engine will start up. If you are running a normal Windows box of moderate memory and processor speeds, pretty much every other process will choke while Acrobat reads the document and converts the pictures of letters into text letters. If it's a heavily formatted, 1000 page document, go have lunch or save it for the end of the day because this is going to take a while. Adobe does provide a process window that keeps you apprised of events.

When it's done, don't forget to File > Save the document. And there you have it. (At this point, I always like to do a little test by running a quick search on a word that I see on the first page. It just makes me feel better to know that it worked. I also have a continuing dialogue about what to do with the original TIFF file...)

As I said, if your image file is from a laser printed copy, and it's a decent scan, the OCR accuracy is amazingly good. But it may have garbled some words, so if you want to get really fancy, go back to Document > Paper Capture and select "Find first OCR suspect" or “find all OCR suspects.” This identifies characters that the OCR engine had problems with, and gives you a chance to correct the text. You can fix the spelling if it's important to you --say for a proper name or term. That way you can be sure that the search software will find it. Otherwise, for a common word, I'd just save time and let it slide.

Hope this helps. Next up, batch processing OCR and a few of the subtle differences for those using Acrobat 4 or 5.

01:04 PM in Acrobat 6.0, OCR/Paper Capture, Workflow | Permalink | Comments (3) | TrackBack

November 04, 2003

Acrobat 6.0 does OCR quickly and effectively

Adobe has always called the OCR function in Acrobat "paper capture." And in previous versions of Acrobat this has been a weak function. For example, in Acrobat 5.0 you had to download a plug-in to do Paper Capture, and even then you could only OCR 50 pages at a time. And, if you did, you'd learn that the process was slow. I guess the idea was that if you wanted to do serious OCR you'd pay Adobe for the high-end Paper Capture product that they sell.

Well, I'm here to report that Acrobat Standard 6.0 does a great job with OCR/Paper Capture. I just OCR'd a 100 page deposition transcript (which was very good quality text) and Acrobat did the entire conversion process in about 4 minutes. I timed it at one point and it was converting pages at about a 20ppm rate, so it might actually have taken less time.

I have noted that this new version of Acrobat also converts TIFF files to PDF much quicker. So, despite my initial dissatisfaction with the radically revamped interface, I have to say this new version has some real power where it counts.

Update: I OCR'd a 600 page batch of documents, many of which were not very clean copies (i.e. the sort of documents that make OCR engines choke and sputter). I knew it would take longer than 20 ppm (and it did) so I set it to do the work as I was leaving for the day. When I returned it had hung up midway through, but it wasn't a big deal. I clicked the dialogue box that basically said "ignore this error in the future and keep working no matter what." It hummed along quite briskly and finished the task in about a half hour. The resulting file size was about 87 MB, but I ran a "Save As" and it compressed down to 26 MB. Very nice.
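As a rough check on the figures in this post, here is a small Python sketch of the arithmetic. The function names are made up for the example, and the inputs are the approximate numbers quoted above, so the results are ballpark values only.

```python
def pages_per_minute(pages, minutes):
    """Average OCR throughput for a finished job."""
    return pages / minutes


def minutes_needed(pages, ppm):
    """Estimated run time at a steady pages-per-minute rate."""
    return pages / ppm


def percent_smaller(before_mb, after_mb):
    """How much a file shrank, as a percentage of its original size."""
    return (before_mb - after_mb) / before_mb * 100


# 100-page transcript finished in about 4 minutes.
print(pages_per_minute(100, 4))        # 25.0 pages per minute on average

# The same transcript at the spot-checked rate of about 20 ppm.
print(minutes_needed(100, 20))         # 5.0 minutes

# 87 MB after OCR, 26 MB after "Save As".
print(round(percent_smaller(87, 26)))  # 70 (percent smaller)
```

A 4-minute run works out to about 25 ppm on average, a little faster than the 20 ppm spot check, and the "Save As" trimmed roughly 70 percent off the file.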

12:51 PM in Acrobat 6.0, OCR/Paper Capture | Permalink | Comments (4) | TrackBack