April 14, 2004
OCR Tutorial for Acrobat 6
In going back through the comments and questions, I see that one area that concerns many people is how to use the OCR (Optical Character Recognition) abilities of Acrobat. Here's an overview, and I'll try to deal with other OCR issues very soon.
When you get a document that has been scanned, rather than exported from the software that created it (MS Word, for example), it's just an image, i.e., a picture. Remember, to a computer, a picture of the letter "A" is not the same as the text character "A," so when you try to text-search an image, you get no hits because there's no text to search. Typical scanned litigation documents are in the TIFF (image) format. (There are also many software and hardware packages that scan paper directly into PDF. For now, I'm not going to address using Acrobat or other tools as the scanning software. For our purposes today, let's just say "you've got those image files that you want to convert into something you can search.")
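To make the picture-versus-text point concrete, here is a tiny Python sketch. The 5x5 grid and the function name are made up for illustration; a real scan has far more pixels, but the idea is the same.

```python
# A crude 5x5 "scan" of the letter A: nothing but dark (#) and light (.) pixels.
scanned_a = [
    "..#..",
    ".#.#.",
    "#####",
    "#...#",
    "#...#",
]

# The same letter as actual text: a single character.
typed_a = "A"


def text_search(document, term):
    """Return True if the term appears as text in the document."""
    return term in document


image_as_data = "\n".join(scanned_a)
print(text_search(image_as_data, "A"))  # False: the "A" exists only as a shape
print(text_search(typed_a, "A"))        # True: the "A" is a real character
```

Your eye sees an "A" in the grid, but a search only compares characters, and the grid doesn't contain that character anywhere. OCR is the step that bridges the two.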
The unique thing about PDF is that you can have an exact image of the document, plus the text, plus all kinds of metadata ALL IN ONE FILE. This is a wonderful thing -- but I will expound on its wonderfulness later.... With the "Paper Capture" tools in Acrobat, the software reads the picture and figures out what the text is. So while you still see the "image," the software can also read the underlying text. OCR is not perfect, and it works best on first-generation, laser-printed images (just like your eyes do). In the past decade, however, OCR technology has gotten surprisingly accurate.
A couple of key points here. First, this discussion applies only to Acrobat, not to Reader. Second, prior to Acrobat 6, Adobe limited "paper capture" to 50 pages, so if you have Acrobat 4 or 5, you've got a 50-page limit (although, of course, there are ways to work around it). I think that Adobe still offers the Capture Server product for large-scale scanning and OCR work. It's meant for use in a high-volume production environment, such as a litigation support vendor. In my experience, in government at least, people were leery of using it because you paid by the page. That is, you would buy a 100,000-page license and then have to fill 'er up again for the next 100,000. Acrobat 6 Professional allows you to "capture" or OCR large documents without buying the separate server, but it is still not truly a substitute for industrial-strength tools in a production environment. It is, however, capable of a surprising level of automation, and as far as I can tell, it's not dumbed down in its character recognition capabilities.
So here you are with a big old TIFF file. Or, if you are like me and occasionally have opposing counsel who just wants to jerk your chain, a PDF file that was produced in "image only" format from MS Word and contains no text.
In Acrobat 6, go to File > Create PDF > From File and select the TIFF file that you want to convert. That brings your image into the PDF format, but still doesn't make it word-searchable. [Note that you can also choose “From Multiple Files” if you want to do a batch. I’ll do a blurb on batch processing OCR in a later post.]
Now, go to Document > Paper Capture > Start Capture. The dialog that comes up gives you some choices. You can do a page, all pages, or a range (which might be a good choice if you have, say, a few pages of text followed by lots of charts). Be sure to click the “Edit” button to see the other things you can do, like select English as the recognized language. The PDF Output Style choice you probably want is “Searchable Image (Exact).” As a rule, I wouldn’t do any downsampling of the image, even though downsampling might reduce the size of the resulting file.
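If it helps to see the page choices spelled out, here is a small Python sketch of the three options in that dialog. It is only a model of the idea, not anything Acrobat exposes; the function name and the "current" / "all" / "range" labels are my own.

```python
def pages_to_capture(total_pages, choice, first=None, last=None, current=1):
    """Return the 1-based page numbers an OCR pass would cover.

    choice is one of "current", "all", or "range" (labels are illustrative).
    """
    if choice == "current":
        return [current]
    if choice == "all":
        return list(range(1, total_pages + 1))
    if choice == "range":
        if first is None or last is None:
            raise ValueError("a range needs both a first and a last page")
        if not 1 <= first <= last <= total_pages:
            raise ValueError("range falls outside the document")
        return list(range(first, last + 1))
    raise ValueError("unknown choice: " + repr(choice))


# A 40-page exhibit: six pages of text followed by charts.
print(pages_to_capture(40, "range", first=1, last=6))  # [1, 2, 3, 4, 5, 6]
```

Capturing only the text pages skips the charts, where OCR has little to find and the most to garble.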
Click OK, and the OCR engine will start up. If you are running a normal Windows box of moderate memory and processor speed, pretty much every other process will choke while Acrobat reads the document and converts the pictures of letters into text letters. If it's a heavily formatted, 1,000-page document, go have lunch or save it for the end of the day because this is going to take a while. Adobe does provide a process window that keeps you apprised of events.
When it's done, don't forget to File > Save the document. And there you have it. (At this point, I always like to do a little test by running a quick search on a word that I see on the first page. It just makes me feel better to know that it worked. I also have a continuing dialogue about what to do with the original TIFF file...)
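Here is that little sanity check written out as a Python sketch. It assumes you already have the text of each page as a string from some extraction tool (which tool is up to you; I'm not showing one here), and the function name and sample text are invented for illustration.

```python
def pages_containing(page_texts, word):
    """Return the 1-based page numbers whose text contains the word.

    page_texts holds one string per page, in page order.
    The match ignores upper/lower case.
    """
    needle = word.lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.lower()
    ]


# An image-only file yields no text at all, so nothing matches.
before_ocr = ["", ""]
# After capture, the same pages carry text behind the image.
after_ocr = ["MOTION TO COMPEL discovery responses", "Certificate of Service"]

print(pages_containing(before_ocr, "motion"))  # []
print(pages_containing(after_ocr, "motion"))   # [1]
```

An empty result on a word you can plainly see on page one means the capture didn't take, or you saved the wrong file.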
As I said, if your image file is from a laser-printed copy, and it's a decent scan, the OCR accuracy is amazingly good. But it may have garbled some words, so if you want to get really fancy, go back to Document > Paper Capture and select “Find First OCR Suspect” or “Find All OCR Suspects.” This identifies characters that the OCR engine had problems with, and gives you a chance to correct the text. You can fix the spelling if it's important to you, say for a proper name or a key term. That way you can be sure that the search software will find it. Otherwise, for a common word, I'd just save time and let it slide.
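The "suspects" idea is easier to picture with a toy example. This Python sketch assumes each recognized word comes with a confidence score and that anything below a cutoff counts as a suspect. Acrobat doesn't show you scores like this; the class, the numbers, and the 0.80 threshold are all hypothetical.

```python
from typing import NamedTuple


class RecognizedWord(NamedTuple):
    text: str
    confidence: float  # hypothetical score from 0.0 (a guess) to 1.0 (certain)


def find_all_suspects(words, threshold=0.80):
    """Return every word the engine was unsure about."""
    return [w for w in words if w.confidence < threshold]


def find_first_suspect(words, threshold=0.80):
    """Return the first unsure word, or None if there are none."""
    return next((w for w in words if w.confidence < threshold), None)


page = [
    RecognizedWord("Plaintiff", 0.99),
    RecognizedWord("Srnith", 0.55),  # "Smith" misread: "m" became "rn"
    RecognizedWord("v.", 0.97),
    RecognizedWord("Acrne", 0.62),   # "Acme" misread the same way
]

print([w.text for w in find_all_suspects(page)])  # ['Srnith', 'Acrne']
print(find_first_suspect(page).text)              # Srnith
```

Both suspects here are proper names, which are exactly the ones worth fixing, since a search for "Smith" will never hit "Srnith".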
Hope this helps. Next up, batch processing OCR and a few of the subtle differences for those using Acrobat 4 or 5.
05:04 PM in Acrobat 6.0, OCR/Paper Capture, Workflow | Permalink
Comments
As a newbie to Acrobat scanning docs for archiving, this information was extremely useful.
Thanks
Posted by: Rob Pilgrim | Mar 29, 2006 9:27:39 AM
Thanks for your info on how to do Text Capture for OCR. It was very helpful, and it saved me the $40 that Adobe charges for the same answer.
Posted by: Bharrington | Sep 19, 2006 3:21:18 PM
Thanks! This was SO useful.
Posted by: elsa | Sep 21, 2006 5:56:14 PM




