We recently received a message asking "what's the best tool to do text searches of PDF files, regardless of whether they are text on image or not?"
In my most lawyerly voice, I reply, "it depends." I would like to cover the key aspects of this in more than one post, because, like everything related to PDF, there's a lot lurking beneath the surface.
First, some distinctions and definitions. There is a crucial difference between "text" and "image." If you are talking about documents that were created directly from another program (like MS Word) you don't really have to worry about separating those concepts. (Well, not yet . . .) However, if you are dealing with pages that were scanned from hard copy, you've got to do some conceptual work. Second, there is a difference (in Acrobat) between the SEARCH command, and the FIND command.
Layers
It's easy for most people to visualize that PDFs, like ogres and onions, have layers. For now, we'll deal with only 3 of them. First, the "image" layer. This is a picture -- like a TIFF image.
Key concept: YOU CAN'T SEARCH AN IMAGE ALONE.
Second, there is the "text" layer.
Key concept: SEARCH ENGINES SEARCH TEXT.
If you just have "image" PDFs, you'll need to create the text in order to have something to search -- the process is called "Optical Character Recognition" or OCR. This is the same process that many lawyers are familiar with in the scan to TIFF/OCR/put it in Summation loop. Keep in mind that OCR is not even close to perfect, and scanned PDFs are subject to the same errors as any other OCR'd document.
There are about a zillion ways to create an "image + text" PDF file. One is to use the Acrobat "paper capture" tool. With the full version of Acrobat, you can take an existing image file or image-only PDF and "capture" the text. Note that you are limited to 50 pages per document with Acrobat alone. (You can't do "capture" with Acrobat Reader.) There are separate "Acrobat Capture" products for high-volume scanning from vendors including Adobe and Doculex. I'll spend more time on the "capture" issues in a later post. Probably the easiest way to handle it is to take your CD-ROMs of images down to the local service bureau and have them do the conversion. Negotiate the price -- "per page" pricing is the standard (I assume because it's the lazy way), but it doesn't make sense as a pricing model.
Okay, so you have Image (not searchable) and Text (searchable) layers. The third layer is the "metadata" layer. This contains info like the author, date, and (very important) keywords that you assign to the document.
Key concept: The Metadata layer is SEARCHABLE.
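To make the three layers concrete, here is a minimal Python sketch. It is a toy model, not a PDF parser: the PdfDocument class, the is_hit function, and the file names are all made up for illustration, and real PDFs store these layers very differently. The point is only that a search can see the text and metadata layers, and never the image.

```python
from dataclasses import dataclass, field


@dataclass
class PdfDocument:
    """Hypothetical toy model of the three layers (not a real PDF structure)."""
    name: str
    image: bytes                                   # the picture of the page; opaque to search
    text: str = ""                                 # the text layer; empty until the page is OCR'd
    metadata: dict = field(default_factory=dict)   # author, date, keywords


def is_hit(doc, term):
    """Return True if term appears in the text layer or the metadata layer."""
    term = term.lower()
    if term in doc.text.lower():
        return True
    # The image layer is never consulted: there is nothing in it to match.
    return any(term in str(value).lower() for value in doc.metadata.values())


scanned = PdfDocument("scan.pdf", image=b"\x00\x01")
captured = PdfDocument("captured.pdf", image=b"\x00\x01", text="Invoice for widgets")
tagged = PdfDocument("tagged.pdf", image=b"\x00\x01",
                     metadata={"keywords": "invoice, widgets"})

hits = [d.name for d in (scanned, captured, tagged) if is_hit(d, "invoice")]
print(hits)  # ['captured.pdf', 'tagged.pdf']
```

The image-only scan never turns up, no matter what the page actually says. The other two are found, one through its captured text and one through its assigned keywords.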
Find v. Search
This is pretty easy. FIND (ctrl + F, or Edit >> Find) searches the text of the open document only. The FIND tool is on the toolbar -- it's the single set of binoculars. This is just like the FIND command in pretty much every Windows and Mac application.
SEARCH, on the other hand, searches a collection of documents. The SEARCH tool is the binoculars + sheet of paper (or whatever the heck that thing is) button. When you hit SEARCH, you fall into a deep pool of commands, indexes, catalogs, and advanced capabilities. Acrobat uses the Verity search engine, which has long been one of the industry-standard desktop (and server) search tools. (I assume Adobe is still using Verity in v. 6 -- I haven't investigated.) This is a full-on, Boolean, high-speed indexed search tool.
Search also allows you to search by keyword. (In a later post, I'll talk about assigning keywords, and using them effectively.)
One other key difference is that FIND just starts at the beginning of the document and searches page 1, page 2, etc. in order. SEARCH relies on an index, which makes it both faster and easier. Say you have a 1000-page document (what the hell are you thinking? Is this really 200 documents all rolled into one big file? Break them up and index them -- it's worth the trouble). FIND will take forever as you roll through in page order. SEARCH will give you a list of hits that will jump you right to your page. Much faster.
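For the curious, the difference between scanning in page order and looking things up in an index can be sketched in a few lines of Python. This is only an illustration of the general idea, not how Acrobat or Verity actually work internally; the sample pages and the find and build_index functions are invented for the example.

```python
from collections import defaultdict

# Three made-up "pages" of already-captured text.
pages = [
    "cover letter regarding the invoice",
    "invoice number 1001 for widgets",
    "shipping terms and conditions",
]


def find(pages, term):
    """FIND-style: walk every page in order, every time you look for something."""
    return [n for n, text in enumerate(pages, start=1) if term in text.split()]


def build_index(pages):
    """SEARCH-style: read every page once, up front, and note where each word occurs."""
    index = defaultdict(list)
    for n, text in enumerate(pages, start=1):
        for word in set(text.split()):
            index[word].append(n)
    return index


index = build_index(pages)

print(find(pages, "invoice"))        # [1, 2], after reading all three pages
print(index.get("invoice", []))      # [1, 2], from a single lookup
```

Both approaches return the same page numbers. The difference is when the work happens: FIND rereads the pages on every query, while the index pays that cost once and then answers each query with a lookup. That is why the indexing step is worth the trouble on a big collection.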
Search Tools
Pretty much every litigation support database (Summation, Concordance, etc.) will support PDF searches. But, using just the computer's file system and SEARCH, you can get excellent searching capabilities without any database at all. There are advantages to lit support databases, but they accrue mostly to litigation support firms that bill by paralegal-hour.
I'm also very excited about Apple's new Preview application that should be coming out soon with OS X 10.3 "Panther." Apple claims that it is the fastest PDF viewer ever. And the demo by Steve Jobs showed some very very cool search capabilities, which include a "hit list" in a "drawer" that slides out from the app. Stay tuned....
"Searching" Images
Having said all that, it is possible to look for documents or pages without searching the text. Your eye can certainly identify documents without reading them. Even a very small thumbnail will allow you to tell, say, an invoice from a letter. If you have to do it without text search, open your file up, and click the "Thumbnail" tab. (If there aren't thumbnails, create them.) Thumbnails don't have to line up single-file down the left side. Pull that divider bar all the way to the right side of the window. Depending on the size of your screen, you should get 8-10 little images in every row. Then start scrolling. . . Double-click a thumbnail to jump to that page.
More to come . . .
Dave