The process of making a scanned PDF searchable is often referred to as 'OCR', which simply means 'optical character recognition.' I don't typically OCR my office files, but I do OCR documents that are used in my cases. Why not OCR all of the documents that one scans? Quite simply it isn't worth the extra time it takes to run the OCR.
When I am scanning day-to-day stuff I want to get the documents digitized quickly and then toss the paper. If I had to OCR the stuff I scan every day it would make the process take at least 4 times longer. But with case documents I'm willing to OCR because (1) I tend to scan in large batches, as opposed to individual documents, and (2) the benefit of OCR is much more likely to be something I'll take advantage of, so the extra time it takes to get the documents digitized is worth it.
Of course, it's possible to batch OCR a bunch of PDFs at once. And if you want to do this I recommend Rick Borstein's excellent blog post on this subject. One thing that Rick's article doesn't cover is: what do you do if you want to have the batch process run automatically every night?
I'm not really sure, because I've never used any software to do this, but I can point to a couple of possible solutions (both of them Windows-only, and neither of them inexpensive): (1) Autobahn DX, which costs between $1,600 and $2,695 depending on which level you buy, and (2) File Convert, which has a $600 entry-level version.
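For readers comfortable with a little scripting, the nightly run could also be sketched in a few lines of Python. This is only a rough sketch under stated assumptions, not a tested product recommendation: it assumes the open-source `ocrmypdf` command-line tool is installed (it is not one of the products mentioned above), and the function names and folder layout are hypothetical. It looks in an "inbox" folder for PDFs that do not yet have a same-named copy in a "done" folder and runs OCR on each one.

```python
import subprocess
from pathlib import Path


def pending_pdfs(inbox: Path, done: Path) -> list[Path]:
    """Return PDFs in `inbox` that have no same-named file in `done` yet."""
    return sorted(p for p in inbox.glob("*.pdf") if not (done / p.name).exists())


def run_ocr(src: Path, dest: Path) -> None:
    """Make a searchable copy of `src` at `dest`.

    Assumption: the `ocrmypdf` command-line tool is installed and on PATH.
    Any other OCR command that takes an input and an output file could be
    swapped in here.
    """
    subprocess.run(["ocrmypdf", str(src), str(dest)], check=True)


def ocr_batch(inbox: Path, done: Path, ocr=run_ocr) -> list[Path]:
    """OCR every pending PDF and return the list of files processed."""
    done.mkdir(parents=True, exist_ok=True)
    processed = []
    for src in pending_pdfs(inbox, done):
        ocr(src, done / src.name)  # raises if the OCR command fails
        processed.append(src)
    return processed
```

Calling `ocr_batch(Path("Scans/inbox"), Path("Scans/searchable"))` from a small script (the folder names are placeholders) and scheduling that script with cron on a Mac or Task Scheduler on Windows would give the every-night behavior. Because files already present in the "done" folder are skipped, running it repeatedly does not redo earlier work.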
If any of you have addressed this issue and have suggestions I'd love to hear them. And if anyone knows a Mac way of having OCR run in batch at regular intervals that would be appreciated as well.
Update: and if you are interested in how to OCR PDFs inside of a Portfolio, Rick Borstein has a great article on that as well.
First of all, thanks for a great blog.
Secondly, Readiris Corporate 12 for Mac claims to have this capability:
"Watched Folder
Efficient monitoring of a “watched folder” leads to round-the-clock production OCR: Readiris systematically executes the recognition of any image files dropped in a specific folder."
http://www.irislink.com/c2-1685-189/Readiris-12-for-Mac.aspx
And the cost of the program is apparently less than 500 USD.
Posted by: * | March 22, 2010 at 04:42 PM
Paper Jammed has an AppleScript written to batch-OCR (and a reader helped improve it). I'm sure that it could be tweaked to run on a schedule. Paper Jammed speaks highly of the ScanSnap.
http://paperjammed.com/2010/01/04/automate-scansnap-ocr-process-on-your-mac-with-applescript-snow-leopard-edition/
Posted by: L. Hernandez | March 23, 2010 at 03:12 PM
Thanks a lot for this article. I personally find a lot of information about OCR technology on www.ocrworld.com. They also have a forum and you can post your questions there.
Posted by: Nina | March 26, 2010 at 03:42 AM
The software product Hazel (http://www.noodlesoft.com/hazel.php) would be ideal for this.
There's a three-part series on how to use Hazel with iTunes and music downloads (http://www.mactalk.com.au/2010/08/25/organising-media-with-hazel-pt-3/) which looks like it could be adapted to achieve the 'watching' and automation.
Posted by: iWarwick | September 02, 2010 at 12:12 AM