Easy and cheap PDF Document Management (with OCR) on Mac OS X
I purchased a Canon CanoScan LiDE70 scanner today for $80, intending to set up a simple document management solution for all of my paper bills, statements, receipts, etc. My hope was to develop a workflow that makes it easy to scan a sequence of paper documents into a single PDF containing both the scanned images and the text of the document. I would then store the PDFs in Yojimbo -- the application I currently use (and like) for my document management. After a bunch of fiddling and a couple of "a ha" moments, I'm happy to report a very easy-to-use system that surpasses my expectations.
The Canon scanner comes with a (sadly PowerPC only) application called CanoScan Toolbox, as well as assorted other "shovel-ware" applications, including OmniPage SE OCR software. The first step is to install only the scanner driver and the CanoScan Toolbox, and to skip or throw away everything else -- even the OmniPage OCR software.
I then created a simple Automator workflow that takes a PDF file, applies a Quartz filter to reduce the image size and quality somewhat (adjust as necessary), and then opens the file with Yojimbo, which automatically imports it into its document library. Next, I edited the "PDF" button settings in the CanoScan Toolbox so that the scanned PDF file is opened with my workflow.
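For anyone who would rather script the hand-off than build it in Automator, here is a minimal Python sketch of the last step only: passing a PDF to Yojimbo with the macOS `open -a` command. This is not the workflow I actually built; the function names are made up for illustration, and the Quartz filter step is deliberately left out because I don't know of a reliable command-line equivalent.

```python
import subprocess
from pathlib import Path


def build_import_command(pdf_path, app_name="Yojimbo"):
    """Return the argument list that asks macOS to open a PDF with an app.

    `open -a <app> <file>` is the standard macOS way to open a file with a
    named application. The default app name is an assumption based on the
    setup described in this post.
    """
    path = Path(pdf_path).expanduser()
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"expected a PDF file, got: {path}")
    return ["open", "-a", app_name, str(path)]


def import_pdf(pdf_path, app_name="Yojimbo"):
    """Run the command built above (only meaningful on a Mac)."""
    subprocess.run(build_import_command(pdf_path, app_name), check=True)
```

Calling `import_pdf("~/Desktop/scan.pdf")` on a Mac with Yojimbo installed should have the same effect as the final action in the Automator workflow; the path here is just an example.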
I pushed the "PDF" button on the scanner, and my document was pretty quickly scanned, compressed, and imported into Yojimbo. But then I noticed something very cool -- the text in the PDF was selectable (!). Apparently the CanoScan Toolbox software has OCR built in -- browsing the package contents of the application bundle gives some hints that this is so. There are files in the package with names like "basicj.ocr" and "cocr_carbon.shlb" -- so I guess this is where the OCR is taking place. Unlike OmniPage, there are no options to tweak, but then again -- everything seems to work exactly as I want it.
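If you want to poke around the bundle yourself without clicking through Finder, something like the following Python sketch would list every file whose name contains "ocr". The function name is my own invention, and the bundle path in the usage note is a guess at where the Toolbox gets installed, so adjust it to match your system.

```python
import os


def find_ocr_files(bundle_path, needle="ocr"):
    """List files under bundle_path whose names contain `needle`.

    The match is case-insensitive and the returned paths are relative to
    bundle_path, sorted alphabetically.
    """
    matches = []
    for root, _dirs, files in os.walk(bundle_path):
        for name in files:
            if needle in name.lower():
                full_path = os.path.join(root, name)
                matches.append(os.path.relpath(full_path, bundle_path))
    return sorted(matches)
```

For example, `find_ocr_files("/Applications/CanoScan Toolbox.app")` (a hypothetical install location) should turn up names like the two mentioned above if they are present in your copy.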
So that's it -- in fact I think the same software suite comes with the $40 CanoScan LiDE25, so for about $50 you can have an amazingly easy to follow path to a paperless home or office.