Around ten years ago, I got my first scanner. I think it was an HP Officejet V40. I remember it fondly, because it had a straight-through sheetfed scanner, which would happily feed through almost any office document – including card, and even stapled sheets – unlike my current Canon MX870, and most other modern All-In-Ones. From memory, it cost me thirty pounds second-hand.
The objective was to go paperless… I’d keep statements and receipts for one year only, and everything older I would scan and store on disk. [These days, I keep almost nothing on paper; everything goes from in-tray to scanner to disk (several of them!)]. But with that came a challenge – how would I find the documents I was looking for, when the time came?
After various tests, enter Deductus. It appeared to be a hobbyist project by a guy who wrote various other disparate coding projects. It differentiated itself, because you could index offline disks – network shares, DVDs and CDs – and keep just the index online. It would even ask you to insert the disk so you could view the document that it found for you.
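To make the offline-index idea concrete, here is a minimal sketch in Python. It is not how Deductus is actually written (I have no idea what its internals look like); the names `OfflineIndex`, `add_document`, `search` and `locate` are all made up for illustration. The point is simply that the index remembers which disk each file lives on, so a search can run with the disk unplugged and only prompt for it when you want to open the result.

```python
import re
from collections import defaultdict


class OfflineIndex:
    """Hypothetical word index that stays online while the disks do not."""

    def __init__(self):
        # word -> set of (volume label, path on that volume)
        self.postings = defaultdict(set)

    def add_document(self, volume, path, text):
        # Record every word of the document against its disk and path.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[word].add((volume, path))

    def search(self, word):
        # Only the in-memory index is consulted; no disk access is needed.
        return sorted(self.postings.get(word.lower(), set()))


def locate(hit, mounted_volumes):
    """Say how to reach a hit: open it, or ask for the disk it lives on."""
    volume, path = hit
    if volume in mounted_volumes:
        return f"open {path}"
    return f"please insert disk '{volume}' to view {path}"
```

Indexing a receipt with `add_document("Lifestore", "receipts/phone.pdf", "Apple iPhone 4S receipt")` and then calling `search("iphone")` returns the hit even when "Lifestore" is not mounted; `locate` then asks for the disk.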
So… in those ten years, a lot of search software has come and gone, particularly with the Desktop Search Boom, where every internet search company bought a desktop search startup (I used Yahoo’s free labelled version of X1, then X1 Pro itself). All those fizzled or diverged, then came the boom of Web 2.0 and cloud storage, and now… Well, where are we now?
Well, Lifehacker recommend… not much: Launchy, Everything, or Windows Search (MS finally getting reasonable indexed search for Outlook and the file system also sounded a death knell for other Desktop Search providers). This page also recommends Copernic (the current fave), and others like Agent Ransack and DocFetcher. Locate32 appears to be a big fave, and is the closest to what Deductus is.
But Deductus seems to still be the best!
This is the Deductus Index status for my ‘Lifestore’.
It stores all the files I’ve ever collected – back past Doom on HD floppy, and including TIF survey maps of Italy, VMs from various projects, over 15,000 photos and videos, and PDFs, PDFs, PDFs of every piece of paper I’ve received in the last 10 years. This is the largest of three disks I have indexed – 414,000 files totalling 465GB, of which 170,000 have their contents indexed (i.e. they are a supported filetype). That still includes PDF, but sadly not Office 2010+, or RAR5/7ZIP archives (a reason I don’t use them).
OK – 450GB of data. I’m looking for the receipt for my iPhone 4S – is it still in warranty?
OK – found it. In TWO SECONDS, I see I have a receipt from 2011. So, no, it’s over 2 years old.
How about the last thing I have with “iPhone” anywhere in the document?
OK – I bought a Lifeproof case at the start of last year. Damn, it’s also out of warranty (the microphone seal has detached from the case).
That also took TWO SECONDS. That’s to find an instance of the word “iPhone” inside an OCR’d PDF of a receipt, across half a terabyte of data that’s stored on an external drive not even plugged in, in a program written by a young guy probably still in college, 5 years before anyone had even heard of “big data”.
True – it took 6 hours to index all that data. But I just update it when I batch-update the contents of the disk, every six months or so.
Oh – and how about system requirements? Do I need to install Hadoop?
Nope. 3MB of installation space, and at least 32MB of RAM (i.e. total installed RAM) on a Pentium II CPU.
And there’s more
Finally – as if this weren’t enough, the author wrote a web app that could use the index, so users don’t even need to install the 3MB application.
So… where did I put that copy of DOOM?
OK, there’s the directory. Took 0.228s to return that 19-year-old result out of the 0.5TB, running on a single-core VM on my microserver…
….and it’s in a zipfile