Many moons later, Desktop Search is coming back to the forefront. I resisted a long time, but was finally interested due to the Indexer service in Windows XP and due to having a huge number of files and data in my life, a result of running a bunch of servers.
First the niceties: Despite having made much of my living over the years on non-Microsoft technologies (BSD, Linux, Java, Palm...), I quite like Microsoft products. Not, perhaps, compared to what they could be, but they tend to win not just on marketing but by being more than just "good enough". Microsoft Word, Excel, SQL Server and Visual Studio are all easy to use for the basics and very performant and powerful. Windows XP itself is a lot easier to set up than Linux and easier to dive right into. It's not as easy to customize in the extreme, but it serves the bulk of users better. On the other hand, I also run a non-Microsoft mail server, non-Microsoft web servers, non-Microsoft DNS servers, and more. I'm not a pawn either way. So I started looking around when I discovered that the Indexing Service:
- Mostly works
- Is hard to use for boolean queries
- Requires iFilter DLLs that simply aren't available to expand to my needs
My requirements are:
|Index Configuration||Indexes get big. I really am from the old-school of computer set-up; I put data on a different partition or drive from the system. My system drive isn't big enough for huge indexes. It doesn't need to be that big; data (and temp files) go elsewhere.|
Simple requirement. Or so I thought.
|Most indexers and desktop search programs support email. By which they mean "Outlook". At this point, if you've read my previous blog entries, it won't surprise you to learn that I don't run Outlook. I run Thunderbird. With messages imported from Eudora when I ran that. So I need Thunderbird indexing.|
|Zip files||I set up scripts to automatically zip data from my servers and FTP it to a local external hard drive on my desktop. (I was using gz, which again tells you my basic philosophy, but Windows Explorer can look inside Zip files; it doesn't do .gz files.) Zipping saves transfer time not only through compression but, more dramatically, by reducing the file count. Being able to find content that needs updating by doing a quick search through these, rather than switching over to the servers (which aren't indexed for performance reasons), would be beneficial.|
|MP3, JPEG||Unlike most users of digital cameras and MP3 files, I use my headers. Some of my MP3s (which are mostly from CDs I own) have extensive ID tags denoting albums and such, and perhaps 2% have lyrics in them. Many of my JPEGs have not only my Canon camera data but also IPTC comments describing them. Those then work with the gallery-creating program I wrote, which automatically captions photos on upload.|
|PDF Files||Many of my sources wind up as PDFs, but more importantly, lots of APIs (example: The Palm Programming manuals) are distributed as PDFs. As are many manuals. What's the point of indexing data if I can't get those?|
|Booleans and Limits||This was part of why I started the search. What if I want to search for messages with, in Google terminology, "PALM AND SOUND AND NOT MP3", only across PDF files?|
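To make that last requirement concrete, here is a minimal Python sketch of what such a query means over a toy in-memory collection. The `docs` sample data and the `tokenize`, `matches` and `search` helpers are all hypothetical names made up for illustration; nothing here is meant to describe how any of the products reviewed below actually works.

```python
import re

def tokenize(text):
    """Split text into a set of upper-cased words."""
    return set(re.findall(r"[A-Za-z0-9]+", text.upper()))

def matches(words, all_of=(), none_of=()):
    """True if every term in all_of is present and no term in none_of is."""
    return all(t in words for t in all_of) and not any(t in words for t in none_of)

def search(docs, all_of=(), none_of=(), extension=None):
    """Return the names of documents that satisfy the boolean query.

    docs maps a file name to its extracted text (made-up sample data).
    extension, if given, limits the search to one file type.
    """
    all_of = [t.upper() for t in all_of]
    none_of = [t.upper() for t in none_of]
    hits = []
    for name, text in sorted(docs.items()):
        if extension and not name.lower().endswith(extension.lower()):
            continue
        if matches(tokenize(text), all_of, none_of):
            hits.append(name)
    return hits

# Hypothetical documents standing in for already-extracted file contents.
docs = {
    "palm_sound_api.pdf": "Palm OS sound manager reference",
    "palm_mp3_player.pdf": "Playing MP3 sound on Palm devices",
    "notes.txt": "palm sound experiments",
}

# PALM AND SOUND AND NOT MP3, only across PDF files
print(search(docs, all_of=("PALM", "SOUND"), none_of=("MP3",), extension=".pdf"))
```

A real indexer would look terms up in an inverted index rather than re-tokenizing every document per query; the point here is only the semantics of AND, NOT and the file-type limit.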
The Tests
The tests I set up and ran were:
- Find an email in Thunderbird by subject
- Find an email in Thunderbird by content
- Find content inside a ZIP file
- Find a song (MP3) by lyric excerpt from the Lyrics tag in the header
- Find a song (MP3) by ID3 v2 Comment tag content
- Find a song (MP3) by Album from the ID3 v2 tag (added after the previous two tests failed)
- Find a photo (JPEG) by IPTC Comment content
- Find a photo (JPEG) by EXIF Owner field (added after the above test failed)
- Find a PDF by text excerpt (a Palm API was used)
- Text search using boolean (AND, OR)
- Text search using entire phrase (to ensure documents with the words out-of-order don't match)
- Supports limiting searching by file type - search only MP3s or JPEGs for relevant tests above
- Source code searching (as installed)
- Palm Address Book and Memo Pad searching
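For reference, the two Thunderbird tests can be approximated by hand, since Thunderbird keeps each mail folder as an mbox file. The sketch below uses Python's standard `mailbox` module; the `find_messages` and `body_text` helpers and the sample messages are made up for illustration, and the script builds a throwaway mbox file rather than assuming where a real profile lives.

```python
import mailbox
import os
import tempfile

def body_text(msg):
    """Concatenate the text/plain parts of a message into one string."""
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)

def find_messages(mbox_path, subject=None, content=None):
    """Return the subjects of messages whose Subject and/or body contain the given text."""
    hits = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            subj = str(msg["Subject"] or "")
            if subject and subject.lower() not in subj.lower():
                continue
            if content and content.lower() not in body_text(msg).lower():
                continue
            hits.append(subj)
    finally:
        box.close()
    return hits

# Hypothetical sample mail, written to a temporary file so the sketch is runnable.
SAMPLE = (
    "From alice@example.com Thu Jan  1 00:00:00 2004\n"
    "From: alice@example.com\n"
    "Subject: Palm sync problem\n"
    "\n"
    "The HotSync cradle stopped working.\n"
    "\n"
    "From bob@example.com Thu Jan  1 00:00:01 2004\n"
    "From: bob@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "Pizza on Friday?\n"
)

sample_path = os.path.join(tempfile.mkdtemp(), "Inbox")
with open(sample_path, "w", newline="\n") as fh:
    fh.write(SAMPLE)

print(find_messages(sample_path, subject="palm"))   # search by subject
print(find_messages(sample_path, content="pizza"))  # search by content
```

This is a linear scan, not an index, so it says nothing about speed; it only shows that the data needed to pass these two tests is readily accessible.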
|Test||Yahoo!||Google||Copernic||Microsoft|
|Thunderbird email||No support. Doesn't even scan them.||Passes. Does index and display appropriately, but searching just email, or by field, does not appear to be supported.||Perfect. Can easily limit search to just email, search by content or by Subject or Date.||Fail; doesn't even seem to be able to do a brute-force text search on them.|
|Zip File||Pass||Fail. Yep, Fail. Not a speck of success. Turns out you need to add a plug-in for this.||Fail. Complete failure. Ironically, they don't even claim to support ZIP, but do slightly support GZ and RAR.||Pass. No problems at all.|
|MP3 Lyrics||Fail||Fail||Fail||Pass. Yep, only Microsoft found the correct file. In fairness, I did install MP3Filter.dll the day before running these tests. But that's not even available for the others, and I don't know if it simply searched the entire file or if it knows the format. Either way, Microsoft won without Desktop
|MP3 Album||Pass||Fail||Pass||Pass, see above|
|MP3 Notes||Of these four, only Microsoft could display the Lyrics appropriately. Not in the Search, but going to the results, in Properties - Summary, the Lyrics, Album and all other tags show appropriately in Explorer. Amazing, and even more so that the other three cannot do this even as theoretical enhancements over the operating system!|
|JPEG - IPTC||Fail||Fail||Fail||Fail|
|JPEG - EXIF Owner||Fail||Fail||Fail||Fail|
|JPEG - EXIF Camera Model||Fail||Fail||Fail||Pass. Yep, again only Microsoft looks in the headers.|
|Text Boolean||Pass||Fail - no OR support||Pass||N/A|
|Limit by Type||Pass||Fail - it's there but not via U.I.||Pass||Pass|
|Source Code||See Caveat||Fail||Pass||Pass|
|U.I. Speed||Slow||Fast||Fast||Medium, depending on search type|
|U.I. Power||Second Best||Worst||Best||Third Place|
- Yahoo! includes the file path in the search. So if you're searching for a file with, for example, "WAP" in it and you have a folder called "Swap Meet", everything in that folder will match. Very inconvenient.
- Google may have been impacted by their dreadful interface. I never did determine (though I didn't spend hours on it) how to ensure that it had indexed my media files. There just doesn't seem to be any option for controlling which disks it is looking at.
- Some of the Microsoft searches were far slower than the others, largely because they were done on-the-fly rather than from index. Of course, they also succeeded where the others failed.
- The EXIF and IPTC headers were clearly a bit beyond what anyone expected. But they are found by grep and by many imaging tools I use. Yahoo! and Copernic claim to index MP3 or JPEG metadata; they just don't appear to do it in reality. This may be a matter of definitions; they may be defining date, file size and image size as the metadata that they index. Which would be technically accurate, but certainly not complete enough for anyone who actually uses the headers. In fact, Copernic claims:
  Music: Full metadata indexing of iTunes, MP3, OGG, WMA and WAV music files.
  Pictures: Full metadata indexing of EXIF, JPEG, GIF picture files.
  which obviously is not true. (I have an email in to them about this; if they respond, I will update this page.)
- Source Code Listing: I was hoping for Doxygen/JavaDoc style parsing. None delivered. As installed, Copernic indexed source code files and Microsoft was able to find them via a search. Yahoo! supports adding the source extensions, which presumably would have caused it to pass.
- Regardless of where installed, all of these default to putting the Index in the C:\Documents and Settings area. I don't tend to believe in that; I prefer my data to be in a different area. Google's index could not be moved; the others could.
- Not tested but noted: Google can run in Opera, but brings up Internet Explorer regardless of what the default browser setting is. Meanwhile, Copernic recognizes IE and Firefox, but not Opera, for history scanning. Which is fine by me; I don't want my browse history scanned.
- There were dramatic differences in how indexing occurred. Yahoo! put the list of files together fastest, but did not index the contents until later. Copernic was the first to index all of them. Google indexed fewer than I expected in total.
- Google Desktop creates its own web server (port 4664 on my system) for the interface. The U.I. is very consistent with Google on the web, and results are displayed via IE (even if Opera is the default) even when queried from the Google desktop bar.
- Index Size: I confess to being confused as to how Google could simultaneously have the worst results, the fewest file types scanned and the largest index (by quite some margin).
>grep "running out" *.mp3
Blondie_-_11-59.mp3 5 9:Time is running out.
Blondie_-_11-59.mp3 23 9:Time is running out.
Well, it's obviously there, and in plain-text. And the IPTC Comment in JPEG files is successfully searched also. Of course the output speed and formatting leave something to be desired, but still, the search is successful. Pathetic that these pups couldn't parse it.
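The same brute-force check is easy to reproduce without grep. The following Python sketch scans the raw bytes of every file matching a pattern; the `grep_bytes` helper and the demo file are hypothetical, and the demo file is a made-up byte string, not a valid MP3. It only finds text that is stored uncompressed in a single-byte encoding, which is an assumption: tag text can also be stored as UTF-16, and a search like this would miss it.

```python
import glob
import os
import tempfile

def grep_bytes(pattern, needle):
    """Yield (file name, byte offset) for each occurrence of needle in files matching pattern."""
    for name in sorted(glob.glob(pattern)):
        with open(name, "rb") as fh:
            data = fh.read()
        start = data.find(needle)
        while start != -1:
            yield name, start
            start = data.find(needle, start + 1)

# Build a fake file with some binary junk around a plain-text lyric.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "demo.mp3"), "wb") as fh:
    fh.write(b"ID3\x03\x00" + b"Time is running out." + b"\xff\xfb\x90" * 4)

for name, offset in grep_bytes(os.path.join(demo_dir, "*.mp3"), b"running out"):
    print(os.path.basename(name), offset)
```

Reading whole files into memory is fine for a handful of songs or photos; a large collection would call for reading in chunks, which is exactly the sort of work an indexer is supposed to save.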
Performance
All three of these systems hammer the system pretty hard during normal use. Having all three running at once is definitely not a good idea; you can watch them sucking down the processor on the Task Manager. They seem about equivalent to an active virus scanner. Uggah! It's so bad that I wound up turning them off simply to enable any real processing. (Although with only one running it probably wouldn't have seemed nearly as bad.)
Screen Shots
Two sets of screen shots here. The first set is the results of a search for a well-populated MP3 file. The interesting thing here is that Yahoo! provides the best data back on the song. (Google, as reported earlier, cannot parse it at all.)
Yahoo!
Yahoo! Desktop did the best at displaying the MP3 hit. Remember it didn't pass the tests, but apparently some fields are parsed.
Yahoo! failed the email tests completely, so no screen-shot for that!
Copernic
Copernic did almost as well as Yahoo! on music display:
And Copernic also did fantastic on the email!
Google did quite well on email headers and email display, even if it's a bit less convenient than Copernic to go through them.
Summary
Right now, the whole Desktop Search niche is just not ready for prime time. All three of the programs I downloaded oversold and underdelivered. Of the three, Google was clearly the loser. Copernic's interface and speed (lack of hanging) are nicer than Yahoo!, but Yahoo! supports Zip files while Copernic doesn't. Meanwhile, Copernic supports Thunderbird email, while Yahoo! doesn't. So of these three, Copernic is my winner by a nose, figuring I need email indexed a lot more than I need Zip files indexed.
Ironically, Microsoft Explorer (with Indexing) did just as well as they did at word processing files. And, to top it off, iFilter add-ins can easily extend Microsoft's capabilities; it's probably only a matter of time before someone, perhaps me, creates iFilters for the real ID3 tags, IPTC, EXIF and Thunderbird mbox mail. Which in turn means that I will probably be moving towards the Microsoft Desktop Search not because I consider it the best right now, but because I can make it the best for me, something I cannot do with the others.
That's an odd way to win a war.