March 19, 2003

Martin's being cunning with search again.

I'm continually amazed at some of the cleverness that Martin 'currybet' Belam gets up to on his trawls through BBC search data. He's recently been working on identifying real names within BBC search terms.

He's got some interesting thoughts on algorithms, but seems stymied by issues of context. I'm concerned that he seems to be building vast lookup tables to solve these problems. Surely, if you're trying to validate search terms, the best thing to judge them against is the data you're searching? It's a huge statistical sample, so it can tell you whether you should be grouping 'William', 'S' and 'Burrows' together in a search, simply because those words occur next to each other in the corpus being searched more often than you'd expect by chance.
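To make that a bit more concrete, here's a rough sketch in Python of the sort of thing I mean. It is not Martin's method, and nothing here comes from the BBC's systems: the function names, the toy corpus, the threshold and the choice of pointwise mutual information (PMI) as the score are all my own assumptions for illustration. The idea is just to count words and adjacent word pairs in the corpus, then glue neighbouring search terms together whenever the pair turns up more often than chance would suggest.

```python
import math
from collections import Counter


def ngram_counts(docs):
    """Count single words and adjacent word pairs across the corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information of the adjacent pair (a, b).

    Returns -inf when the pair never occurs in the corpus.
    """
    pair = bigrams[(a, b)]
    if pair == 0:
        return float("-inf")
    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())
    p_pair = pair / total_pairs
    p_a = unigrams[a] / total_words
    p_b = unigrams[b] / total_words
    return math.log2(p_pair / (p_a * p_b))


def group_terms(query, unigrams, bigrams, threshold=1.0):
    """Merge neighbouring query terms whose PMI meets the threshold.

    The threshold of 1.0 is an arbitrary illustrative value.
    """
    tokens = query.lower().split()
    if not tokens:
        return []
    groups = [[tokens[0]]]
    for prev, cur in zip(tokens, tokens[1:]):
        if pmi(prev, cur, unigrams, bigrams) >= threshold:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return [" ".join(group) for group in groups]


# Toy corpus standing in for the documents being searched.
corpus = [
    "william s burrows wrote naked lunch",
    "the novelist william s burrows lived in tangier",
    "news about the weather in london",
    "the weather news today",
]
unigrams, bigrams = ngram_counts(corpus)
print(group_terms("William S Burrows weather", unigrams, bigrams))
```

On a corpus this small almost any pair that appears at all will clear the threshold, so a real version would want minimum counts and a properly tuned cut-off. The point is only that the grouping falls out of the searched data itself, with no lookup table of names.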

As far as I'm concerned, there's only one 'special case' that algorithms like this should have - for the band that committed internet suicide - 'The The'. That name must have seemed pretty clever before search engines came along and declared the whole of it redundant data.
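For the curious, that special case might look something like the sketch below. It assumes the search engine throws away common 'stop words' before matching, which is how a name like 'The The' ends up as an empty query. The stop word list and the function name are made up for the example, not taken from any real engine.

```python
# Hypothetical stop word list; real engines use their own, usually longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and"})


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Drop stop words, unless that would leave nothing to search for."""
    tokens = query.lower().split()
    kept = [t for t in tokens if t not in stop_words]
    # Special case: a query made entirely of stop words is kept as typed.
    return kept if kept else tokens


print(strip_stop_words("the history of jazz"))
print(strip_stop_words("The The"))
```

The fallback is the whole trick: if filtering empties the query, search for what was actually typed.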

Posted by Tom Dolan at March 19, 2003 12:33 AM
