Bayesian Aggregators - Part 2

Jim O'Halloran • May 28, 2003


A couple of days ago, I blogged some thoughts on an aggregator using Bayesian filtering. Since Jeremy pointed to it on his blog, it's received quite a bit of attention.

Scott Johnson (of Feedster) notes that there are Bayesian libraries for Perl, which is great news because Perl would probably have been my language of choice. He also asked if there was anything Feedster needed to do to facilitate this. Kenneth asks in a comment on my original post how you'd go about finding new content for the aggregator. These questions are sort of related, so I'll try to address them both in one hit...

Once the Bayesian process is sufficiently trained, certain tokens (key words, sequences of words) become strong indicators of content I'm interested in (e.g. "v8 supercars" or "linux" would probably be a couple of mine). I was envisaging that once the aggregator was trained, along with pulling the feeds I've subscribed to off the web, it could query Feedster for these "interesting" keywords.
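To make that a bit more concrete, here's a rough sketch of how the "strong indicator" tokens might be pulled out of the training data and turned into a search. It's in Python purely for illustration (not the Perl I'd probably reach for), the scoring is a deliberately simplified per-token probability rather than a full Bayesian filter, and the function names, the search URL and the `q` parameter are all made up for the example; they aren't Feedster's actual interface.

```python
from collections import Counter
from urllib.parse import urlencode


def token_scores(interesting_docs, boring_docs, min_count=2):
    """Estimate P(interesting | token) from per-document token counts."""
    good = Counter(t for doc in interesting_docs for t in set(doc.lower().split()))
    bad = Counter(t for doc in boring_docs for t in set(doc.lower().split()))
    n_good = max(len(interesting_docs), 1)
    n_bad = max(len(boring_docs), 1)
    scores = {}
    for token in set(good) | set(bad):
        if good[token] + bad[token] < min_count:
            continue  # seen too rarely to trust
        g = good[token] / n_good  # share of interesting posts containing the token
        b = bad[token] / n_bad    # share of boring posts containing the token
        scores[token] = g / (g + b)
    return scores


def top_keywords(scores, limit=5, threshold=0.9):
    """Return the strongest indicator tokens, best first (ties alphabetical)."""
    strong = [(s, t) for t, s in scores.items() if s >= threshold]
    strong.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in strong[:limit]]


def search_url(keywords, base_url):
    """Build a search URL; base_url and the 'q' parameter are hypothetical."""
    return base_url + "?" + urlencode({"q": " ".join(keywords)})
```

The aggregator would call `top_keywords` after each round of training and fetch the resulting search feed alongside the ordinary subscriptions. Single words are used as tokens here; handling sequences of words like "v8 supercars" would need a tokenizer that also emits word pairs.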

I guess the beauty of it from your point of view is that I don't think there's anything we really need from the Feedster end... Because we're already set up to parse and display RSS, pulling Feedster results in RSS would be the logical thing to do, and Feedster already allows me to do this. The only thing that might be useful is if the Feedster results RSS had a reference to the source RSS feed for each article (that way the aggregator can start polling the source feed directly if it proves interesting, rather than hitting Feedster all the time). Unfortunately, I can't check whether that already exists or not at the moment. Otherwise, I reckon the Feedster side of things is pretty well set up for what we'd need.
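For what it's worth, RSS 2.0 already defines an optional `<source url="...">` element on each item that names the feed the item came from. Assuming the search results carried that element (which, as I said, I haven't been able to check), picking up the source feeds might look something like this sketch. The sample feed and the `source_feeds` name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample of what a search-results feed might look like.
SAMPLE_RESULTS = """
<rss version="2.0">
  <channel>
    <title>Search results</title>
    <item>
      <title>Kernel notes</title>
      <link>http://blog-one.example/kernel-notes</link>
      <source url="http://blog-one.example/index.rss">Blog One</source>
    </item>
    <item>
      <title>No source given</title>
      <link>http://blog-two.example/post</link>
    </item>
  </channel>
</rss>
""".strip()


def source_feeds(rss_text):
    """Map each item's link to the feed URL in its <source> element, if any."""
    root = ET.fromstring(rss_text)
    found = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        source = item.find("source")
        if link and source is not None and source.get("url"):
            found[link] = source.get("url")
    return found


feeds = source_feeds(SAMPLE_RESULTS)
```

Items without a `<source>` element are simply skipped, so the aggregator would keep relying on the search feed for those.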

Kenneth's idea of using the blogroll on the blogs we're currently subscribed to would work quite well also.
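One way this might work, as a rough sketch: count how many of the blogs I already read link to each blog I don't, and suggest the ones that turn up most often. How the blogroll links get scraped is left out, and the `blogroll_candidates` name and the `min_mentions` cut-off are just assumptions for the example.

```python
from collections import Counter


def blogroll_candidates(blogrolls, min_mentions=2):
    """Rank unsubscribed blogs by how many subscribed blogs link to them.

    blogrolls maps each subscribed blog to the list of blogs on its blogroll.
    """
    subscribed = set(blogrolls)
    mentions = Counter()
    for links in blogrolls.values():
        for url in set(links):  # count each blogroll once per target
            if url not in subscribed:
                mentions[url] += 1
    ranked = sorted(mentions.items(), key=lambda pair: (-pair[1], pair[0]))
    return [url for url, count in ranked if count >= min_mentions]
```

The candidates could then be fed through the trained filter for a while before being offered as new subscriptions.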

Emanuel notes that he uses Mozilla to read his blogs via NNTP, and thinks the spam filtering in Moz (which uses Bayesian techniques) should do the job for him. Sounds reasonable to me.