Published May 26th, 2003 by Jim O'Halloran

Bayesian Filtering in Aggregators?

I blogged this mainly because I wanted a reference to the original “A Plan for Spam” paper on Bayesian Filtering, and the follow up “Better Bayesian Filtering”, both by Paul Graham.

I’ve been toying with the idea of writing my own news aggregator for a while because I’ve yet to find something that’s exactly what I want. But then my mind runs away with an idea. With most current aggregators, you read the entire feed, or none of it… If an email client (once trained) can use Bayesian classifiers to distinguish between spam and real email, why couldn’t an aggregator use a similar idea to classify posts in an RSS feed?

Starting with my current subscriptions list, allow me to show the aggregator which articles/posts in the feeds I do like, and those that I don’t. Using that information, begin to filter the feeds to hide some articles. For example, I like to read whatever Jeremy writes about MySQL, flying, or search engines, but I’m less interested in his test posts.
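A minimal sketch of what that training and filtering loop might look like, assuming a plain two-class naive Bayes with add-one smoothing rather than the exact formula from “A Plan for Spam”. The names PostFilter, train, and like_probability are hypothetical, and the sample posts are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class PostFilter:
    """Hypothetical two-class naive Bayes over post text: 'like' vs 'dislike'."""

    LABELS = ("like", "dislike")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.post_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record one post the reader marked as 'like' or 'dislike'."""
        self.word_counts[label].update(tokenize(text))
        self.post_counts[label] += 1

    def like_probability(self, text):
        """Estimate the probability that the reader would like this post."""
        vocab = set(self.word_counts["like"]) | set(self.word_counts["dislike"])
        total_posts = sum(self.post_counts.values())
        log_scores = {}
        for label in self.LABELS:
            # Class prior, smoothed so an untrained filter returns 0.5.
            log_p = math.log((self.post_counts[label] + 1) / (total_posts + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word]
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = log_p
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = log_scores["dislike"] - log_scores["like"]
        diff = max(min(diff, 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


# Made-up training data standing in for posts the reader has rated.
post_filter = PostFilter()
post_filter.train("MySQL replication tuning tips", "like")
post_filter.train("search engine indexing and MySQL", "like")
post_filter.train("test post please ignore", "dislike")
post_filter.train("another test post", "dislike")

for title in ("MySQL search tips", "test post"):
    verdict = "show" if post_filter.like_probability(title) >= 0.5 else "hide"
    print(verdict, title)
```

The 0.5 cut-off is an arbitrary choice for the sketch. A real aggregator would presumably train on full post bodies and let the reader tune the threshold.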

Perhaps, once sufficiently trained, the aggregator could start to use Feedster to locate other stuff I might be interested in, and add that into the subscription/filtering process.

I guess to sum it up in one sentence, the perfect aggregator should customise itself to show me what I want to read, then go find new stuff for me to read once it knows what I want to read.

Dunno if I’ll ever get time to write something like that, but the idea’s free for anyone to pick up and use. If anyone does this let me know and I’ll point to it. If you happen to use it in a commercial product though I’d appreciate a free licence so I can play with the end result :)


14 Responses to “Bayesian Filtering in Aggregators?”

  1. 1

    Scott Johnson Says

    Hi there,

    That’s just an awesome idea. Email me at sjohnson AT fuzzygroup.com with any future ideas you have if you want me to think about how to get this into Feedster.

    Thank you very much.

  2. 2

    Kenneth Says

    Very interesting idea. If the algorithm is to work for unknown weblogs and posts, how does it find them?

    Would you pull the last 10-20 updates to weblogs.com and blo.gs for filtering?

    Other than a service offered by a server, the only option I see is the client side doing the heavy lifting, and adding randomness to the mix for filtering seems a way to tackle this.

    Perhaps jumping off from known weblogs’ blogrolls would be another option…

  3. 3

    Emmanuel Says

    So maybe I will say something inaccurate here, but it seems to me that the solution to Bayesian filtering is very near. As far as I remember, this is exactly the technology that is used in the Mozilla mail reader to detect spam, isn’t it? If it is, then, we almost have it. I am reading feeds using nntp//rss, so I should be able to mark some incoming news as spam.

  4. 4

    jm Says

    To discover new feeds, you might use
    recommended reading-like algorithms
    e.g. http://diveintomark.org/projects/recommended_reading/
    (and anti-recommended reading, using the feeds you rejected as a source)
    use the source/link information in RSS feed you like to autodiscover a new feed

    Also source feeds like:
    NewsIsFree Recent new sources or Syndic8 recently approved feeds are good ways to get new RSS feeds

  5. 5

    Matt Griffith Says

    Here are a few links you might be interested in:

    Jon Udell: SpamBayes futures
    http://weblog.infoworld.com/udell/2003/05/09.html

    Sam Ruby: Beyond Backlinks
    http://www.intertwingly.net/stories/2002/05/31/beyondBacklinks.html

    John Beimler: Serval, an aggregator with Whuffie
    http://john.beimler.org/serval_aggregator_first_post.html

    Matt Griffith: Where is RSSBayes?
    http://mattgriffith.net/PermaLink.aspx/129

  6. 6

    Mike Says

    Instead of a binary “show/hide”, why not a score from 1 to 10, and then sorting articles and/or feeds in order of score?

    That way, on the days you’re REALLY bored, you can still read all the 2’s. And on days you’re slammed, you can just read the 10’s.

  7. 7

    Martin Says

    I think the bayesian filter approach (as in email/spam) has one major drawback/shortcoming when used for aggregators:
    I’m not only interested in one kind of post/topic, but rather a few categories, which can be varied and very different. I might also find something totally new by accident, which I’d like to keep reading in the future… A Bayesian filter can’t help here.

    What is needed is a way of ‘more like this’… A way to mark posts I like and have the aggregator show me others that are similar…

  8. 8

    Nick Lothian Says

  9. 9

    Kyvinh Says

    Wow, shared consciousness, as they say… check out http://amphetarate.sf.net
    It does exactly what people on this page are describing!

  10. 10

    Mike Says

    Take a look at www.feedbeagle.com. You create some categories and assign some stories from RSS feeds to them. FeedBeagle can then automatically assign new stories to your categories based on a Bayesian analysis of their content.

  1. 1

    Jeremy Zawodny's blog

    Trackback on May 27th, 2003 at 12:48 pm
  2. 2

    PHP Complete

    Trackback on May 27th, 2003 at 9:42 pm
  3. 3

    l.m.orchard

    Trackback on May 28th, 2003 at 1:33 pm
  4. 4

    a little ludwig goes a long way

    Trackback on May 29th, 2003 at 12:25 am
