Published May 26th, 2003 by Jim O'Halloran
Bayesian Filtering in Aggregators?
I blogged this mainly because I wanted a reference to the original “A Plan for Spam” paper on Bayesian Filtering, and the follow-up “Better Bayesian Filtering”, both by Paul Graham.
I’ve been toying with the idea of writing my own news aggregator for a while because I’ve yet to find something that’s exactly what I want. But then my mind runs away with an idea. With most current aggregators, you read the entire feed, or none of it… If an email client (once trained) can use Bayesian classifiers to distinguish between spam and real email, why couldn’t an aggregator use a similar idea to classify posts in an RSS feed?
Starting with my current subscriptions list, allow me to show the aggregator which articles/posts in the feeds I do like, and those that I don’t. Using that information, begin to filter the feeds to hide some articles. For example, I like to read whatever Jeremy writes about MySQL, flying, or search engines, but I’m less interested in his test posts.
Perhaps, once sufficiently trained, the aggregator could start to use Feedster to locate other stuff I might be interested in, and add that into the subscription/filtering process.
I guess to sum it up in one sentence, the perfect aggregator should customise itself to show me what I want to read, then go find new stuff for me to read once it knows what I want to read.
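To make the idea a bit more concrete, here is a minimal sketch of the like/dislike training step as a two-class naive Bayes classifier over post text. It is only an illustration under my own assumptions: the PostFilter class, its method names, and the "like"/"dislike" labels are all hypothetical, it uses plain add-one smoothing rather than the word weighting Paul Graham describes, and it ignores feed metadata such as author or category.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class PostFilter:
    """Two-class naive Bayes over post text (hypothetical helper, for illustration)."""

    LABELS = ("like", "dislike")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.post_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record one post the reader marked as 'like' or 'dislike'."""
        self.word_counts[label].update(tokenize(text))
        self.post_counts[label] += 1

    def probability_liked(self, text):
        """Estimate the probability that the reader would like this post."""
        vocab = set(self.word_counts["like"]) | set(self.word_counts["dislike"])
        total_posts = sum(self.post_counts.values())
        log_scores = {}
        for label in self.LABELS:
            # Class prior, smoothed so an untrained class is never impossible.
            score = math.log((self.post_counts[label] + 1) / (total_posts + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing; the extra +1 leaves room for unseen words.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Turn the two log scores into a normalised probability for "like".
        diff = log_scores["dislike"] - log_scores["like"]
        diff = max(min(diff, 700.0), -700.0)  # keep math.exp from overflowing
        return 1.0 / (1.0 + math.exp(diff))


post_filter = PostFilter()
post_filter.train("MySQL replication tips and query tuning", "like")
post_filter.train("Flying a Cessna over the bay", "like")
post_filter.train("Search engines ranking", "like")
post_filter.train("Test post please ignore", "dislike")
post_filter.train("Another test post", "dislike")

# Posts scoring below some threshold (say 0.5) would be hidden.
print(post_filter.probability_liked("MySQL query performance"))
print(post_filter.probability_liked("Test post"))
```

With only a handful of training posts the probabilities are rough, so a real aggregator would presumably want a fair amount of training before it starts hiding anything.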
Dunno if I’ll ever get time to write something like that, but the idea’s free for anyone to pick up and use. If anyone does this, let me know and I’ll point to it. If you happen to use it in a commercial product though, I’d appreciate a free licence so I can play with the end result.
Scott Johnson Says
Hi there,
That’s just an awesome idea. Email me at sjohnson AT fuzzygroup.com with any future ideas you have if you want me to think about how to get this into Feedster.
Thank you very much.
May 27th, 2003 at 8:23 pm
Kenneth Says
Very interesting idea. If the algorithm is to work for unknown weblogs and posts, how does it find them?
Would you pull the last 10-20 updates to weblogs.com and blo.gs for filtering?
Other than a service offered by a server, the only option I see is the client side doing the heavy lifting, and adding randomness to the mix for filtering seems a way to tackle this.
Perhaps jumping off from known weblogs’ blogrolls would be another option…
May 27th, 2003 at 10:04 pm
Emmanuel Says
So maybe I will say something inaccurate here, but it seems to me that the solution to Bayesian filtering is very near. As far as I remember, this is exactly the technology that is used in the Mozilla mail reader to detect spam, isn’t it? If it is, then, we almost have it. I am reading feeds using nntp//rss, so I should be able to mark some incoming news as spam.
May 28th, 2003 at 6:26 am
jm Says
To discover new feeds, you might use
recommended reading-like algorithms
e.g. http://diveintomark.org/projects/recommended_reading/
(and anti-recommended reading, using the feeds you rejected as a source)
Use the source/link information in the RSS feeds you like to autodiscover new feeds
Also source feeds like:
NewsIsFree Recent new sources or Syndic8 recently approved feeds are good ways to get new RSS feeds
May 28th, 2003 at 4:26 pm
Matt Griffith Says
Here are a few links you might be interested in:
Jon Udell: SpamBayes futures
http://weblog.infoworld.com/udell/2003/05/09.html
Sam Ruby: Beyond Backlinks
http://www.intertwingly.net/stories/2002/05/31/beyondBacklinks.html
John Beimler: Serval, an aggregator with Whuffie
http://john.beimler.org/serval_aggregator_first_post.html
Matt Griffith: Where is RSSBayes?
http://mattgriffith.net/PermaLink.aspx/129
May 29th, 2003 at 9:46 am
Mike Says
Instead of a binary “show/hide”, why not a score from 1 to 10, and then sorting articles and/or feeds in order of score?
That way, on the days you’re REALLY bored, you can still read all the 2’s. And on days you’re slammed, you can just read the 10’s.
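As a rough sketch of the scoring idea, assuming the classifier already produces a probability between 0 and 1 for each post: the to_score and rank_posts names, the (title, probability) input shape, and the min_score cut-off are all made up for illustration.

```python
def to_score(probability, levels=10):
    """Map a probability in [0, 1] to an integer score from 1 to `levels`."""
    clamped = min(max(probability, 0.0), 1.0)
    return min(levels, int(clamped * levels) + 1)


def rank_posts(posts, min_score=1):
    """Score (title, probability) pairs and sort them, highest score first.

    Posts scoring below `min_score` are dropped, so a busy reader can ask
    for only the 10s while a bored one can keep everything.
    """
    scored = [(to_score(probability), title) for title, probability in posts]
    kept = [pair for pair in scored if pair[0] >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


posts = [
    ("Test post", 0.15),
    ("MySQL tuning notes", 0.95),
    ("Weekend flying trip", 0.55),
]
for score, title in rank_posts(posts):
    print(score, title)
```

Calling rank_posts(posts, min_score=10) would leave only the top-scoring posts for the slammed days.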
May 30th, 2003 at 2:49 am
Martin Says
I think the Bayesian filter approach (as in email/spam) has one major drawback/shortcoming when used for aggregators:
I’m not only interested in one kind of post/topic, but rather a few categories, which can be varied and very different. And I might also find something totally new by accident, which I’d like to read more of in the future… A Bayesian filter can’t help here.
What is needed is a way of ‘more like this’… A way to mark posts I like and have the aggregator show me others that are similar…
May 30th, 2003 at 5:15 am
Nick Lothian Says
If you are still interested in this, have a look at http://www.mackmo.com/nick/blog/java/?permalink=nntprssc4javailable.txt and http://www.mackmo.com/nick/blog/java/?permalink=classifier4jnntprss.txt
Jul 24th, 2003 at 12:06 pm
Kyvinh Says
Wow, shared consciousness, as they say… check out http://amphetarate.sf.net
It does exactly what people on this page are describing!
May 26th, 2004 at 7:10 am
Mike Says
Take a look at www.feedbeagle.com. You create some categories and assign some stories from RSS feeds to them. FeedBeagle can then automatically assign new stories to your categories based on a Bayesian analysis of their content.
Nov 18th, 2004 at 4:31 am