What's That Noise?! [Ian Kallen's Weblog]

« Business Models in... | Main | Scaling Rails with... »

20090910 Thursday September 10, 2009

Hub-a-dub-bub, feed clouds in a tub

While at Technorati, I observed a distinct shift around summer of 2005 in the flow of pings from largely worthwhile pings to increasingly worthless ones. A ping is a simple message with a URL. Invented by Dave Winer, it was originally implemented with XML-RPC but RESTful variants also emerged. The intention of the message is "this URL updated." For a blog, the URL is the main page of the blog; it's not the feed URL, a post URL ("permalink") or any other resource on the site - it's just the main page. Immediately, that narrow scope has a problem; as a service that should do something with the ping, the notification has no more information about what changed on the site. A new post? Multiple new posts? Um, what are the new post URL(s)? Did the blogroll change? And so on. Content fetching and analysis is cheap; network latencies, parsing content and doing interesting stuff with it is easy at low volumes but is difficult to scale.

In essence, a ping is a cheap message to prompt an expensive analysis to detect change. It's an imbalanced market of costs and benefits. Furthermore, even if the ping had a richer payload of information, pings lack another very important component: authenticity. These cheap messages can be produced by anybody for any URL and the net result I observed was that lots of people produced pings for lots of things, most of which weren't representative of real blogs (other types of sites or just spam) or real changes on blogs, just worthless events that were resource intensive to operate on. When I left Technorati in March (2009), we were getting around 10 million pings a day, roughly 95% of them of no value. Pings are the SMTP of the blogosphere; weak identity systems, spammy and difficult to manage at scale.

Besides passively waiting for pings, the other method to find things that have changed is to poll sites. But polling is terribly inefficient. To address those inefficiencies, FriendFeed implemented Simple Update Protocol. SUP is reminiscent of another Dave Winer invention, changes.xml however SUP accounts for discovery and offers a more compact format.

But SUP wasn't the first effort to address the aforementioned deficiencies, 2005 saw a lot of activity around something called "feedmesh." Unfortunately, the activity degenerated into a lot of babble; noise I attribute to unclear objectives and ego battles but I'm sure others have their own interpretation. The feedmesh discussion petered out and little value beyond ping-o-matic and other ping relayers emerged from it. Shortly thereafter SixApart created their atom stream, I think spearheaded by Brad Fitzpatrick. The atom stream is essentially a long lived HTTP connection that streams atom elements to the client. The content flow was limited to SixApart's hosted publishing platforms (TypePad, LiveJournal and Vox) and the reliability wasn't that great, at least in the early days, but it was a big step in the right direction for the blog ecosystem. The atom stream was by far the most efficient way to propogate content from the publishing platform to interested parties such as Technorati operating search, aggregation and analytic systems. It eliminates the heavyweight chatter of cheap ping messages and the heavyweight process that follows: fetch page, discover feed, fetch feed, inspect contents and do stuff with it.

So here we are in 2009 and it feels like deja-vu all over again. This time Dave is promoting rssCloud. rssCloud does a waltz of change notification; the publisher notifies a hub, the hub broadcasts notifications to event subscribers and the subscriber does the same old content fetch and analysis cycle above. rssCloud seems fundamentally flawed in that it is dependent on IP addresses as identifiers. Notification receivers who must use new a IP address must re-subscribe. I'm not sure how an aggregator should distribute notificaton handling load across a pool of IP addresses. The assumption is that notification receivers will have stable, singular IP addresses; rssCloud appears scale limited and a support burden. The focus on specification simplicity has its merits, I think we all hate gratuitous complexity (observe the success of JSON over SOAP). However, Dave doesn't regard system operations as an important concern; he'll readily evangelize a message format and protocol but the real world operability is Other Peoples Problem.

Pubsubhubub (hey, Brad again) has a similar waltz but eliminates the content fetch and analysis cycle ergo it's fundamentally more efficient than rssCloud's waltz. Roughly, Pubsubhubub is to the SixApart atom stream what rssCloud is to old school XML-RPC ping. If I were still at Technorati (or working on event clients like Seesmic, Tweetdeck, etc), I would be implementing Pubsubhubbub and taking a pass on rssCloud. With both systems, I'd be concerned with a few issues. How can the authenticity of the event be trusted? Yes, we all like simplicity but looking at SMTP is apt; now mail systems must be aware of SPF, DKIM, SIDF, blahty blah blah. Its common for mail clients to access MTA's over cryptographically secure connections. Mail is a mess but it works, albeit with a bunch of junk overlaid. I guess this is why Wave Protocol has gathered interest. Anyway, Pubsubhubbub has a handshake during the subscription processes, though that looks like a malicious party could still spin up endpoints with bogus POSTs. I'd like to see an OAuth or digest authentication layer in the ecosystem. Yea, it's a little more complicated, but nothing onerous... suck it up. Whatever. Brad knows the authenticity rap, I mean he also invented OpenID (BTW, we adopted OpenID at Technorati back in 2007). At Technorati we had to implement ping throttling to combat extraneous ping bots; either daemons or cron jobs that just ping continuously, whether the URL has changed or not. You can't just blacklist that URL, we don't know it was the author generating those pings, there's no identity authenticity there. We resorted to blocking IP addresses but that scales poorly and creates other support problems. We had 1000's of domain names blocked and whole class C networks blocked but it was always a wackamole exercise; so much for an open blogosphere, a whitelist is the only thing that scales within those constraints. Meanwhile, back to rssCloud and Pubsubhubbub, what event delivery guarantees can be made? If a data center issue or something else interrupts event flow, are events spooled so that event consumption can be caught up? How can subscribers track the authenticity of the event originators? How can publishers keep track of who their subscribers are? Well, I'm not at Technorati anymore so I'm no longer losing any sleep over these kinds of concerns but do I care about the ecosystem. For more, Matt Mastracci is posting interesting stuff about Pubsubhubbub and a sound comparison was posted to Techcrunch yesterday.


( Sep 10 2009, 12:35:37 PM PDT ) Permalink


Post a Comment:

Comments are closed for this entry.