What's That Noise?! [Ian Kallen's Weblog]

Wednesday March 04, 2009

New Crawlers At Technorati

A lot of changes are afoot at Technorati. Over the last year or so, we've been looking inward at the infrastructure and asking ourselves, "How can we do this better?". The data spigot that Technorati builds on was the first thing to focus on, it's a critical part in one leg of the back-end infrastructure tripod. The tripod consists of data acquisition, search and analytics Technorati; while the ping handling and queuing are relatively simple affairs the crawler is the most sophisticated of the data acquisition subsystems. It's proper functioning is critical to the functioning of the other legs; when it doesn't function well, search and analytics don't either (GIGO="garbage in/garbage out").

As Dorion mentioned recently, we're retiring the old crawler. Why are we giving the old crawler getting an engraved watch and showing it to the door? Well, old age is one reason. The original spider is a technology that dates back to 2003, the blogosphere has changed a lot since then and we have a much better developed understanding of the requirements. The original spider code has presented a sufficient number of GIGO-related and code maintenance challenges to warrant a complete re-thinking. It contrasts starkly with the replacement.

Data model: There are a lot of ways to derive structural information out of the pages and feeds that a blog presents. The old spider used event driven parses, building a complex state as it went with flat data structures (lists and hashes). The new one uses the composed web documents to populate a well-defined object model; all crawls normalize the semi-structured data found on the web to that model.
Crawl persistence: The old spider was hard-wired to persist the aforementioned data structure elements to relational databases (sharded MySQL instances) while it was parsing, so that the flow of saving parsed data was closely coupled with parsing events, forsaking transactional integrity and consuming costly resources. The new spider composes and saves its parse result as a big discreet object (not collections of little objects in an RDBMS). This reduced the hardware footprint by an order of magnitude.
Operational visibility: Whether a blog's page structure was understood (or not), the feed was well formed (or not) or any of the many other things that determine the success or quality of a blog's crawl was opaque under the old spider. With the new spider, detailed metadata and metrics are tracked during the crawl cycles. This better enables the team to support bloggers and extend the system's capabilities.
Unit tests: Wherever you have complex, critical software you want to have unit tests. The old spider had almost no unit tests and was developed in a way that made testing the things that mattered most exceptionally difficult. The new spider was developed with a test harness upfront, it now has hundreds of tests that validate thousands of aspects of the code. The test are uniformly invoked by the developers and automatically whenever the code is updated (AKA under continuous integration).

The old spider didn't leverage packages to logically separate the different concerns (fetching, parsing, validation, change determination, etc), the aforementioned flat data structures, mingling of concerns and absence of unit tests made changing it exceedingly difficult. Now, we have a whole that is greater than the sum of the parts; having a well defined data model, sensible persistence, operational visibility and unit tests has added up to an order of magnitude improvement across several dimensions. The real benefits are seen when we've shown that the system is easy to change, I mentioned this several weeks ago when I noted the ease with which we could adapt custom requirements to crawl the White House blog.

Another change that we've made is to the legacy assumption that everything that pings is a blog. That assumption proved to be increasingly untenable as the ping meme spread amongst those who didn't really understand the difference between some random page and a blog, nefarious publishers (spammers) and other perpetrators of spings. Over 90% of the pings hitting Technorati are rejected outright because they've been identified as invalid pings. A large portion of the remainder are later determined to be invalid but we now have a rigorous system in place for filtering out the noise. We've reduced the spam level considerably (as mentioned in a prior post). For instance, there's a whole genre of splogs that are pornography focused (hardcore pictures, paid affiliate links, etc) that previously plagued our data; now we've eliminated a lot of that nonsense from the index.

Here are a pair of charts showing the daily occurrence of a particular porn term in the index.

6 month retrospective as of November 3rd, 2008:

6 month retrospective today, 5 months later:

As you can see, that's an order of magnitude reduction; 90% of the occurrences of that term was spam.

So what's next for the crawler? We've got some stragglers on the old spider, we're going to migrate them over in the next few days. There are still a lot of issues to shake out, as with any new software (for instance, there are still some error recovery scenarios to deal with). But it's getting better all of the time (love that song). We'll be rolling out new tools internally for identifying where improvements are needed, ultimately we'd like to enable bloggers to help themselves to publish, get crawled, be found and recognized more effectively. And there are more changes afoot, stay tuned.

technorati web crawling software spam splogs

( Mar 04 2009, 08:31:16 PM PST ) Permalink
Comments [2]

Comments:

Sounds nifty! Couple of q's: what language are you writing the new one in? Due to the use of the term "package", I'd guess it's perl ;) In terms of saving the parse result as a big discrete object -- are you still storing those in an SQL db? or have you switch to a new store as well? (ISTR mention of Hadoop recently...)

Posted by Justin Mason on March 05, 2009 at 03:36 AM PST #

Both the old and the new are written in Python but they share almost no code in common. Python has concepts of packages and modules that and similar to Perl's and Java's but... different. Though in the past Perl and Java were my tools of choice, when faced with the fact that my collaborators generally won't be productive in those languages, it became clear to look elsewhere. And bug-chasing in the old spider had forced me to acquire expertise with Python :) Actually, one of the key factors leading us to use Python was lxml, a parsing library that provides a lot of API goodies and benefits from libxml2's performance. libxml2 has weirdities of its own but is generally faster than expat. For persistence, we wanted network accessible key/value storage. I wrote a minimalist WebDAV client (strangely, the existing Python clients sucked but writing one for the basic verbs was simple). The I put a layer on top that allows us to store millions of objects in a hashed filesytem; the calling code simply sees key/value storage.

Posted by Ian Kallen on March 05, 2009 at 03:54 AM PST #

Sun	Mon	Tue	Wed	Thu	Fri	Sat
« July 2025
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Today

What's That Noise?! [Ian Kallen's Weblog]

Wednesday March 04, 2009

Technorati

Who?

Twitter

links

calendar

search