What's That Noise?! [Ian Kallen's Weblog]


20060506 Saturday May 06, 2006

The Evils of Blogger's URL Recycling

Blog publishing services typically propagate updates about new posts from blogs (ergo, new blogs too) by pinging or publishing a changes.xml file. But what none of the services provide is an "un-ping" -- blog indexing services such as Technorati don't know when a blog has been deleted from a service. I noticed this today when I found http://blogtrarian.blogspot.com/ participating in a link farm infesting Blogger's service. This can happen because Google's Blogger recycles URLs; when a blog is removed from the system, the URL is freed for reuse.

That particular URL dates back to 2004. It was dormant for several months but recently came back to life with spam. The historic posts (until August 2005) look like normal blogging fare, but the recent posts are clearly just splog content. We'll have to work on "un-pinging" so it's easier to distinguish dormant blogs from dead ones.
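Until there is an "un-ping", one rough proxy is the posting history itself: a long silence followed by a sudden resumption is worth a second look. Here is a minimal sketch of that check. The function name, the 180-day threshold and the sample dates are all made up for illustration; they are not the actual history of the blog above or anything Technorati runs.

```python
from datetime import date, timedelta

def resumptions(post_dates, threshold=timedelta(days=180)):
    """Return the dates on which posting resumed after a silence of at
    least `threshold`. The threshold is an arbitrary assumption."""
    ordered = sorted(post_dates)
    resumed = []
    # Compare each post with the one immediately before it.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier >= threshold:
            resumed.append(later)
    return resumed

# Hypothetical post dates: regular activity, a long gap, then a burst.
dates = [
    date(2004, 11, 1),
    date(2005, 3, 10),
    date(2005, 8, 20),
    date(2006, 4, 28),
    date(2006, 4, 29),
]
print(resumptions(dates))
```

A flagged gap only marks a candidate. Plenty of legitimate bloggers take long breaks, so the content on either side of the gap still needs a look.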

           

( May 06 2006, 03:13:14 PM PDT ) Permalink


20060505 Friday May 05, 2006

Google Is Full?

So Google's CEO Eric Schmidt says his servers are full, hmm. Tying that to SEO'ers griping about their indexing, Andrew Orlowski speculates that it's web spam besetting big daddy. Could be, but the hard data isn't out in the wild. The numbers that we can see are that Google is spending several banana republics' worth of GDP on capital expenses:

Google continued to make substantial capital investments, mainly in computer servers, networking equipment and its data centers. It spent $345 million on such items in the first quarter, more than double the level of last year. Yahoo, its closest rival, spent $142 million on capital expenses in the first quarter.
Referring to the sheer volume of Web site information, video and e-mail that Google's servers hold, Schmidt said: "Those machines are full. We have a huge machine crisis." (read more)

If the problem is spam, then certainly it's Google's own doing. The elephant in the room is that the acceleration of web spam everyone's talking about is fueled by AdSense, often aided and abetted by Blogger splogs, Google Pages, Google Base, etc. The spam ecosystem is within Google's capacity to rein in, but the don't-be-evil company is making too much money on click fraud with plausible deniability to do anything about it. Is Google having problems handling web spam and "filling up" their machines? Cry me a river, all the way to the bank.

         

( May 05 2006, 02:09:19 PM PDT ) Permalink


20060504 Thursday May 04, 2006

Thwarting Kleptotorial

When I read the words

Microsoft yesterday reached a tentative $70 million deal to settle a California class-action antitrust lawsuit, according to a statement by the law firm representing the plaintiffs in the suit.
at http://www.satishlive.info/?p=27, I had a distinct sense of deja vu. So I ran some queries against Technorati's index and, sho-nuf, I found the exact same content had already been published by InfoWorld. Ah, there was an attribution at the bottom... but InfoWorld didn't publish under a Creative Commons license. Looks like blatant theft.

Then I checked the next post (http://www.satishlive.info/?p=28) on that blog and read:

I took a new blog search tool called Sphere for a little spin this morning and found it useful.
... hey, didn't I just see that somewhere else? Yep, this time it was PC World and no attribution.

It's safe to surmise that this is kleptotorial laden with AdSense and stuffed into the update stream. I've seen screenscrapes and feedscrapes on splogs before, but they're usually easier to identify visually; I had to look more carefully at this one to note its spamminess. Is there a market in alerting publishers to copyright infringement? Obviously this stuff should be removed from Technorati's index, but is there a more valuable service to publishers that should be provided here? How much would you pay to find out about misappropriations of your content? Is there a market for Technorati to do something like Plagiarism.org and fingerprint blog content?
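For the curious, the fingerprinting idea can be sketched with word shingles: break each text into overlapping runs of words, hash them, and measure how many hashes two documents share. This is a generic technique, not a description of how Plagiarism.org or Technorati actually work. The function names and the choice of five-word shingles are assumptions for illustration.

```python
import hashlib
import re

def shingles(text, k=5):
    """Return the set of overlapping k-word runs in `text`, ignoring
    case and punctuation. k=5 is an arbitrary choice."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def fingerprint(text, k=5):
    """Hash each shingle so that only short digests need to be stored."""
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()[:16]
            for s in shingles(text, k)}

def overlap(a, b, k=5):
    """Jaccard similarity of two fingerprints: 0.0 means no shared
    shingles, 1.0 means identical shingle sets."""
    fa, fb = fingerprint(a, k), fingerprint(b, k)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)

original = ("Microsoft yesterday reached a tentative $70 million deal "
            "to settle a California class-action antitrust lawsuit")
scraped = ("MICROSOFT yesterday reached a tentative $70 million deal   "
           "to settle a California class-action antitrust lawsuit")
unrelated = ("I took a new blog search tool called Sphere for a little "
             "spin this morning")

print(overlap(original, scraped))    # same words, different case and spacing
print(overlap(original, unrelated))  # no five-word run in common
```

A verbatim scrape scores at or near 1.0 even if the case or whitespace changes, while a legitimate short quote inside a longer post scores low. Where to draw the line between the two is a judgment call that this sketch doesn't settle.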

             

( May 04 2006, 09:34:17 PM PDT ) Permalink


20060502 Tuesday May 02, 2006

The Colbert Smackdown


In case you've been hiding under a rock, the blogosphere is abuzz about Stephen Colbert's weekend dressing-down of George Bush and just about everything else inside the Beltway. If you haven't seen it, watch the C-SPAN videos or read the transcript.

The chatter (even artwork on Flickr) about it is frantic. Thank You Stephen Colbert has 700 links right now (this is a blog that came into being less than 72 hours ago), and it's getting about five or ten links per hour at the moment. The videos are the most linked-to YouTube reels on Technorati. How wonderful it is to have an administration so bad that the opportunities for high humor are so many. Why did we invade Iraq?

                           

( May 02 2006, 09:27:30 PM PDT ) Permalink