There's a series of "Mac vs. PC" ad knock-offs for Ruby on Rails on YouTube; they're really funny. I'm starting to use Ruby in favor of Perl (or trying to) for a lot of everyday duct-tape stuff; it's a great language. Some of the hyperbole around Ruby and Rails and peace-on-earth is a little amusing too, but for now, laugh along and let 'em have their fun!
I've had my fill of MySQL's quirks, so I thought I'd plumb for PostgreSQL's. So many of the things that MySQL is fast and loose about, PostgreSQL is strict and correct about. However, while fiddling around with PostgreSQL's equivalent to MySQL's enum, I found something I would expect a strict RDBMS to be strict about... not so strict.
PostgreSQL does not have enum, but there are a few different ways you can define your own data types and constraints and thereby prescribe your own constrained data type. This table definition will confine the values in 'selected' to 5 characters, with the only options available being 'YES', 'NO' or 'MAYBE':
ikallen=# create table decision ( selected varchar(5) check (selected in ('YES','NO','MAYBE')) );
CREATE TABLE
ikallen=# insert into decision values ('DUH');
ERROR: new row for relation "decision" violates check constraint "decision_selected_check"
ikallen=# insert into decision values ('CLUELESS');
ERROR: value too long for type character varying(5)
ikallen=# insert into decision values ('MAYBE');
INSERT 0 1

I don't want to hear any whining about how diff-fi-cult constrained types are. Welcome to the NBA, where RDBMSs throw elbows. The flexibility you get from loosely constrained types will come back to bite you on your next programming lapse.
So what's wrong with this:
ikallen=# create table indecision ( selected varchar(5) check (selected in ('YES','NO','MAYBE SO')) );
CREATE TABLE
ikallen=# insert into indecision values ('MAYBE');
ERROR: new row for relation "indecision" violates check constraint "indecision_selected_check"
ikallen=# insert into indecision values ('MAYBE SO');
ERROR: value too long for type character varying(5)
ikallen=#

'MAYBE SO' is in my list of allowed values but violates the width constraint. Should this have ever been allowed? Shouldn't PostgreSQL have complained vigorously when a column was defined with
varchar(5) check (selected in ('YES','NO','MAYBE SO'))
? Yes? No? Maybe?
Well, I think so.
One of the cool things about PostgreSQL is the ability to define a constrained type and use it in your table definitions:
ikallen=# create domain ynm varchar(5) check (value in ('YES','NO','MAYBE'));
CREATE DOMAIN
ikallen=# create table coolness ( choices ynm );
CREATE TABLE
ikallen=# insert into coolness values ('nope');
ERROR: value for domain ynm violates check constraint "ynm_check"
ikallen=# insert into coolness values ('YES');
INSERT 0 1

Coolness!
Contrast with MySQL's boneheaded handling of what you'd expect to be a constraint violation:
mysql> create table decision ( choice enum('YES','NO','MAYBE') );
Query OK, 0 rows affected (0.01 sec)

mysql> insert into decision values ('ouch');
Query OK, 1 row affected, 1 warning (0.03 sec)

mysql> select * from decision;
+--------+
| choice |
+--------+
|        |
+--------+
1 row in set (0.00 sec)

mysql> select length(choice) from decision;
+----------------+
| length(choice) |
+----------------+
|              0 |
+----------------+
1 row in set (0.07 sec)

mysql> insert into decision values ('MAYBE');
Query OK, 1 row affected (0.00 sec)

mysql> select * from decision;
+--------+
| choice |
+--------+
|        |
| MAYBE  |
+--------+
2 rows in set (0.00 sec)

mysql> select length(choice) from decision;
+----------------+
| length(choice) |
+----------------+
|              0 |
|              5 |
+----------------+
2 rows in set (0.00 sec)

Ouch, indeed. Wudz up wit dat?
There are a few things that MySQL is really good for, but if you want a SQL implementation that does what you expect for data integrity, you should probably be looking elsewhere.
postgresql mysql rdbms databases
( May 16 2007, 07:33:00 PM PDT ) Permalink

I had to take a few days off of work last week because of my aching back; it was really a fog-of-pain for a few days, but this week I'm on the mend and in beautiful Banff for the WWW 2007 conference. Actually, I'm mostly here for the AIRweb workshop but staying a few extra days to hear what folks are thinking about regarding the future of the web, online information retrieval, humanity, and so on.
The AIRweb submissions included a lot of web graph related research. Some of it makes quite intuitive sense: web spammers will link to their spam sites as well as legitimate sites (camouflage), but legitimate sites don't link to web spam sites. So some of the talks discussed the underlying linear algebra of these phenomena (Anti-TrustRank and BadRank) or their inapplicability to identifying spam (TrustRank). The presentations about temporal patterns, spam term density, the effects of on-the-fly re-ranking and javascript redirection were quite interesting.
A lot of these rank-demotion and web graph heuristics aren't really central to the efforts we have at Technorati for thwarting splogs. We instrument the data streams for baseline behaviors of various features. It's more like an intrusion detection system because, fundamentally, web spammers can't behave like "normal" publishers and still succeed; they have to compensate for their absence of popularity with all kinds of abnormal behaviors, and those behaviors are quite intrusive if you're listening for them. And so we are. This is by no means perfect but we're doing way better than 80-20. It's my belief that as the web becomes more participatory and there are incentives and opportunities to inject junk into it, intrusion detection will be as much a vital capability as search relevance rank demotion for maintaining a high quality experience. At the close of the workshop, I proposed that the web spam research community tell us what they want; what can we do to help? I can only imagine that Technorati's data streams could prove useful for the growing challenges of the participant-driven and temporally sensitive web.
So that was yesterday.
This morning, Tim Berners-Lee kicked off with a keynote that touched on the successive innovations of email, the web, wikis and blogs. On the iterative nature of technological and social change, he drew a cycling diagram of the needs that emerge when changes occur and enjoy widespread adoption, and the collaborative/creative forces that drive innovation. He laid out how the Semantic Web was the next iteration and how complex meaning will be readily accessible on the web. OK, that's all well and good. However, I just don't buy this idea that the Semantic Web is ... the Web at all. We have a web for people (he acknowledged as much at the beginning of the talk), but as for the idea of having tons of detailed data representations for generalized browsers of really complex data... I just don't get why folks won't end up building domain specific apps anyway. Building UIs for "general data representation" means that you'll never really be able to represent the domain specific qualities within some part of The Ontology. At least, I've never seen those things work. Useful apps need domain experts (champions of the end-user, e.g. product managers) and engineers to build something that works for that domain. Generic UIs break down when dealing with the nuances of specific domains. I want a data-rich web for humans that is machine consumable (microformats), not a parallel-universe web of machine-oriented RDF. Anyway, thanks for inventing the web, TBL, and good luck all you Semantic Webbers. I think you'll need it.
I almost fell out of my chair, though, when TBL said that blog spam isn't really a problem. I'll surmise that he has a set feed reader repertoire (or, old school, bookmarks) and doesn't use blog search much. While I think we've done a pretty good job spam scrubbing Technorati, the fact remains that there is a veritable ocean of pinging rubbish mongers engaging in underhanded payola schemes, kleptotorial and other nefarious endeavors out there. What spam you do see on Technorati is the tip of the iceberg. Tim, use our site, despite the iceberg tip :)
Side notes: when in Canada, going to "google.com" gets redirected to "google.ca", which includes a toggle to search "The Web"/"Pages from Canada" ... amusing, ergo the graphic in this post. Also, I can't believe how long the days are here; about 3 hours more daylight than the San Francisco Bay Area!
So thanks to Brian Davison, Carlos Castillo and Kumar Chellapilla for putting together a great AIRweb program, good work guys! I'm heading home tomorrow.
www2007 w3c airweb webspam search spam splogs splog ping technorati webgraphs linear algebra microformats semantic web tim berners-lee intrusion detection google banff canada
( May 09 2007, 09:44:35 PM PDT ) Permalink

I've been asleep just about all day; the pain killers and muscle relaxants they gave me last night were that good.
It all started a few weeks ago when EBMUD sent me a water bill that indicated over three times our normal water usage (and three times the cost). Everything seemed fine with all of the household plumbing. I called for an inspection, but their inspector didn't show up on the day I expected them. We did get a note left on the door, though, saying that, even while nobody is home, the water meter runs continuously and that our usage continues to be unusually high.
Over the weekend, I checked around the house more diligently. What I thought may have been a wet spot by the side of the garage (not far from a spigot) seemed like a good candidate, so I got the shovel and started digging. The soil didn't get much softer as I dug deeper. There was no specific motion or event that I recall being more vigorous than others, but in the hours that followed, a pain in my lower back grew. And grew. And grew to a point of intensity where everything I did hurt in my lower back. Sitting down. Getting up from a sitting position. Laying down. Everything hurt, intensely! A doctor friend of mine told me that I musta skipped chapter 2 of the "You're over 40 now" manual, where it is specified not to do any more shoveling. Doh!
At the emergency room, they gave me a cocktail of toradol, dilaudid and phenergan and a prescription for soma and percocet. The shot last night really knocked me out, I've been asleep off and on most of the day today. I'm gonna be doing a lot of laying down with ice on my back. A lot of walking around. But not a lot of sitting. So, I'm writing this post woozy from the drugs but standing upright with the pooter on the kitchen counter. Gonna go for a walk next. I need to resolve things with the water company and the plumbing on our premises.
back pain shoveling soma percocet toradol dilaudid
( Apr 30 2007, 04:47:38 PM PDT ) Permalink

The rhythm of baseball is always about hot streaks and cold streaks. In the 2006 season, the Giants couldn't put together any sustained hot streaks; it was a dark time for Giants fans -- I don't think they won more than 3 games in a row, and they only did that a few times. The first weeks of 2007 baseball were even darker; losing 7 out of the first 9 games disheartened a lot of fans. But what a difference now: the Giants have gone from a polar chill to an equatorial blaze in a matter of weeks; they've won 9 of their last 10!
Matt Cain finally has the victory he's been deserving; he's got a 1.55 ERA but what should be a 4-0 record is only at 1-1 so far. I think we're gonna see his W:L ratio shifting favorably in the weeks ahead. Barry Bonds is getting pitches, and smashing them. I'm sure soon enough competing team managers will get the message: the old Barry is back and crushinger than ever and we'll see lots of 4 finger calls. But for now, enjoy the ride.
Yesterday's victory came on the backs of a partial relief squad (Todd Linden and Lance Niekro) as Omar Vizquel and Dave Roberts took a rest (Roberts came on late in the game as a pinch runner and scored).
Next up this evening, Russ Ortiz will duel against Brad Penny and I'm looking forward to an exciting game. Three words: beat sweep el aye!
giants san francisco giants los angeles dodgers baseball
( Apr 26 2007, 07:14:56 AM PDT ) Permalink

There's a bunch of code that I haven't had to work on in months. Some of it predates my migration from a PPC Powerbook to the Intel based MacBook Pro. Now that I'm dusting this stuff off, I'm running into binary incompatibilities that are messin' with my head. I recompiled my Apache 1.3/mod_perl installation just fine, but after doing a CVS up on the code I need to work on and updating the installation, there's a new CPAN dependency. No problem, use the CPAN shell. Oh, Class::Std::Utils depends on version.pm and it's ... the wrong architecture. Re-install version.pm. Next, XMLRPC::Lite is unhappy 'cause it depends on XML::Parser::Expat and it's ... the wrong architecture.
Aaaaugh!
The typical error looks like
mach-o, but wrong architecture at /System/Library/Perl/5.8.6/darwin-thread-multi-2level/DynaLoader.pm

I just said "screw it" and typed "cpan -r" ... which looks to be the moral equivalent of "make world" from back in my FreeBSD days. Everything that has an XS interface just needs to be recompiled.
Compiling... compiling... compiling. I guess that'll give me time to write a blog post about it. OK, that's done, seems to have fixed things: back to work.
perl mac apple macosx intel expat cpan macbook pro powerbook
( Apr 25 2007, 05:19:37 PM PDT ) Permalink

I was working on an Evil Plan (tm) to serialize python feedparser results with simplejson.
parsedFeed = feedparser.parse(feedUrl)
print simplejson.dumps(parsedFeed)

Unfortunately, I'm hitting this:
TypeError: (2007, 4, 23, 16, 2, 7, 0, 113, 0) is not JSON serializable

I'm suspecting there's a dictionary in there that has a tuple as a key, and that's not allowed in JSON-land. So much for simple! Looks like I'll be writing a custom serializer for this. I was just trying to write a proof-of-concept demo; what I've proven is that just 'cause "simple" is in the name doesn't mean I'll be able to do everything I want with it very simply.
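If the offender turns out to be one of feedparser's parsed dates (they come back as time.struct_time values, which look an awful lot like that tuple), a custom default handler for simplejson might be all it takes. Here's a rough, untested sketch; encode_extras is just a name I made up:

import time
import feedparser
import simplejson

def encode_extras(obj):
    # feedparser hands back parsed dates as time.struct_time, which
    # simplejson refuses to serialize; turn them into ISO-8601-ish strings
    if isinstance(obj, time.struct_time):
        return time.strftime('%Y-%m-%dT%H:%M:%SZ', obj)
    raise TypeError('%r is not JSON serializable' % (obj,))

parsedFeed = feedparser.parse(feedUrl)   # feedUrl as in the snippet above
print simplejson.dumps(parsedFeed, default=encode_extras)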
I've had a long day. A good night's sleep and fresh eyes on it tomorrow will probably get this done but if yer reading this tonight and you happen to have something crafty up your sleeve for extending simplejson for things like this, let me know!
( Apr 23 2007, 10:50:21 PM PDT ) Permalink

I ran into a very peculiar case of an Apache 2.0.x installation with the worker MPM completely failing to spawn its configured thread pool. The hardware and kernel versions weren't significantly different from other systems running Apache with the same configuration. Here are the worker MPM params in use:
ServerLimit 40
StartServers 20
MaxClients 2000
MinSpareThreads 50
MaxSpareThreads 2000
ThreadsPerChild 50
MaxRequestsPerChild 0

But on this installation, running the same version of Apache and RedHat Enterprise Linux 4 as the rest, every time httpd started it would cap the number of threads spawned and leave these remarks in the error log:
[Fri Apr 20 22:54:24 2007] [alert] (12)Cannot allocate memory: apr_thread_create: unable to create worker thread
It turns out that a virtual memory parameter had been adjusted: vm.overcommit_memory had been set to 2 instead of 0. Here's the explanation of the parameters I found:
overcommit_memory is a value which sets the general kernel policy toward granting memory allocations. If the value is 0, then the kernel checks to determine if there is enough memory free to grant a memory request to a malloc call from an application. If there is enough memory, then the request is granted. Otherwise, it is denied and an error code is returned to the application. If the setting in this file is 1, the kernel allows all memory allocations, regardless of the current memory allocation state. If the value is set to 2, then the kernel grants allocations above the amount of physical RAM and swap in the system as defined by the overcommit_ratio value. Enabling this feature can be somewhat helpful in environments which allocate large amounts of memory expecting worst case scenarios but do not use it all.

From Understanding Virtual Memory

The vm.overcommit_ratio value is set to 50 on all of our systems, but rather than fiddling with that, setting vm.overcommit_memory to 0 had the intended effect; Apache started right up and readily stood up to load testing.
So, if you're seeing these kinds of evil messages in your Apache error log, use sysctl and check out the vm parameters. I haven't dug further into why the worker MPM was conflicting with this memory allocation config; next time I run into Aaron, I'm sure he'll have an explanation in his back pocket.
apache linux worker mpm threads redhat linux vm virtual memory
( Apr 22 2007, 08:19:57 PM PDT ) Permalink

I try to keep my ride on the cluetrain rolling by listening to what users of the services I help maintain have to say. The Technorati support forums have provided me with a great opportunity to hear what problems Technorati's members are experiencing. For the uninitiated, Technorati's crawler analyzes web pages to identify blog posts, make them searchable and identify links that measure what the blogosphere is paying attention to. There are a fair number of blogs that get caught in our automated blog flagging; the service processes several million pings per day and amidst that throughput, there are going to be mistakes in the flagging heuristics (flagged blogs are, naturally, called "flogs"; sometimes they end up demoted as "splogs," but others turn out to be legit blogs). I'm trying to reduce the mistake rate; the indexing hazards that folks run into are a source of much grief (it doesn't take much to find folks who are very vocal about such lapses).
So, I've been on a tear over the last few weeks chasing down problems in Technorati's crawler and identifying its failure conditions. It's code that, until recently, I've not been too intimate with, but inheriting responsibility for its functioning has forced me to study it more closely and grasp a firmer command of python programming. A peculiar failure case that had me puzzled for a while involved blogs that had (sufficiently) well formed pages and feeds; there didn't seem to be anything wrong with the data that'd prevent us from indexing them, and yet they consistently failed to get indexed. I first became aware of it in this topic.
The issue moved to a new topic where an initial diagnosis I offered (corrupted gzip encoding from Apache 2.2's mod_deflate, I thought) didn't quite pan out. But follow-ups from Technorati users KilRoY66 and wa7son helped clarify that the culprit was the gzip encoding that wordpress was configured to do. Apache 2.2/mod_deflate, you're off the hook. Their blogs (TNTVillage blog and justaddwater.dk | Instant Usability & Web Standards, respectively) both used Apache 2.2 but they both are also hosted on wordpress.org installations. For reasons yet to be explained, python's gzip library detects the encoding returned by wordpress as corrupted. Thank you, Technorati members, for helping identify this issue!
I'm going to patch the code (based on Mark Pilgrim's openanything) to recover from encoding errors and raise a proper exception if it's truly unrecoverable (as it stands, the code catches any exceptions from decompressing the bytes, prints a message and moves along, essentially swallowing a fundamental error). In the meantime, if you're not getting indexed by Technorati and you have wordpress' compression on, try turning it off and see if that makes a difference.
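For the curious, the recovery I have in mind is roughly shaped like the sketch below. The function name and the fall-back-to-raw-deflate guess are mine, not the crawler's actual code, so take it as a sketch rather than the patch itself:

import gzip
import struct
import zlib
from StringIO import StringIO

def decompress_body(raw):
    # first, try treating the bytes as a proper gzip stream
    try:
        return gzip.GzipFile(fileobj=StringIO(raw)).read()
    except (IOError, struct.error, zlib.error):
        pass
    # some servers send a bare deflate stream even though they advertise
    # gzip; a negative wbits value tells zlib to skip the header check
    try:
        return zlib.decompress(raw, -zlib.MAX_WBITS)
    except zlib.error:
        # truly unrecoverable: raise instead of swallowing the error
        raise ValueError('response body is neither valid gzip nor deflate')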
technorati python wordpress apache gzip cluetrain blogging splog spam
( Apr 21 2007, 02:15:01 PM PDT ) Permalink

I'm currently about 60 feet under Market Street in downtown San Francisco, inside a BART station. But I'm connected to the wifi_rail network with 5 bars. I haven't fired up any YouTube streams yet but for IM, twitter updates and ...blogging, this is groovy; I'll take it!
I haven't seen any official announcements about BART's wifi system but as a serendipitous user, I hope it's here to stay. In fact, I hope it's extended to cover the track between stations, the transbay tube and the east bay stations as well! Maybe I'm being a little over-appreciative (greedy).
bart wifi wireless commuting twitter blogging san francisco east bay
( Apr 20 2007, 07:57:26 PM PDT ) Permalink

Could the extra-inning push last night, kicked off by Barry Bonds' tying slash homer in the 8th, be the harbinger of baseball to come? I'm quite impressed with how Armando Benitez and Jonathan Sanchez held back the Cardinals long enough for the 12th inning surge from Randy Winn, Omar Vizquel and Rich Aurilia. We're seeing real solid playing from those guys and Ray Durham. The pitching rotation is solid, and the losses that Matt Cain has suffered... are really an injustice. The guy's pitched fantastic; if we see the run support turn on, he'll be putting up the W's. I expect Barry Zito's shutout the other day to be the first of many. Noah Lowry, Matt Morris and Russ Ortiz get props too; those guys and much of the roster are pretty damned solid.
Today's 6-2 romp over the Cards has me thinking that the Giants won't be spending too much more time down there at the bottom of the division. I think the offensive slump from the season's start can be declared officially over. What remains to be seen is whether they can sustain this kind of solid play day in and day out. I have faith they will! Let's Go Giants!
Now if only the temperatures felt like baseball weather; it's cold!
san francisco giants giants baseball barry bonds barry zito rich aurilia omar vizquel ray durham matt cain
( Apr 19 2007, 06:57:28 PM PDT ) Permalink

I was recently stymied by an encoding error (the exception thrown was kicked off by UnicodeError) on a web page that was detected as utf-8; the W3 Validator said it was utf-8, but in all my efforts to get it parsed with classes derived from python's SGMLParser, it consistently bombed out. I tried chardet:
>>> import chardet
>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()
>>> chardet.detect(urlread(theurl))
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

...and yet the parser insisted that it had hit the "'ascii' codec can't decode byte XXXX in position YYYY: ordinal not in range(128)" error. WTF?!
On a hunch, I decided to try forcing it to be treated as utf-16 and then coercing it back to utf-8, like this:
parser.feed(pagedata.encode("utf-16", "replace").encode("utf-8"))

That worked!
I hate it when I follow an intuited hunch, it pans out, and yet I don't have any explanation as to why. I just don't know the details of python's character encoding behaviors well enough to debug this further; most of my work is in those Curly Bracket languages :)
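If I had to guess at the more orthodox fix, it would be to decode the raw bytes into a unicode object up front, using chardet's guess, before handing anything to the parser. This is only a sketch; theurl and the bare SGMLParser stand in for my actual code:

import urllib
import chardet
from sgmllib import SGMLParser

pagedata = urllib.urlopen(theurl).read()
guess = chardet.detect(pagedata)
# decode with the detected encoding so the parser gets a unicode
# object instead of a byte string it'll try to decode as ascii
pagetext = pagedata.decode(guess['encoding'] or 'utf-8', 'replace')

parser = SGMLParser()
parser.feed(pagetext)
parser.close()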
If any python experts are having any "OMG don't do that, here's why..." reactions, please let me know!
python utf8 character sets character encoding chardet sgmlparser
( Apr 16 2007, 11:28:31 AM PDT ) Permalink

The underground metal scene of years gone by had a reunion on a Thursday night at the Bottom of the Hill. The circumstance that summoned this event into being was a sad one, the tragic passing of Curtis Grant. Proceeds from the show went to Curtis' family. Amazingly, while so many of us have gone very separate ways, word still managed to get around. How surreal it was to see friends, roommates, ex-bandmates, drinking buddies, partners in crime and everybody else (some of these guys I know from 7th grade) who emerged from the woodwork into the dimly lit, loud, tweaky-PA-system-and-drinks ambiance of Bottom of the Hill. Stranger still was running into people and, after so many years, not remembering their names or exactly who they were. But that didn't really matter, for on Some Enchanted Evening all shall gather whatever memories they still carry from decades gone by, re-introduce themselves and celebrate.
The evening's kick-off with Mercenary and Mordred got things off to a bombastic start. American Heartbreak came out after them with a great set; they rocked! The Steve Scate's Mordred formation was awesome, so heavy! 20 years ago, I'd have never imagined that the cavemen-in-the-iceberg could thaw out and turn it on, but that's what it seemed like -- Ruthie's Inn 1985... ZAP! Bottom of the Hill 2007. Frozen in time... Sven still looks the same! Why isn't he all salt-n-peppa gray like me? Sven, where's your goddamned fountain o' youth? Maybe I'm just working too hard. I should find out what brand of vitamins he's been taking. And the years have been kind to Ron Quintana too; look at him here mugging it up with me. Metal Mania! Photo courtesy of umlaut, thanks!
At the top of the bill was Anvil Chorus. Like the private Anvil Chorus reunion I wrote about a few years ago, this reminder of what could have been, what should have been, a break-out act 25 years ago was mind blowing. Good coverage has already been rendered by umlaut, so it would be duplicative to go into the set they did in detail. Suffice to say, they are a superbly talented bunch and it was fantastic to see them perform! Thaen, Joe, Aaron, Doug and (whoever you were playing keyboards) - thanks!
Here's a lil "Blondes in Black" to get you in the moment:
But wait! There's more!
Kudos to Eric Lannon for getting this together for Curtis' family. Good luck to Thaen, on his way to Tokyo to tour with Vicious Rumours!
So I think I've had my fill of nostalgia for now, but I understand the 25th anniversary of the show I founded on KUSF 90.3 FM, Rampage Radio, is coming up in a few weeks. So maybe I'll see you there, and if you want to hear some Black Sabbath or Mercyful Fate at 7am, I might just be there to dish it out for old times' sake!
anvil chorus mordred american heartbreak mercenary san francisco kusf rampage radio youtube heavy metal bottom of the hill
( Apr 15 2007, 01:40:33 AM PDT ) Permalink

After a few weeks of sleepwalking to the batter's box, it's invigorating to see the San Francisco Giants bring on the show of force in last night's win in Pittsburgh. With Barry Bonds hitting a pair of HR's (no. 736 and 737!) and Russ Ortiz reeling in the strikeouts, we're finally seeing the team that I was imagining going into opening day: lotsa long ball for the innings and tight pitching for the outings.
It's really irritating that Major League Baseball's websites are so... last century. Where are the blogs, widgets, microformats and feeds? Just for giggles, I took my Technorati Favorites and plugged them into a Giants page. I also wrote a little script that puts the next Giants game in the header. An hCalendar implementation shouldn't be too hard. It's really lame that MLB doesn't just put that on their web sites; CSV files and instructions for importing into Outlook are just so silly. This morning, it is raining like a mofo in the Bay Area, hmm... maybe a fine day to write a CSV to hCalendar converter.
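If I get around to it, the converter might look something like this sketch; the CSV column names are guesses at what MLB's download provides, so adjust to taste:

import csv

# hypothetical column names; MLB's actual CSV headers may differ
VEVENT = ('<div class="vevent">'
          '<abbr class="dtstart" title="%(start)s">%(start)s</abbr> '
          '<span class="summary">%(summary)s</span> at '
          '<span class="location">%(location)s</span>'
          '</div>')

def csv_to_hcalendar(path):
    # emit one hCalendar vevent per row of the schedule CSV
    events = []
    for row in csv.DictReader(open(path)):
        events.append(VEVENT % {
            'start': '%s %s' % (row['START DATE'], row['START TIME']),
            'summary': row['SUBJECT'],
            'location': row['LOCATION'],
        })
    return '\n'.join(events)

print csv_to_hcalendar('giants_schedule.csv')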
technorati san francisco giants mlb major league baseball hcalendar microformats
( Apr 14 2007, 12:01:09 PM PDT ) Permalink

The Bush administration and their friends run the gamut from "that's fishy" (WMD's? In Iraq?) to "that's wrong" (poor judgement of intelligence) to "more corrupt than any Presidential regime in history" -- there's no salvaging this presidency; it's an unmitigated train wreck. The judgement is not just regarding the subterfuge of warring on Iraq premised on phoney Al-Qaeda links, the misinformation around the "we're winning in Iraq" meme, nor the recently illuminated goose-stepping in the Justice Department. George Bush, with the aid of Dick Cheney and Karl Rove, will certainly be judged by future retrospectives as the worst president in American history. Let's pile on the recently divulged shenanigans of Paul Wolfowitz, who was one of the primary architects of the President's Big Lies of Foreign Policy. Wolfowitz has been perking up his personal dalliances on the taxpayer's dime! The details emerging in the news this week (see Wolfowitz Apologizes For 'Mistake' - At World Bank, Boos Over Pay for Girlfriend) underscore what a buncha corrupt and loathsome creeps these hypocritical neo-con bozos are.
Impeachment proceedings? Criminal prosecution? Nuremberg trials? I'm not sure where it should stop but there clearly is much to be done. Throw out the bums!
wolfowitz bush cheney rove world bank
( Apr 13 2007, 08:50:47 AM PDT ) Permalink