The levers and dials of character set encoding can be overwhelming; just looking at the matrix supported by J2SE 1.4.2 gives me vertigo. Java's encoding conversion support is simple enough, if a bit garrulous:
String iso88591String = request.getParameter("q");
String utf8String = new String(iso88591String.getBytes("ISO-8859-1"), "UTF-8");
But what do you do if you don't know what encoding you're dealing with to begin with? It looks as though there are a couple of ways to do it:
String q_unknown_japanese = request.getParameter("q");
String q_unicode = new String(q_unknown_japanese.getBytes("ISO8859_1"), "JISAutoDetect");
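To make the re-decoding trick above concrete, here is a minimal, self-contained sketch that runs without a servlet container. It assumes a modern JDK (java.nio.charset.StandardCharsets arrived well after 1.4.2), and the class and method names are made up for illustration. It simulates a container that decoded UTF-8 bytes as ISO-8859-1, then recovers the intended text by going back to the raw bytes.

```java
import java.nio.charset.StandardCharsets;

public class ReDecodeExample {
    // Recover text whose UTF-8 bytes were mistakenly decoded as ISO-8859-1.
    // ISO-8859-1 maps every byte to exactly one char, so getBytes() with that
    // charset gives back the original bytes unchanged.
    static String redecode(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        // Simulate the bad decode: UTF-8 bytes read as ISO-8859-1.
        String misdecoded = new String(
                original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(misdecoded.length());                   // 5
        System.out.println(redecode(misdecoded).equals(original)); // true
    }
}
```

This only works because ISO-8859-1 is lossless for arbitrary bytes; if the first, wrong decode had used a charset that replaces unmappable bytes, the original bytes would be gone.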
Let's call the CGI specification what it is: a burned-out and anemic teenager. While it seems kinda cool that Apache 2.2 is going to get mod_proxy_fcgi, I've long wondered about using AJP13 to interface with web application runtimes other than servlet containers.
Brian McCallister did a kick butt cut-to-the-chase preso on Ruby on Rails at ApacheCon in San Diego. I can imagine why he's gung-ho to get FastCGI support up to date; it seems to be the way to run RoR. But since learning that AJP13 was going to be (and now is) built in to Apache 2.2's mod_proxy framework, I've been thinking how much nicer it'd be for other application frameworks to also be able to run outside the HTTP request handling process/thread.
We have some services that run under mod_perl that I've been taking second (and third) looks at. Wouldn't it be nice to deploy that application independent of the HTTP server runtime as one can with a Java webapp? Essentially, when it's boiled down to bare metal, perhaps that's all FastCGI is but it, it... it's CGI! Isn't it just setting/getting global environment variables? STDIN/STDOUT/STDERR? Isn't that so, well, 1994? Maybe I need to think about it some more but that was my takeaway last time I built anything with FastCGI (admittedly, in the 1990s).
I found what looks like AJP13 protocol support for Perl. Even though I don't read Japanese I'll infer from the context that he was/is interested in the same thing. Though whenever I see "use threads" in Perl, I fear the worst. Anyway, the likelihood of me finding myself with the time on my hands to implement AJP13 in Ruby is low; first, I still need to learn Ruby enough to get crafty.
rubyonrails ruby java apache cgi fastcgi ajp13 perl mod_perl
( Jan 07 2006, 01:20:50 PM PST ) Permalink
As I expected after first reading reports of Microsoft's policies last summer, MSN has (as reported by msnbc.com) censored a Chinese blog at Beijing's request.
IMO, it behooves the Chinese-speaking blogosphere outside of China to vigorously discuss this. Beijing will have to adapt or retreat into isolation; they (and the world) can't afford the latter.
microsoft msn china censorship
( Jan 07 2006, 08:49:20 AM PST ) Permalink
No, not a typo. OSDL is something else. I'm interested in OSLD. I've used Language::Guess to detect languages in arbitrary text with Perl; it works pretty well. But how are folks solving the problem in Java?
It looks like Oracle has language detection as part of their "Globalization Development Kit" ... but what about open source? Sadly, the Nutch Language Identifier Plugin only supports European languages, no CJK. What are the other options?
opensource open source i18n language java perl nutch oracle
( Jan 06 2006, 02:22:54 PM PST ) Permalink
I ran a test to prove to myself that for simple XML documents, the best way to parse them may be to skip capital P parsing altogether and just use a plain-old regular expression pattern match.
The XML format I wanted to test is the response from the Technorati /bloginfo API. I threw together a Perl based benchmark quickly enough and here are the results:
Benchmark: timing 10000 iterations of regexp, xpath...
    regexp: 0 wallclock secs ( 0.13 usr + 0.00 sys = 0.13 CPU) @ 76923.08/s (n=10000)
            (warning: too few iterations for a reliable count)
     xpath: 137 wallclock secs (136.17 usr + 0.04 sys = 136.21 CPU) @ 73.42/s (n=10000)
... the regexp parse was three orders of magnitude faster than the XPath parse. I'm curious now what the comparison would be for Java's regexp support versus, say, Jaxen and JDOM (which is how I usually do XPath in Java). In my dabblings with timings, Java regexps are very fast. Apparently, Tim Bray found this as well.
Here's the Perl code:
#!/usr/bin/perl
use XML::XPath;
use XML::XPath::XMLParser;
use XML::Parser;
use Benchmark qw(:all);

my $X = new XML::Parser(ParseParamEnt => 0); # non-validating parsing, please

timethese(10000, {
    'xpath'  => \&xpath,
    'regexp' => \&regexp
});

sub xpath {
    my $b = getBlog();
    my $parser = XML::XPath::XMLParser->new(parser => $X);
    my $root_node = $parser->parse($b);
    my $xp = XML::XPath->new(context => $root_node);
    my $nodeset = $xp->find('/tapi/document/result/weblog/author');
    die if ! defined($nodeset);
}

sub regexp {
    my $b = getBlog();
    my ($author) = $b =~ m{<author>(.*)</author>}sm;
    die if ! defined($author);
}

sub getBlog {
    return q{<?xml version="1.0" encoding="utf-8"?>
<!-- generator="Technorati API version 1.0 /bloginfo" -->
<!DOCTYPE tapi PUBLIC "-//Technorati, Inc.//DTD TAPI 0.02//EN" "http://api.technorati.com/dtd/tapi-002.xml">
<tapi version="1.0">
<document>
<result>
  <url>http://www.arachna.com/roller/page/spidaman</url>
  <weblog>
    <name>What's That Noise?! [Ian Kallen's Weblog]</name>
    <url>http://www.arachna.com/roller/page/spidaman</url>
    <rssurl>http://www.arachna.com/roller/rss/spidaman</rssurl>
    <atomurl></atomurl>
    <inboundblogs>6</inboundblogs>
    <inboundlinks>8</inboundlinks>
    <lastupdate>2006-01-02 18:38:03</lastupdate>
    <lastupdate-unixtime>1136255883</lastupdate-unixtime>
    <created>2004-02-23 12:04:51</created>
    <created-unixtime>1077566691</created-unixtime>
    <rank>false</rank>
    <lat>0.0</lat>
    <lon>0.0</lon>
    <lang>26110</lang>
    <author>
      <username>spidaman</username>
      <firstname>Ian</firstname>
      <lastname>Kallen</lastname>
      <thumbnailpicture>http://static.technorati.com/progimages/photo.jpg?uid=11648</thumbnailpicture>
    </author>
  </weblog>
  <inboundblogs>6</inboundblogs>
  <inboundlinks>8</inboundlinks>
</result>
</document>
</tapi>
};
}
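For the Java side of that curiosity, here is a rough sketch of the same two approaches, offered as a starting point rather than a benchmark. It makes a few assumptions: it uses the JDK's built-in javax.xml.xpath instead of Jaxen and JDOM, it pulls out the leaf username element (so both approaches return the same string) rather than the whole author block, and the sample document is trimmed and has no DOCTYPE so nothing goes fetching a DTD. The class and method names are made up.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class AuthorExtract {
    // Compile once; DOTALL lets '.' span newlines, like Perl's /s modifier.
    private static final Pattern USERNAME =
            Pattern.compile("<username>(.*?)</username>", Pattern.DOTALL);

    // Plain pattern match, no XML parsing at all.
    static String byRegex(String xml) {
        Matcher m = USERNAME.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    // Full parse plus XPath evaluation using the JDK's own XPath support.
    static String byXPath(String xml) {
        XPath xp = XPathFactory.newInstance().newXPath();
        try {
            return xp.evaluate("/tapi/document/result/weblog/author/username",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tapi version=\"1.0\"><document><result><weblog>"
                + "<author><username>spidaman</username></author>"
                + "</weblog></result></document></tapi>";
        System.out.println(byRegex(xml)); // spidaman
        System.out.println(byXPath(xml)); // spidaman
    }
}
```

Wrapping each method in a timing loop would give a Java analogue of the Perl benchmark; I haven't measured it, so no numbers here.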
For some of the messaging infrastructure at Technorati where the messages are real simple name/value constructs, we've been passing on using XML at all. Using a designated-character-delimited format string (say, tabs) that can be rapidly transformed into a java.util.Map (or a Perl hash, a Python dictionary, yadda yadda yea) and passing messages that way buys a lot of cheap mileage. We like cheap mileage.
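As an illustration of the idea (not the actual Technorati message format, which isn't described here), this sketch assumes a hypothetical layout of alternating name and value fields separated by tabs, and turns such a line into a java.util.Map and back. The class name and field names are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TabMessage {
    // Parse "name\tvalue\tname\tvalue..." into an insertion-ordered map.
    // Assumed format: alternating names and values, no tabs inside either.
    static Map<String, String> parse(String line) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty values
        if (fields.length % 2 != 0) {
            throw new IllegalArgumentException("odd number of fields: " + fields.length);
        }
        Map<String, String> message = new LinkedHashMap<>();
        for (int i = 0; i < fields.length; i += 2) {
            message.put(fields[i], fields[i + 1]);
        }
        return message;
    }

    // Serialize a map back into the same tab-delimited form.
    static String format(Map<String, String> message) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : message.entrySet()) {
            if (sb.length() > 0) sb.append('\t');
            sb.append(e.getKey()).append('\t').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = "url\thttp://example.com/\tlang\ten";
        Map<String, String> message = parse(line);
        System.out.println(message.get("lang"));          // en
        System.out.println(format(message).equals(line)); // true
    }
}
```

The trade-off is that values can't contain the delimiter unless you add an escaping rule, which is exactly the kind of thing XML handles for you and this format doesn't.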
xpath regexp perl java messaging technorati
( Jan 05 2006, 11:26:28 AM PST ) Permalink
Now that I'm messing around with a Roller implementation from within the last 7 months (migrated from Roller 0.98 to 1.1), I'm going to work on closing the gap to 2.0. Migrating all of my apps from an old (3.x) version of MySQL to 4.1.x wasn't too bad. But it appears that somewhere along the way to Roller 2.0, somewhere in the MySQL upgrade cycle perhaps, the post <-> category mappings got mangled, and that was resulting in NPEs when the system tried to fetch the categories.
In the meantime, I implemented embedding cosmos links in my posts by patching WEB-INF/classes/weblog.vm
(from the 1.1.2 release):
479,486c479
< #end
<
< #macro( showCosmosLink $entry )
< <a href="http://technorati.com/search/$absBaseURL/page/$userName/#formatDate($plainFormat $entry.PubTime )"><img
< src="http://static.technorati.com/pix/icn-talkbubble.gif"
< border="0"
< title="Links to this Post" /></a>
< #end
---
> #end
In the velocity template, I just added:
#foreach( $entry in $entries )
    <a name="$utilities.encode($entry.anchor)" id="$utilities.encode($entry.anchor)"></a>
    <b>$entry.title</b>
    #showEntryText($entry)
    <span class="dateStamp">(#showTimestamp($entry.pubTime))</span>
    #showEntryPermalink( $entry )
    #showCosmosLink( $entry )
    #showCommentsPageLink( $entry )
    <br/>
    <br/>
#end
I think the POJOs and macros are different in 2.0 but I'll post a cosmos link update when I get there.
technorati roller velocity mysql
( Jan 04 2006, 07:29:26 AM PST ) Permalink
This blog had a nice long vacation but it is now occupied, again. No, I wasn't in Borneo. I wasn't kidnapped by aliens (you never can be sure though, can you?). Nor was I in the hospital. I just found myself wanting to fix my blogging platform but always too busy to do it. So I just didn't blog at all (except for on my super secret alter-ego blogs). While my efforts at going from 0.98 to 2.0.x of Roller never seemed to work out, I did get it to a 1.1 release (hey, take a little progress if you can't get it all). Most of all, I ditched my old template and stylesheet; they were pretty long in the tooth... (I think) this seems a lot cleaner.
A lot has happened with Technorati, the blogosphere, my deep dives into various technologies and other stuff. And there's more to come. And it's a new year. And speaking of which, it's that time again.
So here are my New Year's Resolutions:
Happy 2006!
( Jan 01 2006, 10:33:29 PM PST ) Permalink