This post to one of the Hadoop mailing lists caught my eye: Announcing CloudBase-1.1 release. Wait, wasn't Cloudbase the embedded database that IBM acquired several years back and ended up donating to the Apache Software Foundation as Derby? No, that was Cloudscape. This is apparently another project altogether, one that aims to provide data warehousing on top of Hadoop.
I've been watching the emergence of HBase, Hypertable and, most recently, the proposed incubation of Facebook's Cassandra with great interest. The first two are modeled on Google's BigTable, but all three are essentially horizontally scalable, column-oriented databases. The developers of these systems explicitly steer away from having their technologies pegged as relational databases, with the refrain: "We don't do joins." The CloudBase project, by contrast, doesn't model itself on BigTable; it aims to explicitly support joins between tables built on top of an HDFS cluster. It looks like they've posted extensive documentation and have released a JDBC driver, pretty cool! This is the most interesting database initiative I've seen since Greenplum announced their support for MapReduce.
Yes, as far as scale-out data analytics, we live in interesting times.
mapreduce hadoop hbase hypertable jdbc cloudbase bigtable derby greenplum
( Dec 23 2008, 04:02:21 PM PST ) Permalink

The new Technorati link count widget provides a way for bloggers to display how many links a blog post gets. Doing it in Roller is easy: add the velocity macro below to WEB-INF/classes/weblog.vm and then call that macro from the default blog page template (Weblog).
#macro( showCosmosLink $entry )
  <script src="http://embed.technorati.com/linkcount" type="text/javascript"></script>
  <a href="http://technorati.com/search/$absBaseURL/page/$userName/#formatDate($plainFormat $entry.PubTime )" rel="linkcount">View blog reactions</a>
#end
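With the macro in place, displaying the link count is then a one-line call from the page template, wherever $entry is in scope (exact placement in the Weblog template is up to you):

#showCosmosLink( $entry )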
Tantek has more details about the release on the Technorati Blog.
( Nov 17 2006, 08:31:30 PM PST ) Permalink

Since Niall mentioned GData, I've been meaning to look into it further. Today Otis mentioned that one of the proposed Apache Summer of Code projects is a Lucene-based GData server implementation.
I took a look at the docs and realized that this is actually a really old spec, as old as the epoch as a matter of fact. Check it out:
But seriously folks, the G-man and his crew have done a fine job providing client implementations (as long as you're not waiting on Ruby or one of the P-languages; no Perl, Python or PHP yet). There's even a nice set of examples for the Java implementation. Thanks, G!
google gdata lucene summerofcode ruby perl python php
( Apr 25 2006, 08:17:45 PM PDT ) Permalink

A few months ago, I mused that we should be able to abandon FastCGI (with extreme prejudice) and use AJP13 with Ruby on Rails instead. Well, unbeknownst to me at the time, someone was hatching just such a plot: the Ruby/AJP Project! I'd heard last month that David Andersen was tinkering on installing it... well, he not only got it online but he blogged how he did it. Take a look at his compile time and run time configuration details using Apache 2.2's native AJP13 protocol plugin for mod_proxy (i.e. no mod_jk, good riddance), it's really cool! Way to go, David!
rubyonrails ruby ajp13 apache mod_proxy mod_jk
( Apr 24 2006, 08:48:30 PM PDT ) Permalink

The Java backlash that began a few years ago was mostly a J2EE backlash, not a backlash against the Java language per se. Too many people took the blueprints too seriously, too literally or just too damned religiously. Too many applications that didn't need EJBs were using them; letting the container manage low-level application plumbing invited slow and buggy behaviors that were painful to debug. The backlash has made a lot of Perl/Python/PHP enthusiasts express self-righteous vindication and has helped morph the J2EE backlash into a broader Java backlash. Geez, even IBM is getting all spun up on PHP, whodathunk? But I think the dismissal of Java is premature. None of the P languages, nor Java, is without hazards. These days a lot of developers are over the blueprint kool-aid and are standardizing on a simplified and productive stack:
To really bring rapid development and prototyping to a Java environment, there are a lot of options to look at, such as dynamic JVM languages:
I expect in the months ahead to be writing applications with plugin support, and I suspect that the big win of the dynamic JVM languages, for me, will be in easing the rapid development of plugins. In other words, I probably wouldn't write an end-to-end application with them, but given a set of interfaces for extension points that can be automatically tested, writing the extensions in JRuby or Groovy sounds compelling; a sketch of what I mean follows.
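To make that concrete, here's the kind of extension-point contract I have in mind -- a hypothetical interface of my own invention, not from any particular project. The host application codes and tests against the interface; the implementation could come from Java, JRuby or Groovy alike:

public interface EntryFilter {
    /**
     * Extension point: decide whether a feed entry should be kept.
     * Implementations may be written in any JVM language.
     */
    boolean accept(String entryUrl);
}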
I actually haven't had time and opportunity to substantially try half the things I've mentioned thus far. Surveying the number of tools, languages and frameworks, it's clear that there are a lot of things to consider and that a lot of people are concerned with (and working hard on) bringing down the high ceremony of Java. I'll still be using P languages in the future, too. Down the road, I suspect virtual machines (JVM? Parrot? Mono/CLR?) will make a lot of these issues fade away, and the questions at hand will be about when to use closures and when to use objects, when to annotate and when to externally declare, when to explicitly type or auto-type, and so forth. The languages will be incidental as they support shared constructs and virtual machines.
java rubyonrails groovy jruby jython maven eclipse xdoclet springframework programming
( Apr 21 2006, 09:03:42 PM PDT ) Permalink

I'd never dug into where velocity's annoying messages were coming from, but I decided enough is enough already. These tiresome messages from velocity were showing up on every page load:
2006-02-22 12:08:02 StandardContext[/webapp] Velocity [info] ResourceManager : found /path/to/resource.vm with loader org.apache.velocity.tools.view.servlet.WebappLoader

Such messages might be good for debugging your setup, but once you're up and running, they're just obnoxious. They definitely weren't coming from the log4j.properties in the webapp. So I took a look at velocity's defaults. The logging properties that velocity ships with in velocity.properties concern display of stacktraces, but the constant chatter in Tomcat's logs wasn't covered there either. So I unwrapped the velocity source and found it in org.apache.velocity.runtime.RuntimeConstants -- all I had to do was add this to velocity.properties and there was peace:

resource.manager.logwhenfound = false

Ah, much better!

They shoulda named that property resource.manager.cmon.feel.the.noise, seriously.
I did a double take on this:
HashSet set = new HashSet();
set.add(new URL("http://postsecret.blogspot.com"));
set.add(new URL("http://dorion.blogspot.com"));
for (Iterator it = set.iterator(); it.hasNext();) {
    System.out.println(it.next());
}

I was expecting to get output like
http://postsecret.blogspot.com
http://dorion.blogspot.com

or
http://dorion.blogspot.com
http://postsecret.blogspot.com

But all that I got was
http://postsecret.blogspot.com
Hmmm....
The java.net.URL javadoc says what I'd expect: "Creates an integer suitable for hash table indexing." So I tried this:
URL url1 = new URL("http://postsecret.blogspot.com");
URL url2 = new URL("http://dorion.blogspot.com");
System.out.println(url1.hashCode() + " " + url1);
System.out.println(url2.hashCode() + " " + url2);

and got this
1117198397 http://postsecret.blogspot.com
1117198397 http://dorion.blogspot.com

I was expecting different hashCodes. Either java.net.URL is busted or I'm blowing it and my understanding of the contract with java.lang.Object and its hashCode() method is busted. (My best guess, after the fact: URL.equals() and hashCode() resolve host names, so two URLs whose hosts resolve to the same IP address hash identically; both of those blogspot hosts point at the same servers. java.net.URI, which hashes on the string form, sidesteps this.)

( Feb 13 2006, 07:37:29 PM PST ) Permalink
One of the really wonderful and evil things about Perl is the tie interface: you get a persistent hash without writing a boatload of code. With Sleepycat's BerkeleyDB Java Edition you can do something very similar.
Here's a quick re-cap: I've mentioned fiddling with BerkeleyDB-JE before with a crude "hello world" app. You can use the native code version from Perl with obscene simplicity, too. In years past, I enjoyed excellent performance with older versions of BerkeleyDB through a class called "DB_File" -- today, the thing to use is the "BerkeleyDB" library off of CPAN (note, you need db 4.x or later for this to work). Here's a sample that writes to a BDB:
#!/usr/bin/perl

use BerkeleyDB;
use Time::HiRes qw(gettimeofday);
use strict;

my $filename = '/var/tmp/bdbtest';
my %hash = ();
tie(%hash, 'BerkeleyDB::Hash', { -Filename => $filename, -Flags => DB_CREATE });
$hash{'539'} = "d\t" . join('',@{[gettimeofday]}) . "\tu\thttp://www.sifry.com/alerts";
$hash{'540'} = "d\t" . join('',@{[gettimeofday]}) . "\tu\thttp://epeus.blogspot.com";
$hash{'541'} = "d\t" . join('',@{[gettimeofday]}) . "\tu\thttp://joi.ito.com";
untie(%hash);

Yes, I'm intentionally using plain old strings, not Storable, FreezeThaw or any of that stuff.
And here's one that reads it back:

#!/usr/bin/perl

use BerkeleyDB;
use strict;

my $filename = '/var/tmp/bdbtest';
my %hash = ();
tie(%hash, 'BerkeleyDB::Hash', { -Filename => $filename, -Flags => DB_RDONLY });
for my $bid (keys %hash) {
    my %blog = split(/\t/, $hash{$bid});
    print "$bid:\n";
    while (my ($k, $v) = each(%blog)) {
        print "\t$k => $v\n";
    }
}
untie(%hash);

Which would render output like this:
541:
    u => http://joi.ito.com
    d => 1139388034903283
539:
    u => http://www.sifry.com/alerts
    d => 1139388034902888
540:
    u => http://epeus.blogspot.com
    d => 1139388034903227
Java has no tie operator (that's probably a good thing). But Sleepycat has incorporated a collections framework that's pretty cool and gets you pretty close to tied-hash functionality. Note, however, that it's not entirely compatible with the interfaces in the Java Collections Framework; still, if you know those APIs, the Sleepycat ones will feel immediately familiar.
com.sleepycat.collections.StoredMap implements java.util.Map with the following caveat: for java.util.Iterators that have been working on a StoredMap, you have to use com.sleepycat.collections.StoredIterator's close(Iterator) method to tidy up.

So what does the code look like? Well, let's say you wanted to store a bunch of these vanilla beans in the database:
public final class ImmutableBlog implements Serializable {

    private static final long serialVersionUID = -7882532723565612191L;

    private long lastmodified;
    private String url;
    private int id;

    public ImmutableBlog(final int id, final long lastmodified, final String url) {
        this.id = id;
        this.lastmodified = lastmodified;
        this.url = url;
    }

    public int getId() { return id; }
    public long getLastmodified() { return lastmodified; }
    public String getUrl() { return url; }

    public boolean equals(Object o) {
        if (!(o instanceof ImmutableBlog)) return false;
        if (o == this) return true;
        ImmutableBlog other = (ImmutableBlog) o;
        return other.getId() == this.getId()
            && other.getLastmodified() == this.getLastmodified()
            && other.getUrl().equals(this.getUrl());
    }

    public int hashCode() {
        return (int) (id * 51 + url.hashCode() * 17 + lastmodified * 29);
    }

    public String toString() {
        StringBuffer sb = new StringBuffer(this.getClass().getName());
        sb.append("[id=").append(id)
          .append(",lastmodified=").append(lastmodified)
          .append(",url=").append(url)
          .append("]");
        return sb.toString();
    }
}

Note that it implements java.io.Serializable.
public class StoredBlogMap {

    private StoredMap blogMap;

    public StoredBlogMap() throws Exception {
        init();
    }

    protected void init() throws Exception {
        // where the BDB environment lives on disk
        File dir = new File(System.getProperty("java.io.tmpdir")
                + File.separator + "StoredBlogMap");
        dir.mkdirs();

        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(dir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        // one database for the entries, one for the serialized class metadata
        Database blogsdb = env.openDatabase(null, "blogsdb", dbConfig);
        Database classdb = env.openDatabase(null, "classes", dbConfig);
        StoredClassCatalog catalog = new StoredClassCatalog(classdb);

        // keys are ints, values are serialized ImmutableBlogs; 'true' means writable
        blogMap = new StoredMap(blogsdb, new IntegerBinding(),
                new SerialBinding(catalog, ImmutableBlog.class), true);
    }

    public Map getBlogMap() {
        return blogMap;
    }
}

The majority of the code is just plumbing for setting up the underlying database and typing the keys and values. Here's a JUnit test that exercises it:
public class StoredBlogMapTest extends TestCase {

    private static Map testMap;
    static {
        testMap = new HashMap();
        testMap.put(new Integer(539), new ImmutableBlog(539,
                System.currentTimeMillis(), "http://www.sifry.com/alerts"));
        testMap.put(new Integer(540), new ImmutableBlog(540,
                System.currentTimeMillis(), "http://epeus.blogspot.com"));
        testMap.put(new Integer(541), new ImmutableBlog(541,
                System.currentTimeMillis(), "http://www.arachna.com/roller/page/spidaman"));
    }

    private StoredBlogMap blogMap;

    protected void setUp() throws Exception {
        super.setUp();
        blogMap = new StoredBlogMap();
    }

    public void testWriteBlogs() throws Exception {
        Map blogs = blogMap.getBlogMap();
        for (Iterator iter = testMap.entrySet().iterator(); iter.hasNext();) {
            Map.Entry ent = (Map.Entry) iter.next();
            blogs.put((Integer) ent.getKey(), (ImmutableBlog) ent.getValue());
        }
        int i = 0;
        for (Iterator iter = blogMap.getBlogMap().keySet().iterator(); iter.hasNext();) {
            iter.next();
            i++;
        }
        assertEquals(testMap.size(), i);
    }

    public void testReadBlogs() throws Exception {
        Map blogs = blogMap.getBlogMap();
        Iterator iter = blogs.entrySet().iterator();
        while (iter.hasNext()) {
            Map.Entry ent = (Map.Entry) iter.next();
            ImmutableBlog test = (ImmutableBlog) testMap.get(ent.getKey());
            ImmutableBlog stored = (ImmutableBlog) ent.getValue();
            assertEquals(test, stored);
        }
        // StoredMap iterators hold database resources; close them explicitly
        StoredIterator.close(iter);
    }

    public static void main(String[] args) {
        junit.textui.TestRunner.run(StoredBlogMapTest.class);
    }
}

These assertions all succeed, so assigning to and fetching from a persistent Map works! One of the notable things about the BDB library is that it will allocate generous portions of the heap if you let it. The upside is that you get very high performance from the BDB cache. The downside is... using up heap that other things want. This is tunable; in StoredBlogMap's init(), before the Environment is created, add this:
// cache size is the number of bytes to allow Sleepycat to nail up
envConfig.setCacheSize(cacheSize);
// ... now set up the Environment
The basic stuff here functions very well; however, I haven't run any production code that uses Sleepycat's Collections yet. My last project with BDB needed to run an asynchronous database entry remover, so I wanted to remove as much "padding" as possible.
( Feb 08 2006, 12:22:21 AM PST ) Permalink

Note to self: if you're getting OutOfMemoryErrors, bumping up the heap size may actually make the problem worse. Usually, OOM means you've exceeded the JVM's capacity... so you set -Xms and -Xmx to a higher stratum of memory allocation. Well, at least I thought that was the conventional wisdom. Having cranked it up to 1850M to open very large data structures, OOMs were still bringing down the house. OK, spread the work around in smaller chunks across multiple JVMs. But it still bombs out. It turns out that you have to be very particular about giving the JVM a lot of heap up front. This set of posts seems to peg it. I'd figured that nailing a big heap allocation was how I'd prevent OOMing. Looks like it's time for me to bone up on JVM tuning. I should probably dig into 64-bit Java while I'm at it.
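For my own notes: "up front" here means committing the full heap at startup, i.e. setting the initial and maximum sizes to the same value so the JVM never has to grow the heap mid-run. The flags are standard; the size and the class name are just from my experiment:

java -Xms1850m -Xmx1850m -verbose:gc BigDataStructureJob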
( Feb 01 2006, 11:57:15 PM PST ) Permalink

I finally succumbed to Apple's pleas to update Tiger on my powerbook to 10.4.4. Maven was dumping hotspot errors and Eclipse was misbehaving, so an update seemed in order. Well, when the system came up, my menu bar items (clock, battery status, wifi status, speaker volume, etc) were gone! The network settings were goofed up and I had this profound flash of regret that I hadn't done a backup before doing the update.
Thankfully, Mike Hoover and davidx (cohorts at Technorati) were on hand to assist and dig up the following factoid: the culprit was /System/Library/CoreServices/Search.bundle. I moved it out of the way with

sudo mv /System/Library/CoreServices/Search.bundle /var/tmp/

then rebooted.

apple powerbook java eclipse maven macosx technorati
( Jan 27 2006, 11:39:48 AM PST ) Permalink

A lightweight build system should be able to run a project's test harness quickly so that developers can validate their work and promptly move on to the next thing. Each test should, in theory, stand alone and not require the outcome of prior tests. But if testing the application requires setting up a lot of data to run against, the theoretical can run into a fundamental conflict with the practical. How does it go? "The difference between theory and practice is different in theory and in practice."
Recently I've been developing a caching subsystem that should support fixed-size LRUs, expiration and so forth. I'd rather re-use the data that I already have in the other tests' data set -- there are existing tests that exercise the data writer and reader classes. For my cache manager class, I started off the testing with a simple test case that creates a synthetic entity, a mock object, and validates that it can store and fetch as well as store and lazily expire the object (see the sketch below). Great, that was easy!
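Roughly the shape of that first test -- CacheManager, its constructor arguments and its put/get methods are hypothetical stand-ins for my class, which this post doesn't show:

import junit.framework.TestCase;

public class CacheManagerTest extends TestCase {

    public void testStoreFetchAndExpire() throws Exception {
        // hypothetical API: a bounded cache whose entries expire after a TTL
        CacheManager cache = new CacheManager(10 /* max entries */, 50 /* TTL millis */);

        Object mock = new Object(); // the synthetic entity
        cache.put("mock-key", mock);
        assertSame(mock, cache.get("mock-key")); // store and fetch

        Thread.sleep(100); // outlive the TTL
        assertNull(cache.get("mock-key")); // lazily expired on fetch
    }
}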
What about the case of putting a lot of objects in the cache and expiring the oldest entries? What about putting a lot of objects in the cache and fetching them while the expiration thread is concurrently removing expired entries? Testing the multi-threaded behavior is already a sufficient PITA; having to synthesize a legion of mock objects means more code to maintain. Elsewhere in the build system I have classes that the tests verify can access legions of objects -- why not use that? The best code is the code that you don't have to maintain.
<sigh />
I want to be agile, I want to reuse and maintain less code and I want the test harness to run quickly. Is that too much to ask?
My take on this is that agile methodologies are composed of a set of practices and principles that promote (among other things) flexible, confident and collaborative development. Working in a small startup, as I do at Technorati, all three are vital to our technical execution. I have a dogma about confidence:
Lately I've been favoring maven for build management (complete with all of its project lifecycle goodies). Maven gives me less build code to maintain (less build.xml stuff). However, one thing that's really jumped in my way is that in the project.xml file, there's only one way and one place to define how to run the tests. This is a problem that highlights one of the key tensions with TDD: from a purist standpoint, that's correct; there should be one test harness that runs each test case in isolation from the rest. But in my experience, projects usually have different levels of capability and integration that require a choice, either:
I ended up writing an ant test runner that maven invokes after the database is set up (sketched below). Each set of tests that transitions the data to a known state lays the groundwork for the next set of tests. Perhaps I'd feel differently about it if I had more success with DBUnit or had a mock-object generator that could materialize classes pre-populated with desired data states. In the meantime, my test harness runs three times faster and there's less build plumbing (which is code) to maintain than if I had adhered to the TDD dogma.
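For flavor, here's the kind of Ant target involved -- a minimal sketch, with the class names and classpath reference invented for illustration; the real runner has more wiring:

<target name="ordered-tests">
  <junit fork="true" haltonfailure="true">
    <classpath refid="test.classpath"/>
    <formatter type="plain" usefile="false"/>
    <!-- order matters: each suite leaves the database in the state the next expects -->
    <test name="com.example.DataWriterTest"/>
    <test name="com.example.DataReaderTest"/>
    <test name="com.example.CacheManagerTest"/>
  </junit>
</target>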
ant maven tdd refactoring unit testing agile java technorati
( Jan 26 2006, 06:58:22 PM PST ) Permalink

To scale the SQL query load on a database, it's a common practice to do writes to the master but query replication slaves for reads. If you're not sure what that's about and you have a pressing need to scale your MySQL query load, then stop what you're doing and buy Jeremy Zawodny's book High Performance MySQL.
If you've used MySQL replication and written application code that dispatches INSERTs, UPDATEs and DELETEs to the master while sending SELECTs to the slaves (except for transactional operations, where those have to go to the master), you know how it can add another wrinkle of complexity. Well, apparently there's been a little help in the MySQL JDBC driver for a while and I'm just learning of it now. The ReplicationConnection class in the MySQL Connector/J jar (as of v3.1.7) provides the dispatching pretty transparently. When the state of the readOnly flag is flipped on the ReplicationConnection, it changes the connection accordingly. It will even load balance across multiple slaves. Where a normal JDBC connection to a MySQL database might look like this
Class.forName("com.mysql.jdbc.Driver"); Connection conn = DriverManager. getConnection("jdbc:mysql://localhost/test", "scott", "tiger");You'd connect with ReplicationDriver this way
ReplicationDriver driver = new ReplicationDriver();
Properties props = new Properties(); // user, password and any other connection settings
Connection conn = driver.connect(
    "jdbc:mysql://master,slave1,slave2,slave3/test", props);

conn.setReadOnly(false);
// do stuff on the master

conn.setReadOnly(true);
// now do SELECTs on the slaves

and ReplicationDriver handles all of the magic of dispatching. The full deal is in the Connector/J docs; I was just pleased to finally find it!
I know of similar efforts in Perl, like DBD::Multiplex and Class::DBI::Replication, but I haven't had the time or opportunity to try them. Brad Fitzpatrick has discussed how LiveJournal handles connection management (there was a slide mentioning this at OSCON last August). LiveJournal definitely takes advantage of using MySQL as a distributed database, but I haven't dug into LJ's code looking for it either. In the ebb and flow of my use of Perl, it is definitely ebbing these days.
mysql database replication jdbc perl DBI
( Jan 17 2006, 11:42:09 PM PST ) Permalink

There is widespread frustration with standards that try to boil the ocean of software problems that are out there to solve. Tim Bray has sound advice:
If you're going to be designing a new XML language, first of all, consider not doing it.

In his discussion of Minimalism vs. Completeness he quotes Gall's Law:
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.

The tendency to inflate standards is similar to software development featuritis. I'm oft heard to utter the refrain, "Let's practice getting above the atmosphere before shooting for the moon." The scope of what is "complete" is likely to change 20% along the way towards getting there. The basic idea is to aim for sufficiency, not completeness; simplicity and extensibility are usually divergent. Part of the engineering art is to find as much of both as possible.
On the flip side, where completeness is an explicit upfront goal, there are internal tensions as well. Either building for as many of the anticipated needs as possible or a profound commitment to refactoring has to be reckoned with. The danger of only implementing the simplest thing without a commitment to refactoring is that expediency tends to lead people, particularly if they haven't solved that type of problem before, to do the easy but counter-productive thing: taking short cuts, cutting and pasting and hard coding random magic doodads. As long as there is a commitment to refactoring, software atrophy can be combatted. Reducing duplication, separating concerns and coding to interfaces enables software to grow without declining in comprehensibility. Throw in a little test-driven development and you've got a lot of the standard shtick for agility.
Even though there's a project at work that I've been working on mostly solo, it's built for agility. The build system is relatively minimal thanks to maven. The core APIs and service interfaces (which favor simplicity: REST) are unit tested and the whole thing is monitored under CruiseControl to keep it all honest. This actually saved us the other day when a collaborator needed additional data included in the API's return values. He did the simplest thing (good) but I promptly got an email from CruiseControl that the build was broken. I reviewed his check-in and refactored it by moving the code that had been put in-line in the method out to its own method. I wrote a test for the method that fetches the additional data, and then wrote one for the original method's responses including the additional data. The original method then acquired a flag to indicate whether the responses should be dressed up with this additional data; not all clients need it and it requires a round-trip to another data repository, so making it a parameter makes sense since the applications that don't need it are performance sensitive. Afterwards, the code enjoyed additional benefits in that the caching had granularity that matched the distribution of the data sources. Getting the next mail from CruiseControl saying it was happy with the build was very gratifying. I need to test-infect my colleagues so they learn to enjoy the same pavlovian response.
Anyway. I'm short on sleep and long on rambles this morning.
There are times when simple problems are mired in seemingly endless hand wringing and you have to stand up and shout JFDI. The Java software world, like RDF theorists and other parochial ivory tower clubs, seems to have a bad case of specificationitis. There are over 300 JSRs. Do we need all of those? On the other hand, great software is generally not created in the burst of a hackathon. There's no doubt that when a project has fallen into quicksand, getting all parties around a table to hash it out is an important way to clear the path. Rapid prototyping is often best accomplished in a focused push. I like prototyping to be used as a warm up exercise. If you can practice getting lift-off on a problem and you can attain high altitudes with some simple efforts, your likelihood of making it to the moon increases.
agile refactoring technorati maven unit testing
( Jan 10 2006, 07:59:45 AM PST ) Permalink

Looks like I'd better hasten my effort to upgrade to Roller 2.x. This (v1.1) installation hit an OutOfMemoryError a little while ago and crashed the JVM in all of its hotspot glory. I'm suspicious of the caching implementation in Roller (IIRC, it's OSCache). For a non-clustered installation, plain-old-filesystem caches JFW. For distributed caches, JFW applies to memcached. We've been using the Java clients (and Perl and Python ones) for memcached productively for a long time now. Interestingly, someone was inspired to write a Java port of the memcached server. Crazy! And I think to myself, what a wonderful world.
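For the curious, the Java client usage is about as simple as caching gets. A minimal sketch with the com.danga.MemCached client we've been using; the server address, key and class name are made up:

import com.danga.MemCached.MemCachedClient;
import com.danga.MemCached.SockIOPool;

public class MemcachedHello {
    public static void main(String[] args) {
        // the socket pool must be initialized before any client is used
        SockIOPool pool = SockIOPool.getInstance();
        pool.setServers(new String[] { "127.0.0.1:11211" });
        pool.initialize();

        MemCachedClient mc = new MemCachedClient();
        mc.set("greeting", "what a wonderful world");
        System.out.println(mc.get("greeting")); // round-trips through memcached
    }
}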
( Jan 09 2006, 10:20:03 PM PST ) Permalink

The levers and dials of character set encoding can be overwhelming; just looking at the matrix supported by J2SE 1.4.2 gives me vertigo. Java's encoding conversion support is simple enough, if not garrulous:
String iso88591String = request.getParameter("q");
String utf8String = new String(iso88591String.getBytes("ISO-8859-1"), "UTF-8");

But what do you do if you don't know what encoding you're dealing with to begin with? It looks as though there are a couple of ways to do it:
String q_unknown = request.getParameter("q");
String q_unicode = new String(q_unknown.getBytes("ISO8859_1"), "JISAutoDetect");
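JISAutoDetect only arbitrates among the Japanese encodings (ISO-2022-JP, EUC-JP, Shift_JIS), though. For the truly unknown case, a statistical detector is the other route I know of; here's a sketch using ICU4J's CharsetDetector -- my choice of library for illustration, and the class name is mine:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectCharset {
    public static void main(String[] args) throws Exception {
        // pretend these bytes arrived with no declared encoding
        byte[] raw = "bytes of unknown provenance, caf\u00e9 included".getBytes("UTF-8");

        CharsetDetector detector = new CharsetDetector();
        detector.setText(raw);
        CharsetMatch match = detector.detect(); // best guess plus a confidence score
        System.out.println(match.getName()
                + " (confidence " + match.getConfidence() + "): "
                + match.getString());
    }
}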