Friday, June 22, 2007

Sorting serialized objects from a YAML file, Ruby vs Java

I recently had a large dataset that I needed to sort, basically a bunch of objects with 9 string attributes. I dumped them to a YAML file so I could benchmark various aspects of sorting (Class#to_yaml really rocks, really). As it turns out Enumerable#sort_by was the more efficient way to go rather than Enumerable#sort (check it). The dataset contained 5429 unique objects, here's the benchmarking I did in irb:

>> bm do |x|
?>"all"){results.sort{|a,b|a.account_name <=> b.account_name}}
>> end
user system total real
all 0.070000 0.000000 0.070000 ( 0.073835)
>> bm do |x|
>> end
user system total real
all 0.020000 0.000000 0.020000 ( 0.020745)

Noticeable difference between the two methods, and quite pleasing to see how fast sort_by performed (Apple MBP 2.33GHZ 2GB RAM Ruby 1.85). Then the thought occurred to me, since I have this in a YAML file, I could write something really quick in Java and see how fast the sort would be by dumping it into a TreeSet.

So I set off to find out how I could marshal the data into POJOs from the YAML file. JYaml and JvYaml are the only (insofar as I looked) open source Java YAML libraries. Both seem half-baked in their own right, likely containing just the functionality the respective author needed and not much more (at least that's how it appeared). I ended up using JvYaml and had to search-replace the yaml entry identifier(",2002:map") so that JvYaml would create HashMap instances for me instead of its completely worthless PrivateType class.

From there I iterated through the maps to create the POJOs and shoved them into a HashSet. Once that was completed I created the comparator for my POJO, passed it into the TreeSet constructor and then timed an addAll giving it the entire HashSet:
#1: 55ms
#2: 28ms
#3: 27ms (pretty much constant thereafter)

Interesting results eh? The dataset I used was from a randomized generator and now that I have these initial numbers ( Ruby appears to have won this round), I want to test a larger set. And to be really fair I should do more research on benchmarking best practices (in addition to still needing to take my stats class).