Friday, June 22, 2007

Sorting serialized objects from a YAML file, Ruby vs Java

I recently had a large dataset that I needed to sort, basically a bunch of objects with 9 string attributes. I dumped them to a YAML file so I could benchmark various aspects of sorting (Class#to_yaml really rocks, really). As it turns out Enumerable#sort_by was the more efficient way to go rather than Enumerable#sort (check it). The dataset contained 5429 unique objects, here's the benchmarking I did in irb:

>> bm do |x|
?>"all"){results.sort{|a,b|a.account_name <=> b.account_name}}
>> end
user system total real
all 0.070000 0.000000 0.070000 ( 0.073835)
>> bm do |x|
>> end
user system total real
all 0.020000 0.000000 0.020000 ( 0.020745)

Noticeable difference between the two methods, and quite pleasing to see how fast sort_by performed (Apple MBP 2.33GHZ 2GB RAM Ruby 1.85). Then the thought occurred to me, since I have this in a YAML file, I could write something really quick in Java and see how fast the sort would be by dumping it into a TreeSet.

So I set off to find out how I could marshal the data into POJOs from the YAML file. JYaml and JvYaml are the only (insofar as I looked) open source Java YAML libraries. Both seem half-baked in their own right, likely containing just the functionality the respective author needed and not much more (at least that's how it appeared). I ended up using JvYaml and had to search-replace the yaml entry identifier(",2002:map") so that JvYaml would create HashMap instances for me instead of its completely worthless PrivateType class.

From there I iterated through the maps to create the POJOs and shoved them into a HashSet. Once that was completed I created the comparator for my POJO, passed it into the TreeSet constructor and then timed an addAll giving it the entire HashSet:
#1: 55ms
#2: 28ms
#3: 27ms (pretty much constant thereafter)

Interesting results eh? The dataset I used was from a randomized generator and now that I have these initial numbers ( Ruby appears to have won this round), I want to test a larger set. And to be really fair I should do more research on benchmarking best practices (in addition to still needing to take my stats class).


Mike Heath said...

Of course Ruby won. You're doing far more work in Java than you need to. What are you putting things in a HashSet and then a TreeSet? Put everything in an ArrayList and do java.util.Collections.sort(list, comparator).

I would like to see the results of that after the JIT has warmed up a bit.

RR said...

Good point. I put everything in a HashSet because I wanted to ensure ordering wasn't a factor (many entries in the file were already in order). Thus, it made sense going from an unosrted set straight to a sorted set. I forgot about Collections.sort so thanks for the reminder.

I refactored the Java test, and as expected, the results were very much more in its favor. I created an ArrayList by dumping the HashSet contents into it and then sorting it. 10 iterations produced an average of 12.6 ms sort time.

Still planning on doing this all over again with a dataset of at least 12000 elements to get a better idea of the performance spread (how much faster Java really would be).