Ruby Hash Defaults and Group By

I do a lot of processing of legacy data, converting it to new databases. Usually this data needs to be indexed by some value, and a hash is a perfect solution. The Hash class lets you specify defaults that can save many keystrokes and make your code easier to read. Let's take a look.

Hash Defaults

Suppose you have a table of parts, and each part has many installs. If you wanted to count part usage, you could do something like this:

installs_by_part = {}  
Install.all.each do |install|
  installs_by_part[install.part_id] ||= 0
  installs_by_part[install.part_id] += 1
end  

(Yes, if I really wanted counts, I'd use SQL, but for the sake of this post we'll ignore that.)

This works well enough. You can, however, configure the hash with a default. This shortens things a bit:

installs_by_part = Hash.new(0)  
Install.all.each do |install|
  installs_by_part[install.part_id] += 1
end  

This is definitely less cluttered.
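To try this without a database, here is a minimal sketch that uses a Struct as a stand-in for the Install model. The Struct and the sample rows are assumptions made up for illustration, not part of the real schema.

```ruby
# Hypothetical stand-in for the Install model, so this runs without a database.
Install = Struct.new(:id, :part_id)

installs = [
  Install.new(1, 10),
  Install.new(2, 10),
  Install.new(3, 20)
]

installs_by_part = Hash.new(0)
installs.each do |install|
  installs_by_part[install.part_id] += 1
end

installs_by_part[10]        # => 2
installs_by_part[20]        # => 1
installs_by_part[99]        # => 0, the default
installs_by_part.key?(99)   # => false, reading a missing key does not store it
```

Note that the default is only returned, never stored, so looking up a part with no installs leaves the hash unchanged.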

But what if you really wanted to collect the ids of each install by part id? At first, you might try the following:

installs_by_part = {}  
Install.all.each do |install|
  installs_by_part[install.part_id] ||= []
  installs_by_part[install.part_id] << install.id
end  

Let's try the same shortening technique we used on the first example:

installs_by_part = Hash.new([])  
Install.all.each do |install|
  installs_by_part[install.part_id] << install.id
end  

But this doesn't work! Our hash remains empty. What happened?

In our first example, we used +=, which really meant:

  installs_by_part[install.part_id] = installs_by_part[install.part_id] + 1

Which, upon first use of an index, became:

  installs_by_part[install.part_id] = 0 + 1

So the expansion of += set the hash value.

But our second use of the hash default changed the following:

  installs_by_part[install.part_id] << install.id

Into this:

  [] << install.id

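Nothing is being saved! The default array is returned but never assigned to the key. Worse, it is the same array object every time, so every id ends up appended to that one shared default.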
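A quick way to see where the ids actually went is to inspect the default itself. This is a small sketch with made-up keys and values; it only relies on Hash#default, which returns the object passed to Hash.new.

```ruby
installs_by_part = Hash.new([])

# Both appends go to the one array that was passed to Hash.new.
installs_by_part[10] << 1
installs_by_part[20] << 2

installs_by_part.empty?    # => true, no key was ever assigned
installs_by_part.default   # => [1, 2], both ids landed in the shared default

# Every missing key hands back the very same object.
installs_by_part[10].equal?(installs_by_part[20])   # => true
```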

To solve this, we have to use another form of the hash default. If we provide a block to Hash.new, it will be called whenever a missing key is looked up. The block receives both the hash and the missing key as arguments, and its return value becomes the result of the lookup. Our second case can then become this:

installs_by_part = Hash.new { |h,k| h[k] = [] }  
Install.all.each do |install|
  installs_by_part[install.part_id] << install.id
end  

Now when the empty array is proffered it has already been saved in the hash.
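One side effect of storing inside the block is worth knowing about: merely reading a missing key now creates it. The sketch below uses made-up keys to show this, along with Hash#fetch, which bypasses the default block.

```ruby
installs_by_part = Hash.new { |hash, key| hash[key] = [] }

installs_by_part[10] << 1
installs_by_part[10] << 2
installs_by_part[20] << 3

installs_by_part[10]   # => [1, 2]

# Reading a missing key runs the block, which stores an empty array.
installs_by_part[99]        # => []
installs_by_part.key?(99)   # => true

# fetch ignores the default block, so nothing is stored for 98.
installs_by_part.fetch(98, [])   # => []
installs_by_part.key?(98)        # => false
```

If stray empty entries would be a problem, checking with key? or reading with fetch avoids creating them.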

This provides quite a bit of flexibility. In fact, we can combine the two forms. Suppose we wanted to count the installs for each part, broken down by install type:

installs_by_part = Hash.new { |h,k| h[k] = Hash.new(0) }  
Install.all.each do |install|
  installs_by_part[install.part_id][install.type_id] += 1
end  

This is much shorter than the alternative, where we would check for an existing key twice, initializing an empty hash at the first level and a zero count at the second:

installs_by_part = {}  
Install.all.each do |install|
  installs_by_part[install.part_id] ||= {}
  installs_by_part[install.part_id][install.type_id] ||= 0
  installs_by_part[install.part_id][install.type_id] += 1
end  
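Here is a runnable sketch of the nested version, using plain arrays in place of Install records. The rows are hypothetical [part_id, type_id] pairs invented for the example.

```ruby
# Hypothetical data: each row is [part_id, type_id].
rows = [[10, 1], [10, 1], [10, 2], [20, 1]]

# Outer hash: the block stores a fresh counting hash for each new part.
# Inner hash: a plain default of 0, which is safe because integers are immutable.
counts = Hash.new { |hash, part_id| hash[part_id] = Hash.new(0) }

rows.each do |part_id, type_id|
  counts[part_id][type_id] += 1
end

counts[10][1]   # => 2
counts[10][2]   # => 1
counts[20][1]   # => 1
counts[20][5]   # => 0, the inner default, which is not stored
```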

ActiveRecord

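If you are using ActiveRecord, two handy methods cover indexing and grouping: index_by, which ActiveSupport adds to Enumerable, and group_by, which is part of Ruby's own Enumerable. Combining them with the Symbol#to_proc shorthand (&:id), we can easily index or group results: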

parts_by_id      = Part.all.index_by(&:id)  
installs_by_part = Install.all.group_by(&:part_id)  

This handles the simple cases. If, however, you are importing raw data and do not have ActiveRecord objects, or you need more fine-grained control, using the hash default value is very handy.
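As a rough illustration of the raw-data case, the sketch below works on plain hashes such as you might get from a parsed CSV export. The rows and key names are assumptions. It uses only core Ruby; the block form of to_h requires Ruby 2.6 or later.

```ruby
# Hypothetical raw rows, e.g. parsed from a CSV export.
rows = [
  { id: 1, part_id: 10 },
  { id: 2, part_id: 10 },
  { id: 3, part_id: 20 }
]

# group_by is core Ruby and returns the whole rows for each key.
rows_by_part = rows.group_by { |row| row[:part_id] }

# With a block default we choose what to collect, here only the ids.
ids_by_part = Hash.new { |hash, key| hash[key] = [] }
rows.each { |row| ids_by_part[row[:part_id]] << row[:id] }

# A plain-Ruby stand-in for index_by.
rows_by_id = rows.to_h { |row| [row[:id], row] }

rows_by_part[10].length   # => 2
ids_by_part[10]           # => [1, 2]
rows_by_id[3][:part_id]   # => 20
```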
