Last night I attended my first OK.rb meeting. I moved back to the OKC area from New York several months ago, but it took me some time to get things in order before I could start attending. You know – kids, man.
Anyway, I have nothing but good things to say about this group. It’s full of smarts and enthusiasm – the sort of thing any hacker-type should be surrounding himself with. Even better, they work on actual code problems. It’s not just fanboy hour; they are actually examining problems and working them out together. A great way to sharpen skills for sure.
The problem presented was clear: understand this code, then optimize it.
# code example provided by James Gray II (http://twitter.com/jeg2)
DICTIONARY = "/usr/share/dict/words"
class String
def to_signature
downcase.delete("^a-z").split("").sort.join
end
end
scrambled = ARGV.first.to_s.to_signature
abort "USAGE: #{$PROGRAM_NAME} SCRAMBLED_WORD" if scrambled.empty?
File.foreach(DICTIONARY) do |word|
puts word if word.to_signature == scrambled
end
The code is clear enough: we’re monkey patching the String class, allowing us to generate a simple signature for a given word. We convert the word to lowercase, rip out any non-alphabetic characters, and then place the characters in alphabetically order.
An example of it in action via irb:
ruby-1.9.2-p180 :008 > "test".to_signature
=> "estt"
It’s worth noting that signatures are not necessarily unique:
ruby-1.9.2-p180 :010 > "host".to_signature
=> "host"
ruby-1.9.2-p180 :011 > "shot".to_signature
=> "host"
So, our code takes a word as an input, obtains its signature, and then finds every word in the dictionary which has the same signature – a very brute force approach, no? This leads to the next part: optimization.
The first step is to start measuring execution time. I’m paraphrasing, but our first move was to wrap our algorithm in a method and include some simple debug timing:
DICTIONARY = "/usr/share/dict/words"
class String
def to_signature
downcase.delete("^a-z").split("").sort.join
end
end
def descramble(scrambled)
File.foreach(DICTIONARY) do |word|
puts word if word.to_signature == scrambled
end
end
scrambled = ARGV.first.to_s.to_signature
abort "USAGE: #{$PROGRAM_NAME} SCRAMBLED_WORD" if scrambled.empty?
start = Time.now
descramble(scrambled)
puts Time.now - start
An initial run looks like this:
$ ruby descramble.rb test
sett
stet
test
2.703979
Alright. Our algorithm takes 2.7 seconds to brute force the word list.
The first improvement that came to the group mind was to see if the first word of each dictionary word was even included in our signature. If it is not, there is no need to run the expensive to_signature method.
def descramble(sig)
File.foreach(DICTIONARY) do |word|
next unless scrambled.include? word[0]
puts word if word.to_signature == sig
end
end
A clear win:
$ ruby descramble.rb test
sett
stet
test
0.594181
Nice! That’s about 4.5 times faster than the original.
As the session wore on, we were able to work out some seeming improvements, but nothing that made a huge difference. The only other thing we discussed would be to pre-index the word file, but that seemed to be cheating.
In the real world, though, I don’t think that there really is such a thing. We want to make this thing fast, and the only way we’re really going to do that is to index the file before hand. Upon returning home, I quickly whipped up a solution using Redis as my data store.
First, we pre-load the data:
# cheat.rb - preload our word data into redis sets, using the signature as the key
require 'redis'
DICTIONARY = "/usr/share/dict/words"
class String
def to_signature
downcase.delete("^a-z").split("").sort.join
end
end
# our interface to the local redis instance
r = Redis.new
# go ahead and clear out anything that we might have in here
r.flushall
File.foreach(DICTIONARY) do |word|
r.sadd word.to_signature, word
end
That takes about 17 seconds to run, but we only have to do it once. After that, the indexed data persists and we can query it as needed.
Our app changes to rely upon Redis:
require 'redis'
class String
def to_signature
downcase.delete("^a-z").split("").sort.join
end
end
def descramble(sig)
r = Redis.new
puts r.smembers sig
end
scrambled = ARGV.first.to_s.to_signature
abort "USAGE: #{$PROGRAM_NAME} SCRAMBLED_WORD" if scrambled.empty?
start = Time.now
descramble(scrambled)
puts Time.now - start
The performance improvement is obvious:
$ ruby descramble.rb test
test
stet
sett
0.001002
0.001002 seconds! That’s nearly 2,700 times as fast as the original piece of code. Ah, the spoils of cheating!
Of course, Redis is not the only choice here, but given the scope of the problem, it seemed like a fair pick. It is also such a cheap tool, meaning that there isn’t much to getting it running and putting it to quick use.