perl: String hashCode

I had a need to create unique numbers out of strings. So, I explored the option of using hashcodes (which won’t be 100% unique especially if they need to be represented in limited number of bits). I couldn’t find any function suitable within perlfunc. After doing a bit of research on the web and not wanting to spend too much time understanding the theory, I just looked at java.lang.String source code and the hashcode function was like

hash = 0;
for(character in string) {
hash = 31*hash+character;
}

So far so good. When I tried this with perl, I got completely different answer. So, it turns out this is because, by default perl doesn’t use integer arithmetic in calculations. But you can force it to do so with “use integer;” and most importantly, this behavior can be controlled by block. So, here is my final perl String hashCode function that is same as Java’s implementation of a String hashCode.

sub hashCode {
my $hash = 0;
use integer;
foreach(split //,shift) {
$hash = 31*$hash+ord($_);
}
return $hash;
}

That’s it.

Advertisements

3 Comments

Filed under perl

3 responses to “perl: String hashCode

  1. nid

    64bit os and 32bit os will return different values. so bad news.

    • Jacob

      If anyone else needs to run this on a 64 bit machines the following works for me.
      This code should also work on a 32 bit machine but I haven’t tested it.

      sub truncateToInt32
      {
      use integer;
      my $NEGATIVE_BIT = 2 ** 31;
      my $MAX_POSITIVE = 2 ** 31 – 1;
      my $val = shift;
      if ($NEGATIVE_BIT < 0) {
      # 32 bit machine so don't do anything.
      return $val;
      }
      my $truncated = $val & $MAX_POSITIVE;
      if (($val & $NEGATIVE_BIT) != 0) {
      $truncated -= $NEGATIVE_BIT;
      }
      return $truncated;
      }

      sub hashCode {
      use integer;
      my $hash = 0;
      foreach(split //, shift) {
      $hash = truncateToInt32(31 * $hash);
      $hash = truncateToInt32($hash + ord($_));
      }
      return $hash;
      }

  2. I found this post quite helpful, but unfortuantely for some of my test cases it failed to match java. However, it got me far enough going in the “right” direction that I produced something that does work for all of my test cases:

    perl -le ‘$h = unpack “i”, pack “i”, ($_ + unpack “i”, pack “i”, $h*31) foreach unpack “C*”, $ARGV[0]; print $h >= 0 ? $h : -$h’ /path/to/eclipse

    A summary of my “improvements” (take that for what it’s worth):
    1) hashcode is always a postive number, but the algorithm can produce negative numbers. 2) unpack/pack with the “i” flag handles all the messiness of wrapping around 32-bit numbers. 3) using unpack “C*” instead of split//, should be significantly faster… additionally, it returns each byte of the string as their numerical value, so you can avoid the subsequent call to ord.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s