17 July 2008

You may, at length, find yourself in need of a pseudo unique numeric identifier of a certain length. You could just generate a random number and hope that those are distributed evenly across the output space. Another way to do this is by taking the output of an MD5 (or SHA1) hash and using some subset of its bits.

How might this go? Let's say your requirement is to generate a 12-digit identifier. To figure out how many bits of the 128-bit MD5 you need for your n-digit requirement, find the largest b such that 2^{b} < 10^{n}. In this case, 39 is the largest b that results in a number less than 10^{12}.

Once you know that, you need to actually get those bits off of the hash. One way to do this is to take the last 10 characters of the MD5 hash, convert it into a numeric representation, and bitwise AND it with 0x7FFFFFFFFF.

With something like Ruby, you could simply do (0xDEADBEEF08 & 0x7fffffffff) to get a result. I was actually doing this in VB6 and couldn't bitwise AND anything but ints and so ended up taking each hexadecimal digit and multiplying it by 16^{place}, where place is 0 for the rightmost digit (and 1 for the next digit, etc).

According to the birthday paradox, you can expect your first duplicate after about 2^{39/2} calls.

I ran some tests comparing the number of collisions you get with this method versus simply generating a random number and this produced about 98 duplicates in 10 million tries, whereas the random number scheme produced something like 20,000.