PHP - What is a good way to produce a short alphanumeric string from a long md5 hash?

Here's a little function for consideration:

/** Return 22-char compressed version of 32-char hex string (eg from PHP md5). */
function compress_md5($md5_hash_str) {
    // (we start with 32-char $md5_hash_str eg "a7d2cd9e0e09bebb6a520af48205ced1")
    $md5_bin_str = "";
    foreach (str_split($md5_hash_str, 2) as $byte_str) { // ("a7", "d2", ...)
        $md5_bin_str .= chr(hexdec($byte_str));
    }
    // ($md5_bin_str is now a 16-byte string equivalent to $md5_hash_str)
    $md5_b64_str = base64_encode($md5_bin_str);
    // (now it's a 24-char string version of $md5_hash_str eg "p9LNng4JvrtqUgr0ggXO0Q==")
    $md5_b64_str = substr($md5_b64_str, 0, 22);
    // (but we know the last two chars will be ==, so drop them eg "p9LNng4JvrtqUgr0ggXO0Q")
    $url_safe_str = str_replace(array("+", "/"), array("-", "_"), $md5_b64_str);
    // (Base64 includes two non-URL safe chars, so we replace them with safe ones)
    return $url_safe_str;
}

Basically, you have 16 bytes of data in the MD5 hash string. It's 32 chars long because each byte is encoded as 2 hex digits (i.e. 00-FF). So we split it into 2-digit pairs and build a 16-byte binary string from them. Because that string is no longer human-readable or valid ASCII, we base-64 encode it back into readable chars. Base-64 expands data by about 4/3 (each output char carries only 6 bits, so 24 bits of input need 32 bits of output), so the 16 bytes need 22 chars. Base-64 encoders typically pad their output to a multiple of 4 chars, which gives 24 chars here, the last 2 of which are "=" padding, so we keep only the first 22. Finally, we replace the two non-URL-safe characters used by base-64 ("+" and "/") with URL-safe equivalents ("-" and "_").
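
To check the arithmetic, here is a minimal sketch that prints the length at each stage. It uses pack('H*', ...) as a shortcut for the hex-to-binary loop, and md5('hello') is just an arbitrary sample input:

```php
<?php
// Illustrative only: track the string length at each stage of the conversion.
$hex   = md5('hello');            // 32 hex chars
$bin   = pack('H*', $hex);        // 16 raw bytes
$b64   = base64_encode($bin);     // 24 chars, ending in "=="
$short = rtrim($b64, '=');        // 22 chars once the padding is dropped

printf("%d -> %d -> %d -> %d\n", strlen($hex), strlen($bin), strlen($b64), strlen($short));
// prints "32 -> 16 -> 24 -> 22"
// (128 bits / 6 bits per char = 21.33..., rounded up to 22 chars)
```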

This is fully reversible, but that is left as an exercise to the reader.
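
For anyone who wants to skip the exercise, the reverse direction could look roughly like this. The name decompress_md5() is my own, and the forward step is inlined so the snippet runs on its own; treat it as a sketch rather than the definitive inverse:

```php
<?php
/** Hypothetical inverse: 22-char URL-safe string back to a 32-char hex string. */
function decompress_md5($short_str) {
    // Restore the standard Base64 alphabet.
    $b64 = strtr($short_str, '-_', '+/');
    // Re-append the two padding chars that were dropped, then decode strictly.
    $bin = base64_decode($b64 . '==', true);
    if ($bin === false || strlen($bin) !== 16) {
        return false; // not a valid compressed MD5
    }
    return bin2hex($bin);
}

// Round trip, with the forward conversion written inline.
$hex   = md5('hello');
$short = strtr(rtrim(base64_encode(pack('H*', $hex)), '='), '+/', '-_');
var_dump(decompress_md5($short) === $hex); // bool(true)
```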

I think this is the best you can do, unless you don't care about human-readable/ASCII, in which case you can just use $md5_bin_str directly.

You can also use a prefix or other subset of the result from this function if you don't need to preserve all the bits. Throwing out data is obviously the simplest way to shorten things! (But then it's not reversible.)
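
As a rough illustration of the prefix idea, here is a sketch with a hypothetical short_id() helper (the name and the default length of 8 are my own choices, not part of the function above). Each kept char carries 6 bits, so 8 chars keep 48 of the 128 bits:

```php
<?php
// Hypothetical helper: keep only the first $n chars of the 22-char URL-safe form.
function short_id($md5_hex, $n = 8) {
    $b64  = base64_encode(pack('H*', $md5_hex));
    $safe = strtr(rtrim($b64, '='), '+/', '-_');
    return substr($safe, 0, $n);
}

echo short_id(md5('hello')), "\n";     // 8 chars, 48 bits kept
echo short_id(md5('hello'), 12), "\n"; // 12 chars, 72 bits kept
```

With 48 bits, collisions become likely once you have on the order of 2^24 (about 16 million) distinct values, so pick the length to suit the size of your data set.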

P.S. For your input of "a7d2cd9e0e09bebb6a520af48205ced1" (32 chars), this function will return "p9LNng4JvrtqUgr0ggXO0Q" (22 chars).


Here are two conversion functions for Base-16 to Base-64 conversion and the inverse Base-64 to Base-16 for arbitrary input lengths:

function base16_to_base64($base16) {
    return base64_encode(pack('H*', $base16));
}
function base64_to_base16($base64) {
    return implode('', unpack('H*', base64_decode($base64)));
}

If you need Base-64 encoding with the URL- and filename-safe alphabet, you can use these functions:

function base64_to_base64safe($base64) {
    return strtr($base64, '+/', '-_');
}
function base64safe_to_base64($base64safe) {
    return strtr($base64safe, '-_', '+/');
}

If you now want a function to compress your hexadecimal MD5 values using URL safe characters, you can use this:

function compress_hash($hash) {
    return base64_to_base64safe(rtrim(base16_to_base64($hash), '='));
}

And the inverse function:

function uncompress_hash($hash) {
    return base64_to_base16(base64safe_to_base64($hash));
}
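
Since these helpers accept arbitrary input lengths, the same approach should work for other hex digests too. The sketch below restates the pair compactly under different (assumed) names, compress_hex() and uncompress_hex(), so that it runs on its own, and tries it with the built-in hash() function:

```php
<?php
// Compact restatement of the helpers above under hypothetical names.
function compress_hex($hex) {
    return strtr(rtrim(base64_encode(pack('H*', $hex)), '='), '+/', '-_');
}
function uncompress_hex($short) {
    // base64_decode() in its default non-strict mode accepts unpadded input.
    return bin2hex(base64_decode(strtr($short, '-_', '+/')));
}

foreach (array('md5', 'sha1', 'sha256') as $algo) {
    $hex   = hash($algo, 'hello');
    $short = compress_hex($hex);
    $ok    = uncompress_hex($short) === $hex ? 'ok' : 'FAILED';
    printf("%-6s %2d -> %2d chars, round trip %s\n", $algo, strlen($hex), strlen($short), $ok);
}
// md5: 32 -> 22, sha1: 40 -> 27, sha256: 64 -> 43
```

rtrim() matters here because the amount of "=" padding varies with the digest length (two chars for MD5, one for SHA-1 and SHA-256).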

Of course, if I want a function to satisfy my needs perfectly, I'd better make it myself. Here is what I came up with.

//takes a string input, int length and optionally a string charset
//returns a hash 'length' digits long made up of characters a-z,A-Z,0-9 or those specified by charset
function custom_hash($input, $length, $charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'){
    $output = '';
    $input = md5($input); //this gives us a nice random hex string regardless of input 

    do{
        foreach (str_split($input,8) as $chunk){
            srand(hexdec($chunk));
            $output .= substr($charset, rand(0, strlen($charset) - 1), 1); // rand() is inclusive at both ends
        }
        $input = md5($input);

    } while(strlen($output) < $length);

    return substr($output,0,$length);
}

This is a very general-purpose random string generator. It is not just any old random string generator, though: the result is determined by the input string, and any slight change to that input will produce a totally different result. You can do all sorts of things with this:

// (exact outputs depend on the PHP version: rand()/srand() became aliases of mt_rand()/mt_srand() in PHP 7.1)
custom_hash('1d34ecc818c4d50e788f0e7a9fd33662', 16); // 9FezqfFBIjbEWOdR
custom_hash('Bilbo Baggins', 5, '0123456789bcdfghjklmnpqrstvwxyz'); // lv4hb
custom_hash('', 100, '01');
// 1101011010110001100011111110100100101011001011010000101010010011000110000001010100111000100010101101

Anyone see any problems with it or any room for improvement?
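
One possible direction, sketched under my own assumptions: skip srand()/rand() entirely and pick each character straight from the hash bytes, so the result no longer depends on the random number generator of the PHP version in use. The name custom_hash_bytes() is hypothetical, and note that the modulo step is slightly biased whenever 256 is not a multiple of the charset length:

```php
<?php
// Hypothetical variant: derive each output char directly from MD5 bytes.
function custom_hash_bytes($input, $length, $charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789') {
    $n      = strlen($charset);
    $output = '';
    $block  = md5($input, true); // 16 raw bytes

    while (strlen($output) < $length) {
        foreach (str_split($block) as $byte) {
            // Map one byte (0-255) onto the charset; slightly biased unless 256 % $n == 0.
            $output .= $charset[ord($byte) % $n];
        }
        $block = md5($block, true); // chain the hash to get 16 more bytes
    }

    return substr($output, 0, $length);
}

echo custom_hash_bytes('Bilbo Baggins', 5, '0123456789bcdfghjklmnpqrstvwxyz'), "\n";
echo custom_hash_bytes('', 100, '01'), "\n";
```

This is not a cryptographic construction, just a deterministic way to turn an input into a short string over a chosen alphabet.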
