Return a new string that sorts between two given strings

Here is an equivalent function to m69's answer implemented directly in my PostgreSQL database, with PL/pgSQL:

create or replace function app_public.mid_string(prev text, next text) returns text as $$
declare
  v_p int;
  v_n int;
  v_pos int := 0;
  v_str text;
begin
  LOOP -- find leftmost non-matching character
    v_p := CASE WHEN v_pos < char_length(prev) THEN ascii(substring(prev from v_pos + 1)) ELSE 96 END;
    v_n := CASE WHEN v_pos < char_length(next) THEN ascii(substring(next from v_pos + 1)) ELSE 123 END;
    v_pos := v_pos + 1;
    EXIT WHEN NOT (v_p = v_n);
  END LOOP;
  v_str := left(prev, v_pos-1);   -- copy identical part of string
  IF v_p = 96 THEN                -- prev string equals beginning of next
    WHILE v_n = 97 LOOP           -- next character is 'a'
      -- get char from next
      v_n = CASE WHEN v_pos < char_length(next) THEN ascii(substring(next from v_pos + 1)) ELSE 123 END;
      v_str := v_str || 'a';      -- insert an 'a' to match the 'a'
      v_pos := v_pos + 1;
    END LOOP;
    IF v_n = 98 THEN              -- next character is 'b'
      v_str := v_str || 'a';      -- insert an 'a' to match the 'b'
      v_n := 123;                 -- set to end of alphabet
    END IF;
  ELSIF (v_p + 1) = v_n THEN    -- found consecutive characters
    v_str := v_str || chr(v_p); -- insert character from prev
    v_n = 123;                  -- set to end of alphabet
    v_p := CASE WHEN v_pos < char_length(prev) THEN ascii(substring(prev from v_pos + 1)) ELSE 96 END;
    WHILE v_p = 122 LOOP
      v_pos := v_pos + 1;
      v_str := v_str || 'z';    -- insert 'z' to match 'z'
      v_p := CASE WHEN v_pos < char_length(prev) THEN ascii(substring(prev from v_pos + 1)) ELSE 96 END;
    END LOOP;
  END IF;
  return v_str || chr(ceil((v_p + v_n) / 2.0)::int);
end;
$$ language plpgsql strict volatile;

Tested with this function:

create or replace function app_public.test() returns text[] as $$
declare
  v_strings text[];
  v_rnd int;
begin
  v_strings := array_append(v_strings, app_public.mid_string('', ''));

  FOR counter IN 1..100 LOOP
    v_strings := v_strings || app_public.mid_string(v_strings[counter], '');
  END LOOP;
  return v_strings;
end;
$$ language plpgsql strict volatile;

Which results in:

"strings": [
  "n",
  "u",
  "x",
  "z",
  "zn",
  "zu",
  "zx",
  "zz",
  "zzn",
  "zzu",
  "zzx",
  "zzz",
  "zzzn",
  "zzzu",
  "zzzx",
  "zzzz",
  "...etc...",
  "zzzzzzzzzzzzzzzzzzzzzzzzn",
  "zzzzzzzzzzzzzzzzzzzzzzzzu",
  "zzzzzzzzzzzzzzzzzzzzzzzzx",
  "zzzzzzzzzzzzzzzzzzzzzzzzz",
  "zzzzzzzzzzzzzzzzzzzzzzzzzn"
]

This is a very simple way to achieve this and probably far from optimal (depending on what you call optimal of course).

I use only a and b. I suppose you could generalise this to use more letters.

Two simple observations:

Creating a new string that comes after another string is easy: just append one or more letters. E.g., abba < abbab.
Creating a new string that comes before another string x is only always guaranteed to be possible if x ends with b. Now, replace that b by an a and append one or more letters. E.g., abbab > abbaab.

The algorithm is now very simple. Start with a and b as sentinels. Inserting a new key between two existing keys x and y:

If x is a prefix of y: the new key is y with the ending b replaced by ab.
If x is not a prefix of y: the new key is x with a b appended.

Example run:

a, b
a, ab*, b
a, aab*, ab, b
a, aab, ab, abb*, b
a, aab, ab, abab*, abb, b
a, aaab*, aab, ab, abab, abb, b

Minimising string length

If you want to keep the string lengths to a minimum, you could create a string that is lexicographically halfway between the left and right strings, so that there is room to insert additional strings, and only create a longer string if absolutely necessary.

I will assume an alphabet [a-z], and a lexicographical ordering where an empty space comes before 'a', so that e.g. "ab" comes before "abc".

Basic case

You start by copying the characters from the beginning of the strings, until you encounter the first difference, which could be either two different characters, or the end of the left string:

abcde ~ abchi  ->  abc  +  d ~ h  
abc   ~ abchi  ->  abc  +  _ ~ h

The new string is then created by appending the character that is halfway in the alphabet between the left character (or the beginning of the alphabet) and the right character:

abcde ~ abchi  ->  abc  +  d ~ h  ->  abcf  
abc   ~ abchi  ->  abc  +  _ ~ h  ->  abcd

Consecutive characters

If the two different characters are lexicographically consecutive, first copy the left character, and then append the character halfway between the next character from the left string and the end of the alphabet:

abhs ~ abit  ->  ab  +  h ~ i  ->  abh  +  s ~ _  ->  abhw
abh  ~ abit  ->  ab  +  h ~ i  ->  abh  +  _ ~ _  ->  abhn

If the next character(s) in the left string are one or more z's, then copy them and append the character halfway between the first non-z character and the end of the alphabet:

abhz   ~ abit  ->  ab  +  h ~ i  ->  abh  +  z ~ _  ->  abhz  +  _ ~ _  ->  abhzn  
abhzs  ~ abit  ->  ab  +  h ~ i  ->  abh  +  z ~ _  ->  abhz  +  s ~ _  ->  abhzw  
abhzz  ~ abit  ->  ab  +  h ~ i  ->  abh  +  z ~ _  ->  ... ->  abhzz  +  _ ~ _  ->  abhzzn

Right character is a or b

You should never create a string by appending an 'a' to the left string, because that would create two lexicographically consecutive strings, inbetween which no further strings could be added. The solution is to always append an additional character, halfway inbetween the beginning of the alphabet and the next character from the right string:

abc  ~ abcah   ->  abc  +  _ ~ a  ->  abca  +  _ ~ h  ->  abcad  
abc  ~ abcab   ->  abc  +  _ ~ a  ->  abca  +  _ ~ b  ->  abcaa  +  _ ~ _  ->  abcaan  
abc  ~ abcaah  ->  abc  +  _ ~ a  ->  abca  +  _ ~ a  ->  abcaa  +  _ ~ h  ->  abcaad  
abc  ~ abcb    ->  abc  +  _ ~ b  ->  abca  +  _ ~ _  ->  abcan

Code examples

Below is a code snippet which demonstrates the method. It's a bit fiddly because JavaScript, but not actually complicated. To generate a first string, call the function with two empty strings; this will generate the string "n". To insert a string before the leftmost or after the rightmost string, call the function with that string and an empty string.

function midString(prev, next) {
    var p, n, pos, str;
    for (pos = 0; p == n; pos++) {               // find leftmost non-matching character
        p = pos < prev.length ? prev.charCodeAt(pos) : 96;
        n = pos < next.length ? next.charCodeAt(pos) : 123;
    }
    str = prev.slice(0, pos - 1);                // copy identical part of string
    if (p == 96) {                               // prev string equals beginning of next
        while (n == 97) {                        // next character is 'a'
            n = pos < next.length ? next.charCodeAt(pos++) : 123;  // get char from next
            str += 'a';                          // insert an 'a' to match the 'a'
        }
        if (n == 98) {                           // next character is 'b'
            str += 'a';                          // insert an 'a' to match the 'b'
            n = 123;                             // set to end of alphabet
        }
    }
    else if (p + 1 == n) {                       // found consecutive characters
        str += String.fromCharCode(p);           // insert character from prev
        n = 123;                                 // set to end of alphabet
        while ((p = pos < prev.length ? prev.charCodeAt(pos++) : 96) == 122) {  // p='z'
            str += 'z';                          // insert 'z' to match 'z'
        }
    }
    return str + String.fromCharCode(Math.ceil((p + n) / 2)); // append middle character
}

var strings = ["", ""];
while (strings.length < 100) {
    var rnd = Math.floor(Math.random() * (strings.length - 1));
    strings.splice(rnd + 1, 0, midString(strings[rnd], strings[rnd + 1]));
    document.write(strings + "<br>");
}

Below is a straightforward translation into C. Call the function with empty null-terminated strings to generate the first string, or insert before the leftmost or after the rightmost string. The string buffer buf should be large enough to accomodate one extra character.

int midstring(const char *prev, const char *next, char *buf) {
    char p = 0, n = 0;
    int len = 0;
    while (p == n) {                                           // copy identical part
        p = prev[len] ? prev[len] : 'a' - 1;
        n = next[len] ? next[len] : 'z' + 1;
        if (p == n) buf[len++] = p;
    }
    if (p == 'a' - 1) {                                        // end of left string
        while (n == 'a') {                                     // handle a's
            buf[len++] = 'a';
            n = next[len] ? next[len] : 'z' + 1;
        }
        if (n == 'b') {                                        // handle b
            buf[len++] = 'a';
            n = 'z' + 1;
        }
    }
    else if (p + 1 == n) {                                     // consecutive characters
        n = 'z' + 1;
        buf[len++] = p;
        while ((p = prev[len] ? prev[len] : 'a' - 1) == 'z') { // handle z's
            buf[len++] = 'z';
        }
    }
    buf[len++] = n - (n - p) / 2;                              // append middle character
    buf[len] = '\0';
    return len;
}

Average string length

The best case is when the elements are inserted in random order. In practice, when generating 65,536 strings in pseudo-random order, the average string length is around 4.74 characters (the theoretical minimum, using every combination before moving to longer strings, would be 3.71).

The worst case is when inserting the elements in order, and always generating a new rightmost or leftmost string; this will lead to a recurring pattern:

n, u, x, z, zn, zu, zx, zz, zzn, zzu, zzx, zzz, zzzn, zzzu, zzzx, zzzz...  
n, g, d, b, an, ag, ad, ab, aan, aag, aad, aab, aaan, aaag, aaad, aaab...

with an extra character being added after every fourth string.

If you have an existing ordered list for which you want to generate keys, generate lexicographically equally-spaced keys with an algorithm like the one below, and then use the algorithm described above to generate a new key when inserting a new element.

The code checks how many charactes are needed, how many different characters are needed for the least significant digit, and then switches between two selections from the alphabet to get the right number of keys. E.g. keys with two character can have 676 different values, so if you ask for 1600 keys, that is 1.37 extra keys per two-character combination, so after each two-character key an additional one ('n') or two ('j','r') characters are appended, i.e.: aan ab abj abr ac acn ad adn ae aej aer af afn ... (skipping the initial 'aa').

function seqString(num) {
    var chars = Math.floor(Math.log(num) / Math.log(26)) + 1;
    var prev = Math.pow(26, chars - 1);
    var ratio = chars > 1 ? (num + 1 - prev) / prev : num;
    var part = Math.floor(ratio);
    var alpha = [partialAlphabet(part), partialAlphabet(part + 1)];
    var leap_step = ratio % 1, leap_total = 0.5;
    var first = true;
    var strings = [];
    generateStrings(chars - 1, "");
    return strings;

    function generateStrings(full, str) {
        if (full) {
            for (var i = 0; i < 26; i++) {
                generateStrings(full - 1, str + String.fromCharCode(97 + i));
            }
        }
        else {
            if (!first) strings.push(stripTrailingAs(str));
            else first = false;
            var leap = Math.floor(leap_total += leap_step);
            leap_total %= 1;
            for (var i = 0; i < part + leap; i++) {
                strings.push(str + alpha[leap][i]);
            }
        }
    }
    function stripTrailingAs(str) {
        var last = str.length - 1;
        while (str.charAt(last) == 'a') --last;
        return str.slice(0, last + 1);
    }
    function partialAlphabet(num) {
        var magic = [0, 4096, 65792, 528416, 1081872, 2167048, 2376776, 4756004,
                     4794660, 5411476, 9775442, 11097386, 11184810, 22369621];
        var bits = num < 13 ? magic[num] : 33554431 - magic[25 - num];
        var chars = [];
        for (var i = 1; i < 26; i++, bits >>= 1) {
            if (bits & 1) chars.push(String.fromCharCode(97 + i));
        }
        return chars;
    }

}
document.write(seqString(1600).join(' '));

Return a new string that sorts between two given strings

Tags:

Algorithm

Sorting

String

Related

Recent Posts