How to convert an integer to the shortest url-safe string in Python?
This answer is similar in spirit to Douglas Leeder's, with the following changes:
- It doesn't use actual Base64, so there are no padding characters.
- Instead of converting the number first to a byte string (base 256), it converts it directly to base 64, which has the advantage of letting you represent negative numbers using a sign character.
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + \
           string.digits + '-_'
ALPHABET_REVERSE = dict((c, i) for (i, c) in enumerate(ALPHABET))
BASE = len(ALPHABET)
SIGN_CHARACTER = '$'

def num_encode(n):
    if n < 0:
        return SIGN_CHARACTER + num_encode(-n)
    s = []
    while True:
        n, r = divmod(n, BASE)
        s.append(ALPHABET[r])
        if n == 0: break
    return ''.join(reversed(s))

def num_decode(s):
    if s[0] == SIGN_CHARACTER:
        return -num_decode(s[1:])
    n = 0
    for c in s:
        n = n * BASE + ALPHABET_REVERSE[c]
    return n
>>> num_encode(0)
'A'
>>> num_encode(64)
'BA'
>>> num_encode(-(64**5-1))
'$_____'
A few side notes:
- You could (marginally) increase the human readability of the base-64 numbers by putting string.digits first in the alphabet (and making the sign character '-'); I chose the order that I did based on Python's urlsafe_b64encode.
- If you're encoding a lot of negative numbers, you could increase the efficiency by using a sign bit or one's/two's complement instead of a sign character.
- You should be able to easily adapt this code to different bases by changing the alphabet, either to restrict it to only alphanumeric characters or to add additional "URL-safe" characters.
- I would recommend against using a representation other than base 10 in URIs in most cases—it adds complexity and makes debugging harder without significant savings compared to the overhead of HTTP—unless you're going for something TinyURL-esque.
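The sign-bit idea from the second note can be sketched with a zigzag mapping, which folds the sign into the lowest bit before encoding. This is only a minimal sketch and not part of the original answer: the names zigzag, unzigzag, encode_signed and decode_signed are made up for illustration, and the alphabet is the same urlsafe_b64encode ordering used above.

```python
import string

# Same alphabet ordering as the answer above (assumption: no sign character needed).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + '-_'
ALPHABET_REVERSE = {c: i for i, c in enumerate(ALPHABET)}
BASE = len(ALPHABET)

def zigzag(n):
    """Map 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ..."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def unzigzag(z):
    """Inverse of zigzag(): even values are non-negative, odd values negative."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)

def encode_signed(n):
    """Encode any integer, most significant digit first, with no sign character."""
    z = zigzag(n)
    digits = []
    while True:
        z, r = divmod(z, BASE)
        digits.append(ALPHABET[r])
        if z == 0:
            break
    return ''.join(reversed(digits))

def decode_signed(s):
    """Decode a string produced by encode_signed()."""
    z = 0
    for c in s:
        z = z * BASE + ALPHABET_REVERSE[c]
    return unzigzag(z)
```

With this mapping -32 fits in one character ('_') instead of the two needed with a sign character, at the cost of one bit for every non-negative number (32 now needs two characters).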
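To encode n:
import base64

# Python 3: base64 works on bytes, so build the big-endian byte string first
data = b''
while n > 0:
    data = bytes([n & 255]) + data  # prepend the lowest byte
    n = n >> 8
encoded = base64.urlsafe_b64encode(data).rstrip(b'=').decode('ascii')
To decode s:
data = base64.urlsafe_b64decode(s + '=' * (-len(s) % 4))  # restore the stripped padding
decoded = 0
for byte in data:
    decoded = (decoded << 8) | byte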
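For reference, the same byte-string approach can be wrapped into two functions using Python 3's int.to_bytes and int.from_bytes. This is a sketch under stated assumptions: the function names encode_int and decode_int are hypothetical, only non-negative integers are handled, and zero is encoded as a single zero byte rather than as an empty string.

```python
import base64

def encode_int(n):
    """Encode a non-negative integer as unpadded URL-safe Base64."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    # At least one byte, so that 0 does not become an empty string.
    length = max(1, (n.bit_length() + 7) // 8)
    data = n.to_bytes(length, 'big')
    return base64.urlsafe_b64encode(data).rstrip(b'=').decode('ascii')

def decode_int(s):
    """Decode a string produced by encode_int()."""
    padded = s + '=' * (-len(s) % 4)  # restore the stripped padding
    return int.from_bytes(base64.urlsafe_b64decode(padded), 'big')
```

Because the number passes through whole bytes, the result can be up to one character longer than a direct base-64 conversion of the same integer.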
In the same spirit as the other answers, if you want a more “optimal” encoding, RFC 1738 allows 73 characters to appear unencoded in a URL: the alphanumerics plus $-_.+!*'(),. Leaving out “+” (which is often decoded as a space) gives a 72-character alphabet:
alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_!$'()*,-."
encoded = ''
while n > 0:
    n, r = divmod(n, len(alphabet))
    encoded = alphabet[r] + encoded
and the decoding:
decoded = 0
for c in s:
    # index() raises ValueError on a character outside the alphabet
    decoded = decoded * len(alphabet) + alphabet.index(c)
You probably do not want real Base64 encoding for this: it adds padding and so on, and for small numbers it can even produce longer strings than hex would. If there's no need to interoperate with anything else, just use your own encoding. For example, here's a function that will encode to any base (note that the digits are stored least-significant first to avoid extra reverse() calls):
def make_encoder(baseString):
    size = len(baseString)
    d = dict((ch, i) for (i, ch) in enumerate(baseString)) # Map from char -> value
    if len(d) != size:
        raise Exception("Duplicate characters in encoding string")

    def encode(x):
        if x==0: return baseString[0] # Only needed if don't want '' for 0
        l=[]
        while x>0:
            l.append(baseString[x % size])
            x //= size
        return ''.join(l)

    def decode(s):
        return sum(d[ch] * size**i for (i,ch) in enumerate(s))

    return encode, decode
# Base 64 version:
encode,decode = make_encoder("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
assert decode(encode(435346456456)) == 435346456456
This has the advantage that you can use whatever base you want, just by adding appropriate characters to the encoder's base string.
Note, however, that the gains from larger bases are not going to be that big. Base 64 only reduces the size to 2/3 of base 16 (6 bits per character instead of 4), and each doubling of the base adds just one more bit per character. Unless you have a real need to compact things, just using hex will probably be the simplest and fastest option.
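To put numbers on that, here is a small sketch that counts how many characters a 64-bit value needs in the bases mentioned in this thread. The helper name digits_needed is made up for illustration, and it counts with integer division rather than logarithms to avoid floating-point rounding.

```python
def digits_needed(n, base):
    """Number of digits required to write non-negative n in the given base."""
    count = 1
    while n >= base:
        n //= base
        count += 1
    return count

n = 2**64 - 1  # largest unsigned 64-bit value
lengths = {base: digits_needed(n, base) for base in (10, 16, 64, 66, 73)}
# lengths == {10: 20, 16: 16, 64: 11, 66: 11, 73: 11}
```

For this value, going from base 64 to base 66 or 73 saves no characters at all, while base 64 itself saves five over hex.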
All the answers given regarding Base64 are very reasonable solutions. But they're technically incorrect. To convert an integer to the shortest URL-safe string possible, what you want is base 66: RFC 3986 defines 66 unreserved characters (the letters, the digits, and "-", ".", "_" and "~").
That code looks something like this:
from io import StringIO
import urllib

BASE66_ALPHABET = u"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~"
BASE = len(BASE66_ALPHABET)

def hexahexacontadecimal_encode_int(n):
    if n == 0:
        return BASE66_ALPHABET[0].encode('ascii')
    r = StringIO()
    while n:
        n, t = divmod(n, BASE)
        r.write(BASE66_ALPHABET[t])
    return r.getvalue().encode('ascii')[::-1]
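The snippet above only covers encoding. A matching decoder might look roughly like the sketch below. The function name hexahexacontadecimal_decode_int and the INDEX lookup table are assumptions made for this example, not taken from the linked package; the decoder reads the most significant digit first, mirroring the reversed output of the encoder.

```python
BASE66_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~"
BASE = len(BASE66_ALPHABET)
# Hypothetical helper table: character -> digit value.
INDEX = {c: i for i, c in enumerate(BASE66_ALPHABET)}

def hexahexacontadecimal_decode_int(s):
    """Decode a base-66 string (str or ASCII bytes) back to an integer."""
    if isinstance(s, bytes):
        s = s.decode('ascii')
    if not s:
        raise ValueError("cannot decode an empty string")
    n = 0
    for c in s:
        n = n * BASE + INDEX[c]  # KeyError on a character outside the alphabet
    return n
```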
Here's a complete implementation of a scheme like this, ready to go as a pip installable package:
https://github.com/aljungberg/hhc