Why do I need 'b' to encode a string with Base64?
base64 encoding takes 8-bit binary byte data and encodes it uses only the characters A-Z
, a-z
, 0-9
, +
, /
* so it can be transmitted over channels that do not preserve all 8-bits of data, such as email.
Hence, it wants a string of 8-bit bytes. You create those in Python 3 with the b''
syntax.
If you remove the b
, it becomes a string. A string is a sequence of Unicode characters. base64 has no idea what to do with Unicode data, it's not 8-bit. It's not really any bits, in fact. :-)
In your second example:
>>> encoded = base64.b64encode('data to be encoded')
All the characters fit neatly into the ASCII character set, and base64 encoding is therefore actually a bit pointless. You can convert it to ascii instead, with
>>> encoded = 'data to be encoded'.encode('ascii')
Or simpler:
>>> encoded = b'data to be encoded'
Which would be the same thing in this case.
* Most base64 flavours may also include a =
at the end as padding. In addition, some base64 variants may use characters other than +
and /
. See the Variants summary table at Wikipedia for an overview.
Short Answer
You need to push a bytes-like
object (bytes
, bytearray
, etc) to the base64.b64encode()
method. Here are two ways:
>>> import base64
>>> data = base64.b64encode(b'data to be encoded')
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
Or with a variable:
>>> import base64
>>> string = 'data to be encoded'
>>> data = base64.b64encode(string.encode())
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
Why?
In Python 3, str
objects are not C-style character arrays (so they are not byte arrays), but rather, they are data structures that do not have any inherent encoding. You can encode that string (or interpret it) in a variety of ways. The most common (and default in Python 3) is utf-8, especially since it is backwards compatible with ASCII (although, as are most widely-used encodings). That is what is happening when you take a string
and call the .encode()
method on it: Python is interpreting the string in utf-8 (the default encoding) and providing you the array of bytes that it corresponds to.
Base-64 Encoding in Python 3
Originally the question title asked about Base-64 encoding. Read on for Base-64 stuff.
base64
encoding takes 6-bit binary chunks and encodes them using the characters A-Z, a-z, 0-9, '+', '/', and '=' (some encodings use different characters in place of '+' and '/'). This is a character encoding that is based off of the mathematical construct of radix-64 or base-64 number system, but they are very different. Base-64 in math is a number system like binary or decimal, and you do this change of radix on the entire number, or (if the radix you're converting from is a power of 2 less than 64) in chunks from right to left.
In base64
encoding, the translation is done from left to right; those first 64 characters are why it is called base64
encoding. The 65th '=' symbol is used for padding, since the encoding pulls 6-bit chunks but the data it is usually meant to encode are 8-bit bytes, so sometimes there are only two or 4 bits in the last chunk.
Example:
>>> data = b'test'
>>> for byte in data:
... print(format(byte, '08b'), end=" ")
...
01110100 01100101 01110011 01110100
>>>
If you interpret that binary data as a single integer, then this is how you would convert it to base-10 and base-64 (table for base-64):
base-2: 01 110100 011001 010111 001101 110100 (base-64 grouping shown)
base-10: 1952805748
base-64: B 0 Z X N 0
base64
encoding, however, will re-group this data thusly:
base-2: 011101 000110 010101 110011 011101 00(0000) <- pad w/zeros to make a clean 6-bit chunk
base-10: 29 6 21 51 29 0
base-64: d G V z d A
So, 'B0ZXN0' is the base-64 version of our binary, mathematically speaking. However, base64
encoding has to do the encoding in the opposite direction (so the raw data is converted to 'dGVzdA') and also has a rule to tell other applications how much space is left off at the end. This is done by padding the end with '=' symbols. So, the base64
encoding of this data is 'dGVzdA==', with two '=' symbols to signify two pairs of bits will need to be removed from the end when this data gets decoded to make it match the original data.
Let's test this to see if I am being dishonest:
>>> encoded = base64.b64encode(data)
>>> print(encoded)
b'dGVzdA=='
Why use base64
encoding?
Let's say I have to send some data to someone via email, like this data:
>>> data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'
>>> print(data.decode())
>>> print(data)
b'\x04msg\x08\x08\x08 '
>>>
There are two problems I planted:
- If I tried to send that email in Unix, the email would send as soon as the
\x04
character was read, because that is ASCII forEND-OF-TRANSMISSION
(Ctrl-D), so the remaining data would be left out of the transmission. - Also, while Python is smart enough to escape all of my evil control characters when I print the data directly, when that string is decoded as ASCII, you can see that the 'msg' is not there. That is because I used three
BACKSPACE
characters and threeSPACE
characters to erase the 'msg'. Thus, even if I didn't have theEOF
character there the end user wouldn't be able to translate from the text on screen to the real, raw data.
This is just a demo to show you how hard it can be to simply send raw data. Encoding the data into base64 format gives you the exact same data but in a format that ensures it is safe for sending over electronic media such as email.