Converting to Base64 in JavaScript without Deprecated 'Escape' call

TL;DR In principle escape()/unescape() are not necessary, and your second version without the deprecated functions is safe, yet it generates longer base64 encoded output:

  • console.log(decodeURIComponent(atob(btoa(encodeURIComponent("€uro")))))
  • console.log(decodeURIComponent(escape(atob(btoa(unescape(encodeURIComponent("€uro")))))))

Both produce the output "€uro", but the version without escape()/unescape() yields a longer base64 representation:

  • btoa(encodeURIComponent("€uro")).length // = 16
  • btoa(unescape(encodeURIComponent("€uro"))).length // = 8

The escape()/unescape() step only becomes necessary if the counterpart (e.g. a PHP script that you cannot change) expects the base64 to be produced in that specific way.

Long version:

First, to better understand the difference between the two versions of toBase64() and fromBase64() that you suggest above, let us look at btoa(), which is at the core of the issue. The documentation says that the name btoa is a mnemonic, so that

"b" can be considered to stand for "binary", and the "a" for "ASCII".

which is somewhat misleading, as the documentation hastens to add that

in practice, though, for primarily historical reasons, both the input and output of these functions are Unicode strings.

Worse still, btoa() only accepts

characters in the range U+0000 to U+00FF

Plainly speaking, only text limited to these 256 code points (for example plain English alphanumeric text) works with btoa().
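The range limit is easy to observe. The following is a minimal sketch; tryBtoa is a hypothetical helper name, and the non-ASCII characters are written as \u escapes only to keep the snippet independent of the file encoding.

```javascript
// Hypothetical helper: returns the Base64 output, or a short error string
// when btoa() rejects the input.
function tryBtoa(text) {
  try {
    return btoa(text);
  } catch (err) {
    // btoa() throws when a character is above U+00FF.
    return "error: " + err.name;
  }
}

console.log(tryBtoa("a"));       // "YQ=="
console.log(tryBtoa("\u00E4"));  // "5A==" (U+00E4 "ä" is still in range)
console.log(tryBtoa("\u20AC"));  // an error string (U+20AC "€" is out of range)
```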

The purpose of encodeURIComponent(), which you have in both of your versions, is to help out with strings that contain characters outside the range U+0000 to U+00FF. An example would be the string "aä€", which has three characters:

  • a (U+0061)
  • ä (U+00E4)
  • € (U+20AC)

Here only the first two characters are in range. The third character, the Euro sign, is outside, and window.btoa("€") raises an out-of-range error. To avoid such an error, a solution is needed to represent "€" within the set of U+0000 to U+00FF. This is what window.encodeURIComponent does:

window.encodeURIComponent("aä€")
creates the following string:
"a%C3%A4%E2%82%AC" in which some characters have been encoded

  • a = a (stayed the same)
  • ä = %C3%A4 (changed to its UTF-8 representation)
  • € = %E2%82%AC (changed to its UTF-8 representation)

The change to the UTF-8 representation works by writing the character "%" followed by a two-digit hexadecimal number for each byte of the character's UTF-8 representation. The "%" is U+0025 and hence allowed inside the btoa() range. The result of window.encodeURIComponent("aä€") can then be fed to btoa(), as it no longer contains out-of-range characters:

btoa("a%C3%A4%E2%82%AC") // = "YSVDMyVBNCVFMiU4MiVBQw=="

The crux of using unescape() between btoa() and encodeURIComponent() is that each byte of the UTF-8 representation takes up 3 characters (%xx) in order to store all potential byte values 0x00 to 0xFF. This is where unescape() can play an optional role: it takes each such %xx sequence and puts in its place a single Unicode character in the allowed range U+0000 to U+00FF.

To check:

  • btoa(encodeURIComponent("aä€")).length // = 24
  • btoa(unescape(encodeURIComponent("aä€"))).length // = 8

The main difference is a shorter base64 representation of the text, at the cost of additional parsing via the optional escape()/unescape(), which is minimal anyway for text that is mainly ASCII.

The main lesson is that btoa() is misleadingly named and requires characters in the range U+0000 to U+00FF, which encodeURIComponent() generates by itself. The deprecated escape()/unescape() pair only saves space, which may be desirable but is not necessary. The problem of Unicode symbols above U+00FF is addressed here as the btoa/atob Unicode problem, which even mentions ways of Base64-encoding arbitrary Unicode text as UTF-8 that are possible in modern browsers.
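One such modern route can be sketched as follows. This is only an illustration, not taken from the linked material: it assumes the standard TextEncoder/TextDecoder APIs are available, and it reuses the names toBase64 and fromBase64 from the question as hypothetical wrappers. For the inputs shown it should give the same output as the unescape() variant, without calling any deprecated function.

```javascript
// Sketch: text -> UTF-8 bytes -> "binary string" -> Base64, and back.
function toBase64(text) {
  const bytes = new TextEncoder().encode(text);   // text -> UTF-8 bytes
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);          // one character per byte
  }
  return btoa(binary);
}

function fromBase64(base64) {
  const binary = atob(base64);                    // Base64 -> one character per byte
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);         // UTF-8 bytes -> text
}

console.log(toBase64("\u20ACuro"));               // "4oKsdXJv" (input is "€uro")
console.log(fromBase64(toBase64("\u20ACuro")));   // "€uro"
```

Building the binary string in a loop avoids passing a very large array as function arguments, which can fail for long inputs.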


TL;DR / Short Summary

Don't use btoa(encodeURIComponent(str)) and decodeURIComponent(atob(str)) - that's “nonsense”.

“Convert string to Base64” usually means “encode string as UTF-8 and encode the bytes as Base64”, and that's exactly what btoa(unescape(encodeURIComponent(str))) does. btoa(encodeURIComponent(str)) does something else that isn't useful for any case I can imagine, even though it never throws an error, as explained in humanityANDpeace's detailed answer.



What does “convert string to Base64” mean?

Base64 is a binary-to-text encoding: a sequence of bytes is encoded as a sequence of ASCII characters.¹ It is therefore not possible to directly encode text as Base64. It is conceptually always a two-step procedure:

  1. convert string to bytes (using some character encoding)
  2. encode bytes as Base64

In principle you can use any character encoding (also called charset² or Encoding Scheme) you want; it just needs to be able to represent all needed characters, and it has to be the same for both directions (text to Base64 and Base64 to text). As there are many different character encodings, the protocol or API should define which one is used. If an API expects a "string encoded via Base64" and doesn't mention the character encoding, you can nowadays usually assume that UTF-8 is expected.³

Base64-encoding the bytes from step 1 is pretty straightforward:
a) Take three input bytes to get 24 bits.
b) Split into four chunks of 6 bits each, to get four numbers in range 0...63.
c) Translate numbers to ASCII chars via table and add these chars to the output
d) Goto a)
More details about Base64 itself are out of the scope of this answer.
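As a rough illustration of steps a) to d), here is a minimal sketch. The helper name bytesToBase64 is hypothetical, and the "=" padding for inputs whose length is not a multiple of three is an addition that the step list above leaves out.

```javascript
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: encodes an array of byte values (0..255) as Base64.
function bytesToBase64(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = bytes.length - i;
    // a) Take up to three bytes and pack them into 24 bits (missing bytes count as 0).
    const chunk =
      (bytes[i] << 16) |
      ((remaining > 1 ? bytes[i + 1] : 0) << 8) |
      (remaining > 2 ? bytes[i + 2] : 0);
    // b) and c) Split into four 6-bit numbers and translate each via the table.
    out += ALPHABET[(chunk >> 18) & 63];
    out += ALPHABET[(chunk >> 12) & 63];
    // Padding: "=" stands in for output characters that carry no input bits.
    out += remaining > 1 ? ALPHABET[(chunk >> 6) & 63] : "=";
    out += remaining > 2 ? ALPHABET[chunk & 63] : "=";
    // d) The loop continues with the next three bytes.
  }
  return out;
}

console.log(bytesToBase64([0xe2, 0x82, 0xac])); // "4oKs"
console.log(bytesToBase64([0x61]));             // "YQ=="
```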

What does btoa do?

By now you might think: “This answer can't possibly be correct. It claims that it is not possible to directly encode text as Base64, even though this is exactly what btoa does - it takes text and spits out Base64.”

No. It does not take text and return Base64; it takes an argument of type string and returns Base64. But that string argument doesn't represent text, it is just a strange way to store a sequence of bytes. Each byte is represented by a character whose numerical code point value is equal to the value of the byte.⁴

A note in the HTML standard says that “the "b" can be considered to stand for "binary", and the "a" for "ASCII".” Contrary to popular opinion, I don't think that btoa is badly named. It does not take text, it takes binary data and produces an ASCII string using Base64, so a short form of “binary to ascii” is an absolutely correct name. It's the argument type that is misleading.

The definition of btoa in the HTML standard simply says:

[...] the user agent must convert that argument to a sequence of octets whose nth octet is the eight-bit representation of the code point of the nth character of the argument, and then must apply the base64 algorithm to that sequence of octets, and return the result.
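To make the quoted definition concrete, here is a small sketch. The helper name binaryStringToBytes is hypothetical; it simply reads back the octets that btoa would derive from its argument.

```javascript
// Hypothetical helper: the nth byte is the code point of the nth character.
function binaryStringToBytes(binary) {
  return Array.from(binary, (ch) => ch.charCodeAt(0));
}

// Three characters, all <= U+00FF. Read as bytes they are E2 82 AC,
// which happens to be the UTF-8 encoding of the Euro sign.
const binary = "\u00E2\u0082\u00AC";

console.log(binaryStringToBytes(binary)); // [226, 130, 172]
console.log(btoa(binary));                // "4oKs"
```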

I don't know, and probably never will know, why they didn't choose a different argument type, e.g. an array of numbers. Maybe the performance wasn't as good at the time when btoa was first specified?

What does unescape(encodeURIComponent(str)) do?

By now you could think: “If the first step in converting text to Base64 is encoding the text to bytes, then how is btoa(unescape(encodeURIComponent(str))) achieving that? btoa doesn't do that, but neither unescape nor encodeURIComponent seem to be in any way related to character encoding?”

Actually, encodeURIComponent is related to character encoding. The standard says:

The encodeURIComponent function computes a new [...] URI in which each instance of certain code points is replaced by [...] escape sequences representing the UTF-8 encoding of the code point.

So now we have the percent-encoded UTF-8 bytes. To convert the percent-encoded bytes to a binary string suitable for btoa, one can use unescape, because the behavior description states among other things:

  • If c is the code unit 0x0025 (PERCENT SIGN), then
    • [... how to decode %uXXXX ...]
    • Else if k ≤ length - 3 and [... two hexdigits follow ...] then
      • Set c to the code unit whose value is the integer represented by [...] the two hexadecimal digits at indices k + 1 and k + 2 within string.

Therefore after encodeURIComponent stored the UTF-8 bytes as %XX, unescape turns them into single codepoints exactly as required by btoa. So all in all btoa(unescape(encodeURIComponent(str))) encodes text to UTF-8 bytes which are then encoded to Base64.
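The intermediate stages can be inspected one by one. A minimal sketch, with illustrative variable names; the sample text is "aä€", written with \u escapes:

```javascript
const text = "a\u00E4\u20AC";                     // "aä€"

// Stage 1: UTF-8 bytes of non-ASCII characters, stored as %XX escape sequences.
const percentEncoded = encodeURIComponent(text);
console.log(percentEncoded);                      // "a%C3%A4%E2%82%AC"

// Stage 2: every %XX becomes one character whose code equals that byte.
const binaryString = unescape(percentEncoded);
const byteValues = Array.from(binaryString, (ch) => ch.charCodeAt(0));
console.log(byteValues.map((b) => b.toString(16))); // ["61", "c3", "a4", "e2", "82", "ac"]

// Stage 3: btoa() encodes those six bytes as Base64.
console.log(btoa(binaryString));                  // "YcOk4oKs"
```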

Back to the original question

In case you forgot, the question was:

(1) Why did the originally proposed solution include calls to escape() and unescape()? The solution was proposed prior to deprecation but presumably these functions added some kind of value at the time.

(2) Are there certain edge cases where my removal of these deprecated calls will cause my wrapper functions to fail?

Without unescape you don't get a Base64 representation of a UTF-8 encoded string. btoa(encodeURIComponent(str)) encodes text to some strange bytes (not a standardized Unicode Encoding Scheme, but the bytes one can get by storing an URI-encoded string as ASCII) which are then encoded as Base64. So unescape is necessary for standard conformance -- OK, encodeURIComponent and ASCII are also standardized, but nobody will expect that strange combination.

If only you yourself are converting to and from Base64, then yes, you could use btoa(encodeURIComponent(str)), and it will never throw an error, as explained in humanityANDpeace's detailed answer (question (2) is sufficiently answered, I think).

But in that case you would be better off using the result of encodeURIComponent directly. It already is pure ASCII and is always shorter than btoa(encodeURIComponent(str)). If you want a smaller size than encodeURIComponent(str), you can use btoa(unescape(encodeURIComponent(str))) (smaller if the input string contains more non-ASCII chars).

If you convert to Base64 because some other party, API or protocol expects Base64, then you simply cannot use btoa(encodeURIComponent(str)), because nobody understands the result.
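To see what such a receiver would get, consider this sketch. decodeUtf8Base64 is a hypothetical stand-in for any consumer that follows the usual convention (Base64 to bytes, bytes read as UTF-8); it assumes the standard TextDecoder API.

```javascript
// Hypothetical receiver: Base64 -> bytes -> UTF-8 text.
function decodeUtf8Base64(base64) {
  const bytes = Uint8Array.from(atob(base64), (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

const str = "\u20ACuro"; // "€uro"

// With unescape: the receiver recovers the original text.
console.log(decodeUtf8Base64(btoa(unescape(encodeURIComponent(str))))); // "€uro"

// Without unescape: the receiver ends up with the percent-encoded form.
console.log(decodeUtf8Base64(btoa(encodeURIComponent(str))));           // "%E2%82%ACuro"
```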

Oh, and btoa(unescape(encodeURIComponent(str))) couldn't really be “proposed prior to deprecation” of unescape:
unescape was removed from the ECMAScript standard in version 3, the same version that added encodeURIComponent. unescape was still explained in the document, but was moved to Annex B.2, whose introduction stated that it “suggests uniform semantics [...] without making the properties or their semantics part of this standard.” But as browsers have to be backwards compatible, it probably won't be removed any time soon.


Try for yourself:

function run(){
    let Base64Function=new Function("str", $("#algorithm").val());
    let base64=Base64Function($("#input").val());
    $("#Base64Text").text("Output: "+base64);
    let charset=$('#charset').val();
    let uri="data:text/plain"
           +(charset?";charset="+charset:'')
           +($("#interpret").prop('checked')?";base64":'')
           +","+base64;
    $("#dataURI").text(uri);
    $("#dataURI").attr('href', uri);
    $("#Base64iframe").attr('src',uri);
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

<label for="input">Text to encode:</label>
<input type="text" id="input" value="abc€😀"/><br />

<label for="algorithm">Encode function:</label>
<input type="text" id="algorithm" size="50"/><br />

<button type="button" onclick="run();">Run</button>
Defaults:
<button type="button" onclick='
    $("#algorithm").val("return btoa(unescape(encodeURIComponent(str)))");
    $("#charset").val("UTF-8");
    $("#interpret").prop("checked",true);
'>UTF-8 Base64</button>
<button type="button" onclick='
    $("#algorithm").val("return btoa(encodeURIComponent(str))");
    $("#charset").val(""); //I don't know - it's not UTF-8
    $("#interpret").prop("checked",true);
'>wrong</button>
<button type="button" onclick='
    $("#algorithm").val("return encodeURIComponent(str)");
    $("#charset").val("UTF-8");
    $("#interpret").prop("checked",false);
'>without btoa (not Base64)</button>
<br />

<div id="Base64Text">Output:</div>

<label for="charset">Interpret as this character encoding:</label>
<input type="text" id="charset" /><br />

<label for="interpret">Interpret as Base64:</label>
<input type="checkbox" id="interpret" /><br />

<div><a id="dataURI"></a></div>
<iframe id="Base64iframe"></iframe>

This snippet tests the Base64 result by creating a dataURI, but the concept applies to other applications of Base64 as well.


Note:

In some quotations I use [ and ] to leave out or shorten things that are unimportant in my opinion.
[... some text ...] is obviously not part of the source.

Footnotes:

¹ The standard says that Base64 “is designed to represent arbitrary sequences of octets” (octet means a byte consisting of eight bits).

² A character set is not exactly the same as a character encoding. However, a coded character set can always be considered to implicitly define a character encoding, therefore "character set" and "character encoding" are often used as synonyms. Maybe it once was the same? Sometimes the term charset is explicitly used as a short term for character encoding and not for character set.

³ At least UTF-8 is very dominant for websites. Also see UTF-8 Everywhere.

⁴ This is effectively the ISO-8859-1 encoding, but I wouldn't think of it this way. Better think bytes[i]==str.charCodeAt(i).