What is the difference between utf8mb4 and utf8 charsets in MySQL?
The utf8mb4
character set is useful because nowadays we need support for storing not only language characters but also symbols, newly introduced emojis, and so on.
A nice read on How to support full Unicode in MySQL databases by Mathias Bynens can also shed some light on this.
UTF-8 is a variable-length encoding. In the case of UTF-8, this means that storing one code point requires one to four bytes. However, MySQL's encoding called "utf8" (alias of "utf8mb3") only stores a maximum of three bytes per code point.
So the character set "utf8"/"utf8mb3" cannot store all Unicode code points: it only supports the range 0x000 to 0xFFFF, which is called the "Basic Multilingual Plane". See also Comparison of Unicode encodings.
This is what (a previous version of the same page at) the MySQL documentation has to say about it:
The character set named utf8[/utf8mb3] uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters:
For a BMP character, utf8[/utf8mb3] and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, utf8[/utf8mb3] cannot store the character at all, while utf8mb4 requires four bytes to store it. Since utf8[/utf8mb3] cannot store the character at all, you do not have any supplementary characters in utf8[/utf8mb3] columns and you need not worry about converting characters or losing data when upgrading utf8[/utf8mb3] data from older versions of MySQL.
So if you want your column to support storing characters lying outside the BMP (and you usually want to), such as emoji, use "utf8mb4". See also What are the most common non-BMP Unicode characters in actual use?.
Taken from the MySQL 8.0 Reference Manual:
utf8mb4
: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
utf8mb3
: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
In MySQL utf8
is currently an alias for utf8mb3
which is deprecated and will be removed in a future MySQL release. At that point utf8
will become a reference to utf8mb4
.
So regardless of this alias, you can consciously set yourself an utf8mb4
encoding.
To complete the answer, I'd like to add the @WilliamEntriken's comment below (also taken from the manual):
To avoid ambiguity about the meaning of
utf8
, consider specifyingutf8mb4
explicitly for character set references instead ofutf8
.