Drupal - Why are we using utf8mb4_general_ci and not utf8mb4_unicode_ci?
It seems to me that the recommendation is outdated and that utf8mb4_unicode_ci
will work without problems. It has been used by a lot of people for a long time.
There is a difference between changing the character set from utf8 to utf8mb4 (to support more code points) and changing the collation from general_ci to unicode_ci (to get more accurate sorting). Both changes can cause their own problems, so making them independently makes sense.
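To illustrate the "more code points" part: MySQL's utf8 character set stores at most 3 bytes per character, while utf8mb4 stores up to 4. The following is a minimal Python sketch, not anything from Drupal itself; the helper names utf8_byte_length and fits_in_mysql_utf8 are made up for this example.

```python
# Sketch: find out which characters need the 4-byte form of UTF-8.
# Helper names are hypothetical and exist only for illustration.

def utf8_byte_length(ch: str) -> int:
    """Number of bytes the given character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_mysql_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes, i.e. would fit
    into MySQL's 3-byte utf8 character set without utf8mb4."""
    return all(utf8_byte_length(ch) <= 3 for ch in text)


if __name__ == "__main__":
    for sample in ["é", "€", "😀"]:
        print(repr(sample), utf8_byte_length(sample), fits_in_mysql_utf8(sample))
```

Characters outside the Basic Multilingual Plane, such as most emoji, are the ones that need the fourth byte. None of this affects sorting, which is what the collation controls.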
utf8mb4 has been the default since 8.0.0-beta12. The main issue seems to have been a change in the key length limitations for InnoDB, but as I understand it, utf8mb4 should have worked with the default MyISAM engine even before that change.
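The key length issue comes down to simple arithmetic. The sketch below assumes the limit in question is the 767-byte index key prefix limit of the older InnoDB row formats (COMPACT and REDUNDANT); that assumption, the constant, and the function names are mine and not taken from the Drupal issue.

```python
# Sketch: why an indexed VARCHAR(255) stops fitting after a switch to utf8mb4.
# Assumption: a 767-byte index key prefix limit (older InnoDB row formats).
INNODB_MAX_KEY_BYTES = 767

# Maximum bytes MySQL reserves per character for each character set.
BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}


def index_bytes(charset: str, length: int) -> int:
    """Worst-case bytes needed to index a column of `length` characters."""
    return BYTES_PER_CHAR[charset] * length


def max_indexable_chars(charset: str) -> int:
    """Longest column, in characters, that still fits into the key limit."""
    return INNODB_MAX_KEY_BYTES // BYTES_PER_CHAR[charset]


if __name__ == "__main__":
    for charset in BYTES_PER_CHAR:
        needed = index_bytes(charset, 255)
        fits = needed <= INNODB_MAX_KEY_BYTES
        print(charset, needed, fits, max_indexable_chars(charset))
```

Under that assumption a fully indexed 255-character column fits with utf8 (765 bytes) but not with utf8mb4 (1020 bytes), which is why indexed columns are often limited to 191 characters.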
Switching to unicode_ci shouldn't cause problems, but it may unexpectedly change the sort order on some sites.
The default collation setting is just a default and modules can choose their own collations anyway if they need to. I also haven't found any documentation that says modules should expect a certain collation. The database install guide just lacks a clear statement about which collations are supported and is inconsistent:
In the section about phpMyAdmin it says that you have to "Make sure you select COLLATION utf8_general_ci".
Later, in the section about installation from the command line, general_ci doesn't seem to be required and any UTF-8 collation will do: "Note: The database should be created with UTF-8 (Unicode) encoding, for example utf8_general_ci."
Furthermore, PostgreSQL is supported and it seems its default UTF-8 collation is equivalent to utf8mb4_unicode_ci, so using that with MySQL should be fine too.