Drupal - Why are we using utf8mb4_general_ci and not utf8mb4_unicode_ci?

It seems to me that the recommendation is outdated and that utf8mb4_unicode_ci will work without problems. It has been used by a lot of people for a long time.


There is a difference between changing the character set from utf8 to utf8mb4 (to support more codepoints) and changing the collation from general_ci to unicode_ci (to get more accurate sorting). Both changes can cause their own problems, so doing both independently makes sense.
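To make the character set half of that concrete, here is a minimal Python sketch. It assumes that MySQL's legacy utf8 character set stores at most 3 bytes per character while utf8mb4 stores up to 4. The helper names are made up for illustration and are not part of Drupal or MySQL.

```python
# Assumption: MySQL's legacy "utf8" stores at most 3 bytes per character,
# while "utf8mb4" stores up to 4. Helper names are hypothetical.
LEGACY_UTF8_MAX_BYTES = 3


def utf8_byte_length(char: str) -> int:
    """Return how many bytes a single character needs when encoded as UTF-8."""
    return len(char.encode("utf-8"))


def fits_legacy_utf8(text: str) -> bool:
    """True if every character of text fits in 3 bytes or fewer."""
    return all(utf8_byte_length(ch) <= LEGACY_UTF8_MAX_BYTES for ch in text)


if __name__ == "__main__":
    # ASCII letter, accented letter, euro sign, emoji outside the BMP
    for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
        print(repr(ch), utf8_byte_length(ch), fits_legacy_utf8(ch))
```

Only the last sample needs 4 bytes, and that is the kind of character that requires utf8mb4. None of this depends on the collation.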

utf8mb4 has been used by default since 8.0.0-beta12. The main issue seemed to be a change in the key length limitations for InnoDB, but as I understand it, utf8mb4 should have worked with the default MyISAM engine even before that change.
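The key length issue is plain arithmetic. The sketch below assumes the 767-byte index key limit of the older InnoDB row formats and the usual maximum of 3 bytes per character for utf8 and 4 for utf8mb4. The constant and function names are hypothetical, and newer MySQL configurations may allow longer keys.

```python
# Assumption: older InnoDB row formats cap a single-column index key at
# 767 bytes. MySQL reserves the maximum byte width for every character.
INNODB_KEY_LIMIT_BYTES = 767
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset: str, limit: int = INNODB_KEY_LIMIT_BYTES) -> int:
    """Longest VARCHAR length that can be fully indexed under the byte limit."""
    return limit // MAX_BYTES_PER_CHAR[charset]


def index_fits(charset: str, length: int, limit: int = INNODB_KEY_LIMIT_BYTES) -> bool:
    """True if a full index on VARCHAR(length) stays within the byte limit."""
    return length * MAX_BYTES_PER_CHAR[charset] <= limit


if __name__ == "__main__":
    for charset in MAX_BYTES_PER_CHAR:
        print(charset, max_indexable_chars(charset), index_fits(charset, 255))
```

Under these assumptions, a fully indexed VARCHAR(255) fits with utf8 (765 bytes) but not with utf8mb4 (1020 bytes), where the ceiling is 191 characters.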

Switching to unicode_ci shouldn't cause problems, but it may unexpectedly change the sort order on some sites.
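As a rough illustration of how the order can change, here is a toy Python sketch. It does not implement MySQL's collations. It only mimics one commonly cited difference, which I am assuming here: general_ci treats "ß" like "s", while unicode_ci treats it like "ss". The key functions are invented for this example.

```python
# Toy model only, NOT MySQL's real collation algorithm. It imitates a single
# assumed difference: general_ci compares "ß" as "s", unicode_ci as "ss".

def toy_general_ci_key(text: str) -> str:
    """Case-insensitive key with a one-to-one mapping for "ß"."""
    return text.lower().replace("ß", "s")


def toy_unicode_ci_key(text: str) -> str:
    """Case-insensitive key that expands "ß" into two characters."""
    return text.lower().replace("ß", "ss")


words = ["Maß", "Masa"]
general_order = sorted(words, key=toy_general_ci_key)
unicode_order = sorted(words, key=toy_unicode_ci_key)

if __name__ == "__main__":
    print("general_ci-like:", general_order)
    print("unicode_ci-like:", unicode_order)
```

In this toy model the two words swap places depending on the key function. That is the kind of change a site could see in sorted listings after switching collations.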

The default collation setting is just a default, and modules can choose their own collations if they need to. I also haven't found any documentation that says modules should expect a certain collation. The database install guide lacks a clear statement about which collations are supported, and it is inconsistent:

  • In the section about phpMyAdmin it says that you have to:

    Make sure you select COLLATION utf8_general_ci

  • Later, in the section about installation from the command line, general_ci doesn't seem to be required and any UTF-8 collation will do:

    Note: The database should be created with UTF-8 (Unicode) encoding, for example utf8_general_ci.

Furthermore, PostgreSQL is supported and it seems its default UTF-8 collation is equivalent to utf8mb4_unicode_ci, so using that with MySQL should be fine too.

Tags:

Database