View Issue Details

IDProjectCategoryView StatusLast Update
0020431mantisbtdb schemapublic2023-10-31 16:36
Reporterdregad Assigned Todregad  
PrioritynormalSeverityminorReproducibilityN/A
Status assignedResolutionopen 
Target Version2.27.0 
Summary0020431: Use utf8mb4 charset for new MySQL installations
Description

We currently create the database with 'utf8' charset and 'general_ci' collation.

In MySQL, utf8 charset uses up to 3 bytes, which means some characters can't be stored properly. Since MySQL 5.5.3 the 'utf8mb4' charset is available, and does not have this limitation.

We should use utf8mb4 for new installation.

In addition, we default to 'general' collation which sometimes yield incorrect sort order for special (multibyte) characters. See MySQL documentation [1] for examples and more explanation.

[1] http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

Tagsschema

Relationships

related to 0008017 closeddregad Increase the size of the username field 
parent of 0020465 closeddregad Reduce size of username and email fields to allow utf8mb4 charset 
has duplicate 0025691 closedatrol UTF8_mb4 allows unicode emoji's without own grafics 
related to 0021101 closeddregad Issues with emoji's are truncated before getting saved 
related to 0021841 closeddregad Minimum requirements for 2.x releases 
related to 0023549 closedatrol Entering Emojis in comments with a user mention crashes with an error 

Activities

dregad

dregad

2015-12-30 18:21

developer   ~0052209

Switching to utf8mb4 charset may not be as simple as I thought initially...

MySQL limits the maximum index key size depending on the engine used:

  • MyISAM: 1000 bytes [1]
  • InnoDB: 767 bytes [2] in MySQL < 5.7.7 (unless 'innodb_large_prefix' option is enabled), 3072 bytes in 5.7.7 and above (or when innodb_large_prefix == ON)

Considering 4 bytes per characters, the maximum length of a character field used as index key is:

  • MyISAM: 1000 / 4 = 250 chars
  • InnoDB: 767 / 4 = 191 chars

Schema update step 196 updated the user table's username field to 255 chars in length (see 0008017). Converting the table to utf8mb4 or trying to create it from scratch with that charset triggers the following error

1071: Specified key was too long; max key length is 767 bytes

We have the following options:

  1. forget about utf8mb4, keep using 3-byte utf8 (eventually, someone will face issues as they try to store 4-byte unicode chars, e.g. emoji or some CJK characters)
  2. reduce size of username column from 255 to 191 chars
  3. change the idx_user_username index to become a 'prefix' index, i.e. to only cover the first 191 chars of the username field

Option 3 is probably acceptable, since it seems highly unlikely that we would have email addresses longer than 191 chars to begin with, and even less to have 2 of them differ only on or after the 192nd char. Nevertheless there is still the technical possiblity that this situation will occur.

I would recommend number 2 as the safest option, and assuming we don't really need 255 chars for the username. I don't think we do; the original requirement from 0008017 was to allow storing email addresses as user identifier; 255 was set arbitrarily.

[1] http://dev.mysql.com/doc/refman/5.7/en/myisam-storage-engine.html
[2] http://dev.mysql.com/doc/refman/5.7/en/innodb-restrictions.html

dregad

dregad

2015-12-30 18:23

developer   ~0052210

Reminder sent to: atrol, vboctor

Your opinion on 0020431:0052209 would be appreciated.

vboctor

vboctor

2015-12-31 05:25

manager   ~0052211

Looking at the code under admin/check/ it seems that our current minimum requirement for MySQL is 5.0.8 and hence if we are planning to use utf8mb4, then we need to up such requirement to 5.5.3. Not sure how common is 5.5.3. Though if we end up waiting, then we could make the move directly to 5.7.7 where it wouldn't be necessary to reduce the field size.

I don't have a problem with reducing the max size for username. I assume based on the above, this is the only offending field.

atrol

atrol

2015-12-31 12:25

developer   ~0052214

To wait for 5.7.7 is no option.
Even the next version of LTS Ubuntu comes with an older version.

Ubuntu LTS versions that come with 5 years of security updates [1]
12.04 5.5.22
14.04 5.5.35
16.04 5.6.27

Red Hat Enterprise Linux is quite another story [2]
RHEL-5.1 5.0.95
RHEL-6.7 5.1.73
RHEL-7.2 no longer MySQL but MariaDB 5.5.44

So requiring 5.5.3 would rule out Red Hat.

It seems we should stay with utf8 at the moment.

[1] http://distrowatch.com/table.php?distribution=ubuntu
[2] http://distrowatch.com/table.php?distribution=redhat

dregad

dregad

2015-12-31 18:21

developer   ~0052215

Thanks to both of you for your comments.

Actually my intention was not to enforce utf8mb4 across the board, but rather to detect the mysql version at install time, and define the charset accordingly (utf8 if < 5.5.3, utf8mb4 otherwise), allowing instances running recent software to benefit from better unicode support..

We can keep MySQL 5.0.8 as minimal support, even though that's effectively end-of-life since 2013.

With regards to MariaDB, at least until version 5.5 it's supposed to be a "drop-in replacement" for MySQL, so (in theory) RHEL 7.2 should be just fine and use utf8mb4.

Not sure how things will evolve with MariaDB 10.x / MySQL 5.7 though, but that's another topic.

dregad

dregad

2016-01-01 19:25

developer   ~0052230

Pull request https://github.com/mantisbt/mantisbt/pull/699

vboctor

vboctor

2016-01-03 20:43

manager   ~0052239

Last edited: 2016-01-03 20:46

@dregad How about the option is using a prefix index and updating the API(s) that looks up by username to potentially handle more than one match? That is assuming the DBMS won't filter these independent of the index anyway. Which I think it should.

dregad

dregad

2016-01-04 10:39

developer   ~0052240

The problem is not with filtering, it is about ensuring the key's uniqueness, which can't be guaranteed with a prefix index.

Updating the API would not resolve this issue, and sounds like overengineering for an issue that is anyway quite unlikely to occur (have you ever heard of a 191-char long e-mail address ?)

dregad

dregad

2016-06-13 06:17

developer   ~0053358

Interesting note on Drupal's approach to handle index size limitation on utf8mb4 fields https://www.drupal.org/node/1314214

dregad

dregad

2016-06-18 16:32

developer   ~0053413

In order to (greatly) simplify the implementation of utf8mb4 charset support, including upgrade steps and future maintenance, I would like to propose that we increase the minimum version requirement for MySQL to 5.5.3 in 2.0.x.

5.5.3 was released in March 2010. It was actually a milestone release, the General availability is 5.5.8 (released 2010-12-03), but I don't see the need to require a higher version than we actually need.

As pointed out by atrol, this will prevent some distros from running our software, but I think that's an acceptable trade-off.

For the record, Drupal followed a similar approach, as documented in the link I referenced in my previous post.

atrol

atrol

2016-06-18 17:12

developer   ~0053416

we increase the minimum version requirement for MySQL to 5.5.3 in 2.0.x.

I would prefer this instead of implementing workarounds to fix 0021101.
Maybe target for 2.1.x if someone think it's to early to enforce 5.5.3.

dregad

dregad

2016-06-18 17:35

developer   ~0053418

I would prefer this instead of implementing workarounds to fix 0021101.

I agree, but I also believe that we do need a solution for 1.x as well, since depending on MySQL settings, use of any 4-byte char will either result in

  • the offending char and everything after it to be silently truncated
  • a DB error to occur, preventing data from being saved to the DB

The workaround I propose in 0021101 / PR https://github.com/mantisbt/mantisbt/pull/797 is fairly simple, and prevents an error which is more and more likely to occur as people using Mantis on smartphones are used to inserting emojis.

Of course will need to be reverted once utf8mb4 support has been implemented.

j_schultz

j_schultz

2016-11-02 10:37

reporter   ~0054374

This blog post suggests a different and interesting workaround compared to the pull requests referred to in the previous post: https://roartindon.blogspot.de/2015/04/hacking-utf16-to-work-around-mysqls.html
Rather than replacing everything with <?>, surrogate pairs are used. It requires some unicode decoding code so maybe it's not worth the effort, but still an interesting idea.

dregad

dregad

2016-11-24 09:01

developer   ~0054574

Changing target version since we'll have MySQL 5.5.3 as minimum requirement there.

atrol

atrol

2016-11-24 09:47

developer   ~0054577

since we'll have MySQL 5.5.3 as minimum requirement there

Did you consider 0020431:0052214 ?
Added also note to 0021841:0054575

travm1

travm1

2017-04-07 00:09

reporter   ~0056429

There is some generic info about UTF8mb4 here: https://www.everipedia.com/UTF8/

dregad

dregad

2017-04-07 03:21

developer   ~0056430

There is some generic info about UTF8mb4 here: https://www.everipedia.com/UTF8/

@travm1 what is your point ?

thE_iNviNciblE

thE_iNviNciblE

2019-04-11 16:37

reporter   ~0061894

cool that mantis will to that out of the box utf8mb4, without manuell database stuff with this quite cool mysql, mssql, postgre tool https://www.heidisql.com/ :-)

now mantis is able to process unicode... stuff like this is possible https://unicode.org/emoji/charts/full-emoji-list.html
would be more nice to work with smileys while typing reports, better workplace, not so conservativ.
you don't need grafics anymore.

thE_iNviNciblE

thE_iNviNciblE

2019-04-11 16:41

reporter   ~0061895

stuff like http://www.sonderzeichen.de/Emoticons-Alphabet.html this html charakter also works with 😁 for example ... � ���

i'll do a ticket, for supporting input with smileys...

thE_iNviNciblE

thE_iNviNciblE

2019-04-11 16:43

reporter   ~0061896

is the bugstracker mysql database really under UTF8mb4 correctly ?

in my previous ticket are the correct unicode smileys, normaly they are working... we all see not the smiley but �

rogueresearch

rogueresearch

2022-06-21 10:46

reporter   ~0066776

So my mantis mysql database is very old, initially created many many years ago. It still uses latin1 and latin1_swedish_ci. Since this issue is not fixed, I suppose I should migrate to (3 byte) utf8 instead of utfmb4 for now?

dregad

dregad

2022-06-22 03:58

developer   ~0066777

@rogueresearch, if you are going to migrate your database from latin1 now, IMHO you might as well go straight for utfmb4. I don't expect any negative impact from this (since utf8mb4 is a superset of the limited utf8 3-byte unicode encoding which was standard until MySQL 8), and you'll be ready for 4-byte unicode without the need for another migration. Just make sure you change the encoding for all tables and columns as well.

Related Changesets

MantisBT: master-1.3.x 805ef0cb

2016-06-18 12:42

dregad


Details Diff
New database API function db_mysql_fix_utf8()

This new function replaces 4-byte UTF-8 chars by Unicode U+FFFD
character for MySQL databases.

This is a temporary workaround to avoid data getting truncated on MySQL
databases using native utf8 encoding which only supports 3 bytes chars,
until we're able to support utf8mb4 charset (see issue 0020431).

Fixes 0021101
Affected Issues
0020431, 0021101
mod - core/database_api.php Diff File