View Issue Details

ID: 0021101
Project: mantisbt
Category: bugtracker
View Status: public
Last Update: 2017-10-26 07:04
Reporter: vboctor
Assigned To: dregad
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: fixed
Product Version: 1.3.0-rc.2
Target Version: 1.3.0
Fixed in Version: 1.3.0
Summary: 0021101: Issues with emoji's are truncated before getting saved
Description

The following line gets truncated at the emoji after "Z3" when it is saved to the database and when the email notification is sent, even though the text I pasted in continues after that point.

Not compatible with my Xperia Z3

Tags: mantishub
Attached Files
emoji_text.txt (97 bytes)   
Not compatible with my Xperia Z3 😢😢 any help would be great as this game looks amazing 👍
Selection_002.png (10,486 bytes)   

Relationships

related to 0020431 (assigned, dregad): Use utf8mb4 charset for new MySQL installations
related to 0023549 (closed, atrol): Entering Emojis in comments with a user mention crashes with an error

Activities

dregad

2016-06-13 06:10

developer   ~0053357

I added the text you sent by e-mail as an UTF-8 text file attachment for the record.

Emoji are encoded as 4-byte UTF-8 sequences, so I would guess that the issue is a side effect of our using MySQL's 'utf8' charset, which only supports characters of up to 3 bytes. See 0020431 and more specifically my note 0020431:0052209:

"1. [...] (eventually, someone will face issues as they try to store 4-byte unicode chars, e.g. emoji or some CJK characters)".

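To illustrate the byte-length point, here is a minimal PHP sketch. It assumes PHP 7.2 or later with the mbstring extension loaded; the sample string mirrors the start of the attached emoji_text.txt, and nothing here is MantisBT code.

```php
<?php
// Start of the text from emoji_text.txt; "\u{1F622}" is the crying-face emoji.
$text = "Not compatible with my Xperia Z3 \u{1F622}";

// Split the string into individual UTF-8 characters and take the last one.
$chars = preg_split( '//u', $text, -1, PREG_SPLIT_NO_EMPTY );
$last = end( $chars );

// Code points from U+10000 upwards need 4 bytes in UTF-8, which is one byte
// more than MySQL's 'utf8' charset can store per character.
printf( "U+%X takes %d bytes\n", mb_ord( $last, 'UTF-8' ), strlen( $last ) );
```

Every other character in the sample is plain ASCII and takes a single byte, which is why the stored text stops right after "Z3".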
vboctor

2016-06-13 11:17

manager   ~0053366

I've tried this string with other services, and some of them replace the emojis with ??? but don't truncate the text. Can we do a similar workaround until the db support is done?

dregad

2016-06-18 09:34

developer   ~0053407

vboctor wrote: "Can we do a similar workaround until the db support is done?"

Certainly.

I believe the simplest approach would be to replace any character at or above U+10000 (i.e. any character that takes 4 bytes in UTF-8) with a given character or string (I'd suggest we use U+FFFD - �).
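A minimal sketch of that replacement, assuming PHP 7 or later with PCRE's UTF-8 mode available. The function name replace_4byte_chars() is made up for illustration and is not the code that was eventually committed.

```php
<?php
/**
 * Hypothetical helper (not MantisBT code): replace every character that
 * needs 4 bytes in UTF-8 with U+FFFD, the Unicode replacement character.
 */
function replace_4byte_chars( $p_string ) {
    // U+10000..U+10FFFF is exactly the range UTF-8 encodes on 4 bytes.
    // Note: preg_replace() returns null if the input is not valid UTF-8.
    return preg_replace( '/[\x{10000}-\x{10FFFF}]/u', "\u{FFFD}", $p_string );
}

$text = "Xperia Z3 \u{1F622}\u{1F622} any help would be great \u{1F44D}";
echo replace_4byte_chars( $text ), "\n";
```

Characters that fit in 1 to 3 bytes, such as accented Latin letters or the euro sign, pass through unchanged, so only text that MySQL's 'utf8' charset cannot store is touched.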

The question is whether we also need or want to store the original character somehow. For the crying face example you reported, we could replace it with something like '�[U+1f622]'. I'm not sure it's worth the effort.

That could make the display look bad if echoed as-is, especially if there are a lot of "invalid" characters (e.g. a sentence in Chinese), but on the other hand it would allow us to:

  • display the original character (at the expense of an extra preg_replace() call for each text display, of course; this could be done in MantisCoreFormatting)
  • convert any occurrence found in the DB back to the original character (by means of an upgrade function) once utf8mb4 support has been implemented
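The reversible variant could look roughly like the sketch below. It assumes PHP 7.2 or later with mbstring; both function names are hypothetical, and the marker format simply follows the '�[U+1f622]' example above.

```php
<?php
// Hypothetical helpers illustrating the '�[U+xxxx]' idea; not MantisBT code.

// Replace each 4-byte character with U+FFFD followed by its code point.
function encode_4byte_chars( $p_string ) {
    return preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]/u',
        function( array $p_match ) {
            return sprintf( "\u{FFFD}[U+%x]", mb_ord( $p_match[0], 'UTF-8' ) );
        },
        $p_string
    );
}

// Turn the markers back into the original characters, e.g. at display time
// or in an upgrade step once the database can store 4-byte characters.
function decode_4byte_chars( $p_string ) {
    return preg_replace_callback(
        '/\x{FFFD}\[U\+([0-9a-f]{5,6})\]/u',
        function( array $p_match ) {
            $t_char = mb_chr( hexdec( $p_match[1] ), 'UTF-8' );
            // Leave the marker untouched if the code point is not valid.
            return $t_char === false ? $p_match[0] : $t_char;
        },
        $p_string
    );
}
```

One caveat with this design: text that already contains a literal marker typed by a user would also be converted back, so the format is not fully unambiguous.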

This being a workaround, to minimize the impact on the code base, I would also apply it only to a few key fields: bug summary, description, steps to reproduce, additional info and bugnote text.

Let me know your thoughts.

dregad

2016-06-18 09:44

developer   ~0053408

Proof-of-concept: see attached screenshot 'Selection_002.png'

vboctor

2016-06-18 10:28

manager   ~0053409

Looks good. I would go with the simple approach of replacing 4-byte Unicode characters with �, similar to what you did in the proof of concept.

dregad

2016-06-18 16:33

developer   ~0053414

OK then. I'll submit a pull request after applying the workaround to the 3 bug fields.

I will also need to check whether this causes issues in the history and bug_revision tables.

dregad

2016-06-18 17:22

developer   ~0053417

PR https://github.com/mantisbt/mantisbt/pull/797

dregad

2016-06-18 18:27

developer   ~0053422

For the record, here are a couple of helper functions I used while testing:

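<pre>
// Return the UTF-8 encoded character for the given Unicode code point.
function utf8_chr( $ordinal ) {
    return mb_convert_encoding( '&#' . (int)$ordinal . ';', 'UTF-8', 'HTML-ENTITIES' );
}

// Return the Unicode code point of the first character of a UTF-8 string.
function utf8_ord( $p_char ) {
    $char = mb_substr( $p_char, 0, 1, 'utf-8' );
    $size = strlen( $char );

    // Mask off the length-marker bits of the lead byte.
    $ordinal = ord( $char[0] ) & ( 0xFF >> $size );
    // Append the payload bits of each continuation byte (10xxxxxx).
    for( $i = 1; $i < $size; $i++ ) {
        $ordinal = $ordinal << 6 | ( ord( $char[$i] ) & 0x7F );
    }
    return $ordinal;
}
</pre>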
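On PHP 7.2 and later, mbstring's built-in mb_chr() and mb_ord() cover the same ground as the helpers above. A quick check, assuming the mbstring extension is loaded:

```php
<?php
// Code point -> UTF-8 string (the crying-face emoji from this issue).
$char = mb_chr( 0x1F622, 'UTF-8' );

// UTF-8 string -> code point.
$ordinal = mb_ord( $char, 'UTF-8' );

// Show the byte length, the raw bytes and the code point.
printf( "%d bytes (%s), U+%X\n", strlen( $char ), bin2hex( $char ), $ordinal );
```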

Related Changesets

MantisBT: master-1.3.x 805ef0cb

2016-06-18 12:42

dregad


New database API function db_mysql_fix_utf8()

This new function replaces 4-byte UTF-8 chars by Unicode U+FFFD
character for MySQL databases.

This is a temporary workaround to avoid data getting truncated on MySQL
databases using native utf8 encoding which only supports 3 bytes chars,
until we're able to support utf8mb4 charset (see issue 0020431).

Fixes 0021101

Affected Issues: 0020431, 0021101
mod - core/database_api.php

MantisBT: master-1.3.x 4dcb16cc

2016-06-18 12:48

dregad


Fix 4-byte UTF-8 chars issues on MySQL

This applies the new db_mysql_fix_utf8() function to the following
fields:

- bug.summary
- bug.description
- bug.steps_to_reproduce
- bug.additional_information
- bugnote.text
- custom fields

Fixes 0021101

Affected Issues: 0021101
mod - core/bug_api.php
mod - core/bugnote_api.php
mod - core/cfdefs/cfdef_standard.php
mod - core/custom_field_api.php