Use case for multibyte safe programming

You won't notice anything is wrong until a Norwegian guy called Øystein Øvretveit signs up and your website breaks. What could have happened?

Let me explain. In PHP, a technique to get a single character from a string is to treat the string as an array, or to use a function like substr(). Here is the array approach:

$string = 'This is a string';
$string[0]; // first character: T
$string[1]; // second: h
$string[-1]; // last: g

Indexing with [] also works in JavaScript, although not with a negative index. A big difference from JavaScript is that PHP does not do this in a 'multibyte safe' way: it counts bytes, not characters. More on this later on. See:

$string1 = 'Café';
$string2 = 'Österreich';
$string1[0]; // C
$string1[-1]; // �
$string2[0]; // �
$string2[-1]; // h

These � characters are incomplete byte sequences, so they are not valid UTF-8. If they end up in an array or string that is used as input for a JSON endpoint, they could break the complete REST service, because JSON encoders can fail on malformed characters. In Drupal you could get a PHP error like this one:

Symfony\Component\Serializer\Exception\NotEncodableValueException: Malformed UTF-8 characters, possibly incorrectly encoded in Symfony\Component\Serializer\Encoder\JsonEncode->encode() (line 63 of /Users/dries/sites/🦄/vendor/symfony/serializer/Encoder/JsonEncode.php).
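The failure can be reproduced without Drupal or Symfony. The following is a minimal sketch using PHP's built-in json_encode(), assuming the file is saved as UTF-8; the variable names and the 'initial' key are only illustrative. It also shows the JSON_INVALID_UTF8_SUBSTITUTE flag (PHP 7.2 and later), which is one possible way to keep the encoder from failing, not necessarily what the Symfony serializer does.

```php
<?php
// Assumption: this file is saved as UTF-8, so "Ø" is stored as two bytes.
$name = 'Øystein';
$firstByte = $name[0]; // only the first of the two bytes that encode "Ø"

// json_encode() refuses malformed UTF-8 and returns false.
$encoded = json_encode(['initial' => $firstByte]);
$error = json_last_error_msg();
// $error: "Malformed UTF-8 characters, possibly incorrectly encoded"

// With this flag the broken byte is replaced by U+FFFD instead of failing.
$safe = json_encode(['initial' => $firstByte], JSON_INVALID_UTF8_SUBSTITUTE);
```

Substituting the character hides the symptom only. The initial is still wrong, so the real fix is to cut the string in a multibyte safe way in the first place.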

Maybe it's the array technique; how about substr() to get single characters and substrings? Let's try:

substr('Österreich', 0, 100); // Österreich
substr('Österreich', 0, 10); // Österreic
substr('Österreich', 0, 3); // Ös
substr('Österreich', 0, 2); // Ö
substr('Österreich', 0, 1); // �
substr('Österreich', 0, 0); // empty string
substr('Café', 0, 4); // Caf�
substr('Café', 0, 5); // Café
strlen('Café'); // 5
strlen('é'); // 2
strlen('🤯'); // 4

There is something strange going on. Letters with diacritics like 'é' and 'Ö' are not seen as a single character. In UTF-8 each of them is stored as two bytes, and functions like substr(), strlen(), and [] count bytes, completely ignoring the character encoding of the string. PHP has functions that fix this behavior, the so-called 'Multibyte String Functions':

mb_substr('Österreich', 0, 1); // Ö
mb_substr('Café', 0, 4); // Café
mb_strlen('Café'); // 4
mb_strlen('é'); // 1
mb_strlen('👍'); // 1

Check out Multibyte String Functions on PHP.net
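To see why the byte-based functions and the mb_ functions disagree, it can help to look at the raw bytes. This is a small sketch, assuming the mbstring extension is enabled; the variable names are only illustrative, and the "\u{...}" escapes (PHP 7.0 and later) are used so the result does not depend on how the file itself is saved.

```php
<?php
// "\u{E9}" is é, written as an escape so the bytes are always UTF-8.
$e = "\u{E9}";
$cafe = "Caf\u{E9}";

$hex = bin2hex($e);                  // "c3a9": two bytes for one character
$bytes = strlen($cafe);              // 5, counts bytes
$chars = mb_strlen($cafe, 'UTF-8');  // 4, counts characters

// substr() cuts after 4 bytes, which is halfway through the é.
$cut = substr($cafe, 0, 4);
$cutIsValid = mb_check_encoding($cut, 'UTF-8');      // false

// mb_substr() cuts after 4 characters, so the result stays valid.
$safeCut = mb_substr($cafe, 0, 4, 'UTF-8');
$safeIsValid = mb_check_encoding($safeCut, 'UTF-8'); // true
```

Passing 'UTF-8' explicitly is a design choice for the sketch: it avoids relying on the configured internal encoding.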

So, referring back to the first paragraph: what could have happened? Maybe there was a block on the homepage, populated by an API, listing all the recently registered users. As a nice bonus, the initials were displayed for every user without an avatar. The function taking the first character of each name, ØØ in this case, returned �� instead, which threw an exception in JsonEncode.php and set off a cascade of errors that crashed the homepage.
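A multibyte safe version of such an initials function could look roughly like this. It is a sketch only: the function name initials() is hypothetical and not taken from any real codebase, and it assumes UTF-8 input and an enabled mbstring extension.

```php
<?php
/**
 * Hypothetical helper: returns the first character of every word in a name.
 * Assumes $fullName is valid UTF-8.
 */
function initials(string $fullName): string
{
    // The /u modifier makes the pattern treat the subject as UTF-8.
    // preg_split() returns false on invalid UTF-8, hence the fallback.
    $parts = preg_split('/\s+/u', trim($fullName), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $initials = '';
    foreach ($parts as $part) {
        // mb_substr() takes one character, not one byte.
        $initials .= mb_strtoupper(mb_substr($part, 0, 1, 'UTF-8'), 'UTF-8');
    }

    return $initials;
}

echo initials('Øystein Øvretveit'); // ØØ
```

With substr($part, 0, 1) in place of mb_substr() the same function would return two broken bytes for this name, which is exactly the scenario described above.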

It's good practice to use these 'mb_' functions if you are planning on manipulating strings, especially when these are sourced from user input. Diacritics are rare in the English language, but not non-existent. Some words like the aforementioned café, and names like Chloë or Renée, could break your website. For most other languages it's even more important to be aware of this potential problem.
