
From the article:

> I think UTF-32 is the simplest. Each number is put into a 32-bit integer, or 4 bytes. This is called a “fixed-width” encoding.

This is wrong. The 4-byte fixed encoding is UCS-4, and while UCS-4 currently encodes all of Unicode, there's no guarantee it will "forever", just as people who assumed UCS-2 was "all" of Unicode have since been proven (very) wrong.

UTF-32 is a variable width encoding even if right now there's no reason to use surrogates out to the next plane (particularly because the next plane is entirely inaccessible to UTF-16 and only UTF-16, but UTF-16 is still one of the world's most common encodings).



UTF-32 is a fixed-width encoding. To quote directly from the Unicode Standard, section 3.9, definition D90:

> UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.
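To make that definition concrete, here is a minimal sketch using Python's built-in `utf-32-be` codec (big-endian, no BOM). The helper name `utf32_units` and the sample characters are my own choices for illustration, not anything from the standard.

```python
# Split UTF-32 (big-endian, no BOM) output into 32-bit code units.
def utf32_units(text):
    data = text.encode("utf-32-be")
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]

# Sample scalar values from the BMP, a supplementary plane, and the very top
# of the codespace (U+10FFFF).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600", "\U0010FFFF"]

for ch in samples:
    units = utf32_units(ch)
    # One code unit per scalar value, numerically equal to the scalar value.
    assert units == [ord(ch)]

# For contrast, UTF-16 needs two 16-bit code units (a surrogate pair)
# for anything above U+FFFF.
utf16_len_bytes = len("\U0001F600".encode("utf-16-be"))  # 4 bytes = 2 code units
```

Every sample, including U+10FFFF, comes out as exactly one code unit, which is what "fixed-width" means here.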

UCS-4 is defined in ISO 10646. Before the 2011 revision of the standard, UCS-4 had a 31-bit codespace; since then, the definitions of the ISO and Unicode standards have converged, and UCS-4 and UTF-32 are now synonymous.

> while UCS-4 currently encodes all of Unicode, there's no guarantee it will "forever".

There is. Unicode has a codespace of U+0000..U+10FFFF. There are 2^16 + 2^20 − 2^11 = 1,112,064 assignable codepoints; in Unicode 15, a total of 149,186 (13.4%) have been assigned.
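The arithmetic is easy to check. A quick sketch in Python (the variable names are mine; the 149,186 figure is the one quoted above):

```python
BMP = 2**16             # U+0000..U+FFFF
SUPPLEMENTARY = 2**20   # U+10000..U+10FFFF, i.e. 16 more planes
SURROGATES = 2**11      # U+D800..U+DFFF, reserved for UTF-16

code_points = BMP + SUPPLEMENTARY            # size of U+0000..U+10FFFF
scalar_values = code_points - SURROGATES     # what can actually be assigned

assigned = 149_186                           # Unicode 15 figure quoted above
share = round(100 * assigned / scalar_values, 1)

print(code_points, scalar_values, share)
```

`code_points` is 0x10FFFF + 1, i.e. 17 planes of 65,536 code points each.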

Even if the Unicode codespace were to ever be extended again (it won't), the only encoding that would become incompatible is UTF-16. In fact, both UTF-8 and UTF-32 are trivially extensible and used to be wider encodings, but were restricted to 0x10FFFF arbitrarily to match UTF-16 limitations.
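To illustrate the "used to be wider" part for UTF-8: the original bit layout (as in RFC 2279) had 5- and 6-byte forms covering 31 bits. Below is a rough sketch of an encoder for that older layout. The function name `encode_utf8_extended` is hypothetical, and the sketch deliberately does not reject surrogates or values above 0x10FFFF, which a modern conforming encoder must do.

```python
def encode_utf8_extended(cp):
    """Encode an integer with the original 1-6 byte UTF-8 layout (up to 31 bits).

    Illustration only: not a conforming modern UTF-8 encoder.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, total bytes, lead-byte marker bits)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if cp < limit:
            break
    out = bytearray(length)
    # Fill continuation bytes from the end, 6 payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    # Whatever bits remain go into the lead byte.
    out[0] = lead | cp
    return bytes(out)

# Agrees with the standard encoder inside today's codespace...
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600", "\U0010FFFF"):
    assert encode_utf8_extended(ord(ch)) == ch.encode("utf-8")

# ...and keeps going past U+10FFFF, where modern UTF-8 stops.
first_beyond = encode_utf8_extended(0x110000)
largest = encode_utf8_extended(0x7FFFFFFF)
```

UTF-32 needs no such sketch: a 32-bit code unit can already hold values above 0x10FFFF, the definition just forbids them.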

EDIT: My wording is very ambiguous. There are actually 286,719 (25.8%) assigned codepoints in Unicode 15: https://www.unicode.org/versions/stats/charcountv15_0.html. Still, the point stands.


> Even if the Unicode codespace were to ever be extended again (it won't), the only encoding that would become incompatible is UTF-16. In fact, both UTF-8 and UTF-32 are trivially extensible and used to be wider encodings, but were restricted to 0x10FFFF arbitrarily to match UTF-16 limitations.

Mind you, it wouldn’t be this easy. Things should perform Unicode validation, and many do, so every piece of software would need to be updated for the new, enlarged version of Unicode, UTF-8 and UTF-32. Old software that validated would baulk at anything from the new ranges or convert it into REPLACEMENT CHARACTER.
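As an example of a validating decoder, here is roughly how Python's built-in codecs treat input that would mean U+110000 under a hypothetical enlarged UTF-8 or UTF-32. The byte strings and the helper `strict_rejects` are made up for this sketch; the codecs themselves are the standard ones.

```python
# What U+110000 would look like if the encodings were simply extended.
beyond_utf8 = b"\xf4\x90\x80\x80"
beyond_utf32 = (0x110000).to_bytes(4, "big")

def strict_rejects(data, encoding):
    """Return True if the default (strict) decoder refuses the input."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False

utf8_rejected = strict_rejects(beyond_utf8, "utf-8")
utf32_rejected = strict_rejects(beyond_utf32, "utf-32-be")

# With errors="replace", the out-of-range input turns into U+FFFD
# REPLACEMENT CHARACTER instead of raising.
utf8_replaced = beyond_utf8.decode("utf-8", errors="replace")
utf32_replaced = beyond_utf32.decode("utf-32-be", errors="replace")
```

Both behaviours, rejecting and replacing, lose the new characters, so any such decoder would have to be updated before an enlarged codespace could round-trip.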

But yeah, UTF-16 would be toast.


True, another extension would be disastrous; we may never recover from the fallout of malformed UCS-2-esque UTF-16. Let's just hope no one fixes all the broken, accidentally forward-compatible decoders in the practically infinite amount of time it will take to completely fill the Unicode codespace.



