From the article: > I think UTF-32 is the simplest. Each number is put into a 32...

e4m2 · on June 12, 2023

UTF-32 is a fixed-width encoding. To quote directly from the Unicode standard, chapter 3.9, definition 90:

> UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.
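To make the definition concrete, here is a minimal Python sketch. The helper name `utf32_code_units` is made up for illustration; it only uses the standard `utf-32-le` codec and `struct`, and shows that each scalar value becomes exactly one 32-bit code unit with the same numeric value:

```python
import struct

def utf32_code_units(text: str) -> list[int]:
    """Return the UTF-32 code units of `text` (hypothetical helper).

    Encodes as little-endian UTF-32 without a BOM, then reads the bytes
    back as unsigned 32-bit integers.
    """
    raw = text.encode("utf-32-le")
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))

sample = "a\u00e9\u20ac\U0001F600"  # 1-, 2-, 3- and 4-byte characters in UTF-8
units = utf32_code_units(sample)

# One code unit per scalar value, numerically equal to the scalar value.
assert units == [ord(ch) for ch in sample]
# Fixed width: always 4 bytes per scalar value.
assert len(sample.encode("utf-32-le")) == 4 * len(sample)
```

Note that a Python `str` can hold lone surrogates, which are not scalar values; the codec refuses to encode those, so the sketch assumes well-formed input.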

UCS-4 is defined in ISO 10646. Before the 2011 revision of that standard, UCS-4 had a 31-bit codespace; since then, the ISO and Unicode definitions have converged, and UCS-4 and UTF-32 are now synonymous.

> while UCS-4 currently encodes all of Unicode, there's no guarantee it will "forever".

There is. Unicode has a codespace of U+0000..U+10FFFF. There are 2^16 + 2^20 − 2^11 = 1,112,064 assignable codepoints; as of Unicode 15, a total of 149,186 (13.4%) have been assigned.

Even if the Unicode codespace were to ever be extended again (it won't), the only encoding that would become incompatible is UTF-16. In fact, both UTF-8 and UTF-32 are trivially extensible and used to be wider encodings, but were restricted to 0x10FFFF arbitrarily to match UTF-16 limitations.
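As a rough illustration of where the 0x10FFFF ceiling comes from, the sketch below decodes surrogate pairs by hand. The function name `decode_surrogate_pair` is invented for this example; the arithmetic is the standard UTF-16 surrogate formula. The largest possible pair lands exactly on U+10FFFF, while the bit patterns of UTF-8 and UTF-32 have room to spare:

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a code point.

    Hypothetical helper for illustration only.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The smallest and largest pairs bound the supplementary planes.
assert decode_surrogate_pair(0xD800, 0xDC00) == 0x10000
assert decode_surrogate_pair(0xDBFF, 0xDFFF) == 0x10FFFF

# By contrast, a 4-byte UTF-8 sequence carries 3 + 6 + 6 + 6 = 21 payload
# bits, and a UTF-32 code unit carries 32, so neither format runs out of
# bits at 0x10FFFF.
FOUR_BYTE_UTF8_MAX = (1 << 21) - 1
assert FOUR_BYTE_UTF8_MAX == 0x1FFFFF
assert FOUR_BYTE_UTF8_MAX > 0x10FFFF
```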

EDIT: My wording is very ambiguous. There are actually 286,719 (25.8%) assigned codepoints in Unicode 15: https://www.unicode.org/versions/stats/charcountv15_0.html. Still, the point stands.
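For anyone who wants to check the arithmetic, a small Python sketch follows. The constant and function names are made up for illustration, and the two counts are simply the figures quoted in this comment:

```python
CODESPACE = 0x10FFFF + 1           # U+0000..U+10FFFF, i.e. 17 planes
SURROGATES = 0xDFFF - 0xD800 + 1   # U+D800..U+DFFF, never assignable
SCALAR_VALUES = CODESPACE - SURROGATES

assert CODESPACE == 2**16 + 2**20
assert SURROGATES == 2**11
assert SCALAR_VALUES == 1_112_064

def share(count: int) -> float:
    """Percentage of the assignable codespace, rounded to one decimal."""
    return round(100 * count / SCALAR_VALUES, 1)

# Figures as quoted in the comment for Unicode 15.
assert share(149_186) == 13.4
assert share(286_719) == 25.8
```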

chrismorgan · on June 12, 2023

> Even if the Unicode codespace were to ever be extended again (it won't), the only encoding that would become incompatible is UTF-16. In fact, both UTF-8 and UTF-32 are trivially extensible and used to be wider encodings, but were restricted to 0x10FFFF arbitrarily to match UTF-16 limitations.

Mind you, it wouldn’t be this easy. Things should perform Unicode validation, and many do, so every piece of software would need to be updated for the new, enlarged versions of Unicode, UTF-8 and UTF-32. Old software that validates would baulk at anything from the new ranges, or convert it into REPLACEMENT CHARACTER.
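CPython's built-in codecs are one example of a validating decoder. The sketch below feeds them the bit patterns that a hypothetical extended UTF-8 and UTF-32 would use for U+110000; the variable names are made up, and the byte string is only what the existing 4-byte UTF-8 layout would produce for that value:

```python
# What U+110000 would look like if the current bit layouts were extended.
extended_utf8 = b"\xf4\x90\x80\x80"
extended_utf32 = (0x110000).to_bytes(4, "little")

def is_rejected(data: bytes, encoding: str) -> bool:
    """True if the strict decoder refuses `data` (hypothetical helper)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False

# Strict decoding baulks at both.
assert is_rejected(extended_utf8, "utf-8")
assert is_rejected(extended_utf32, "utf-32-le")

# Lenient decoding substitutes U+FFFD REPLACEMENT CHARACTER instead.
replaced = extended_utf8.decode("utf-8", errors="replace")
assert set(replaced) == {"\ufffd"}

# The last valid code point still round-trips.
assert "\U0010FFFF".encode("utf-8").decode("utf-8") == "\U0010FFFF"
```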

But yeah, UTF-16 would be toast.

e4m2 · on June 12, 2023

True, another extension would be disastrous; we might never recover from the fallout of malformed, UCS-2-esque UTF-16. Let's just hope no one fixes all the broken, accidentally forward-compatible decoders in the practically infinite amount of time it will take to completely fill the Unicode codespace.