
"UTF-16 native" doesn't mean your "UTF-16 unit" (i.e. what would `charCodeAt` return) should not exceed 16 bits. JS implementations already do not use UTF-16 as a sole native representation anyway, so the migration should be relatively easy.


> JS implementations already do not use UTF-16 as the sole native representation anyway

But they still expose UTF-16 code unit semantics: U+10000–U+10FFFF come through as two UTF-16 code units, a surrogate pair.
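Concretely, this is just today's standard JS string behavior for an astral code point:

    const s = "\u{1F600}";                       // U+1F600, one astral code point
    console.log(s.length);                       // 2, i.e. two UTF-16 code units
    console.log(s.charCodeAt(0).toString(16));   // "d83d", the high surrogate
    console.log(s.charCodeAt(1).toString(16));   // "de00", the low surrogate
    console.log(s.codePointAt(0).toString(16));  // "1f600", the whole code point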

> "UTF-16 native" doesn't mean your "UTF-16 unit" (i.e. what would `charCodeAt` return) should not exceed 16 bits.

True enough. You could redefine it to have UTF-16 code unit semantics up to U+10FFFF and code point semantics beyond that, maybe call this “UTF-16+ code unit semantics”. It’d definitely break a few things here and there, but then, anything would break a few things here and there, regardless of encoding, since the range U+0000–U+10FFFF is hard-coded in so many places. Well, anything except doing surrogates again. And I’m pretty sure that’s what would be done.
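A rough, purely illustrative sketch of what such “UTF-16+ code unit semantics” could look like (the function name and shape are made up here, not any existing proposal): code points up to U+10FFFF keep today's UTF-16 code units, surrogate pairs included, while anything beyond comes back as a single unit wider than 16 bits.

    // Hypothetical "UTF-16+" unit generator (illustrative only).
    function* utf16PlusUnits(codePoints) {
      for (const cp of codePoints) {
        if (cp <= 0xFFFF || cp > 0x10FFFF) {
          yield cp;                          // one unit, possibly wider than 16 bits
        } else {
          const v = cp - 0x10000;
          yield 0xD800 + (v >> 10);          // high surrogate
          yield 0xDC00 + (v & 0x3FF);        // low surrogate
        }
      }
    }

    console.log([...utf16PlusUnits([0x41, 0x1F600, 0x110000])]
      .map(u => u.toString(16)));            // ["41", "d83d", "de00", "110000"]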


Exactly this. In fact I first thought of https://ucsx.org/, then realized that JS strings do allow lone surrogates, and then further realized that none of this matters at all, because the external storage format is entirely disconnected from the JS semantics. So as long as we can agree on which storage format to standardize on and upgrade everything accordingly (which would really be a drama), JS wouldn't make an additional dent.
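The lone-surrogate point is easy to see at that boundary: the string itself is perfectly fine inside JS, and it only becomes an issue once you hand it to an encoder.

    const lone = "\uD800";                             // a lone high surrogate, a valid JS string
    console.log(lone.length);                          // 1
    console.log([...new TextEncoder().encode(lone)]);  // [239, 191, 189], i.e. U+FFFD on the way out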



