TIL the assumption that string length does not change when upper-cased is false!

@movonw But only because that toUpperCase() call does not use or implement the upper-case ẞ, which is a thing now.
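For anyone who wants to try it, here is a minimal sketch in Node using only the built-in `toUpperCase()`. The sample word is just an illustration, not taken from the original post.

```javascript
// Upper-casing can change the length: "ß" (U+00DF) has no single-character
// uppercase in the default mapping, so it becomes "SS".
const lowerWord = "straße";
const upperWord = lowerWord.toUpperCase();

console.log(upperWord);                            // STRASSE
console.log(lowerWord.length, upperWord.length);   // 6 7

// The capital sharp s "ẞ" (U+1E9E) exists in Unicode,
// but toUpperCase() does not produce it.
console.log(upperWord.includes("\u1E9E"));         // false
```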

@irrelefant I wonder if in other scripts/languages there are other similar cases though?

@irrelefant @movonw i wonder how well Node handles Dutch ij ➡️IJ

@irrelefant @movonw A similar thing happens when you transliterate Serbian Cyrillic to Latin. The Latin alphabet has no single letters for certain sounds, so they are written as digraphs, and when capitalised we only capitalise the first letter: Nj, Lj, Dž.

@movonw Turkish comes to mind with the 'ğ', which does not exist in an upper-case variant, if I remember correctly

@irrelefant @movonw 'Ğ' is the upper-case of "soft g" in Turkish. It's also in the Azerbaijani alphabet, as well as the Latin alphabets of Zazaki, Laz, Crimean Tatar, Tatar, and Kazakh.

@movonw Related: you also can't assume that:

s.toLower().toUpper().toLower() == s.toLower()

because some case conversions are ambiguous. IIRC, Turkish has two different lowercase letters that upcase to the same.
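A rough way to see the round trip fail in JavaScript, without involving any locale. The helper name `roundTrip` and the two sample characters are my own picks for illustration.

```javascript
// Lower -> upper -> lower, using the default (locale-insensitive) mappings.
const roundTrip = (s) => s.toLowerCase().toUpperCase().toLowerCase();

// "ı" (U+0131, dotless i) and "ſ" (U+017F, long s) are already lowercase,
// but their uppercase forms "I" and "S" lowercase to different letters.
for (const s of ["\u0131", "\u017F"]) {
  console.log(JSON.stringify(s.toLowerCase()), "->", JSON.stringify(roundTrip(s)));
}

console.log(roundTrip("\u0131") === "\u0131".toLowerCase()); // false
```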

@temporal @movonw which ones exactly? I can't think of any as a native :D

TBH it's late, i might not be thinking clearly

(That first sentence came out too angrily, i didn't mean it, just a question but sorry)

@baykanguru I checked and the case I was thinking about is that of dotted vs. dotless "i".

stackoverflow.com/questions/52

Turns out I was wrong about the source of the problem. The case conversion isn't ambiguous - if you're converting under the appropriate locale. If you aren't (and arguably most developers don't even know what locales are), the dotted "i" will get uppercased into dotless "I".
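In JavaScript terms, the difference looks roughly like this. It is a sketch that assumes the runtime ships Turkish locale data (standard Node builds should); without it, the locale-aware calls may fall back to the default mapping.

```javascript
// Default (locale-insensitive) mapping: dotted "i" becomes dotless "I".
const defaultUpper = "i".toUpperCase();
console.log(defaultUpper); // I

// Turkish-aware mapping. Expected results, assuming Turkish locale data
// is available: "i" -> "İ" (U+0130) and "I" -> "ı" (U+0131).
const turkishUpper = "i".toLocaleUpperCase("tr-TR");
const turkishLower = "I".toLocaleLowerCase("tr-TR");
console.log(turkishUpper, turkishLower);
```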

@movonw

@temporal @movonw localisation generally is a mess, basically. There will definitely be some stuff you haven't considered, especially in stuff with a GUI.

@temporal @movonw this is actually why Unicode has the concept of "case-folding" where you convert each character to a "case-folded" version that ensures that any characters which can be reached via a chain of toLower or toUpper all fold to the same character.

in most cases, this is just the lowercase form, but it has a few exceptions to watch out for
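JavaScript has no built-in case-folding function as far as I know, so here is only a rough stand-in: upper-case first, then lower-case. The helpers `approxFold` and `equalsIgnoreCaseApprox` are hypothetical names, and this is not the real Unicode case-folding table, just an approximation that happens to catch a few of the exceptions.

```javascript
// Approximate fold key: upper-case, then lower-case.
// NOT the Unicode CaseFolding.txt mapping, only a rough stand-in.
function approxFold(s) {
  return s.toUpperCase().toLowerCase();
}

// Hypothetical helper: compare two strings by their approximate fold keys.
function equalsIgnoreCaseApprox(a, b) {
  return approxFold(a) === approxFold(b);
}

// Plain lower-casing misses "ß" vs "SS" ...
console.log("Straße".toLowerCase() === "STRASSE".toLowerCase()); // false
// ... while the approximate fold treats them as equal.
console.log(equalsIgnoreCaseApprox("Straße", "STRASSE"));        // true
// Long s "ſ" (U+017F) and "s" also end up with the same key.
console.log(equalsIgnoreCaseApprox("\u017F", "s"));              // true
```

For real work, a library that implements the Unicode folding data is probably the safer choice.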

@movonw an equalsIgnoreCase call in Java involving a 65kb string can take up to 2000ms on an Intel i7-12700KF

@movonw
This is also particularly fun in programming languages with a Unicode-capable char-type.

'ß'.toUpperCase() cannot return a char. It needs to return a string or some char-iterable.
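JavaScript has no char type, but the same effect shows up if you count code points. A small sketch, with `upperCodePoints` being a made-up helper name:

```javascript
// Return the code points of the upper-cased form of a single input code point.
function upperCodePoints(ch) {
  return Array.from(ch.toUpperCase());
}

console.log(upperCodePoints("ß"));      // [ 'S', 'S' ]
console.log(upperCodePoints("\uFB03")); // "ﬃ" ligature -> [ 'F', 'F', 'I' ]
```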

@movonw Are these edge-cases, top-cases, bottom-cases, brief-cases, or cold-cases?
