Unicode in five minutes (2013) (richardjharris.github.io)
279 points by jstanley 1 month ago | 41 comments





I originally laughed at "in five minutes", but even though I do not think the article reads in five minutes, it does a surprisingly good job of covering the basics: so good job!

I do wonder whether it is clear to people who are unfamiliar with Unicode. Can anyone who is mostly unfamiliar with the details the article covers say how comprehensible it is?

I would also add a mention of the standard Unicode collation table, which does a passable job for many languages at the same time. The Unicode Collation Algorithm is mentioned, and this table is its default, but I think this property of most UCA implementations is worth highlighting.

As for the article's gotchas, multilingual text gets even more complex once you go past 5 minutes, even for "simple" European scripts. E.g., in Bosnian/Croatian/Serbian written in the Roman/Latin alphabet, "nj" is capitalized to "Nj" or "NJ" depending on the rest of the word: "Njegoš" or "NJEGOŠ". Confusingly, Unicode also includes digraphs for both capitalization forms (the eternal tension in Unicode between encoding letters, glyphs or characters), even though they are linguistically equivalent. In practice they are never used, which makes their inclusion even more perplexing (they are always spelled out using two characters, and there was no historical reason since none of the 8-bit encodings had them)! "nj" will also sometimes be two distinct letters, especially in loanwords like "konjugovan". This makes things harder when you need to collate texts, since the proper order would be "konjugovan", "kontakt", "konj".
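For anyone who wants to poke at those digraph code points, here is a minimal Python sketch using only the standard library. The variable names are just illustrative, and it only shows what the Unicode character database says about the three "nj" forms, not how any real text in these languages is stored.

```python
import unicodedata

# The three precomposed "nj" digraph code points: lowercase, titlecase, uppercase.
for ch in ("\u01cc", "\u01cb", "\u01ca"):
    spelled = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> NFKC {spelled!r}")

lower_digraph = "\u01cc"
title_digraph = lower_digraph.title()   # titlecase mapping, the "Nj" form
upper_digraph = lower_digraph.upper()   # uppercase mapping, the "NJ" form
spelled_out = unicodedata.normalize("NFKC", lower_digraph)  # two plain letters
```

Compatibility normalization (NFKC) turns each digraph back into the two-character spelling that people actually type.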

All of this is why I like to joke how Cyrillic script is technically much better for all of these languages, even though it is basically in official use only for the Serbian language — in Cyrillic, there is no conundrum in either of the above examples since nj=њ (or нј), Nj/NJ=Њ, and the order is clear: конјугован, контакт, коњ.
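A rough Python sketch of that ordering point, assuming a hand-written alphabet string and a hypothetical `serbian_key` helper (this is not the Unicode Collation Algorithm, just a lookup into the conventional Serbian Cyrillic alphabet order):

```python
# Serbian Cyrillic alphabet in its conventional order (lowercase only).
SERBIAN_CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"

def serbian_key(word):
    """Map each letter to its alphabet position (hypothetical helper)."""
    return [SERBIAN_CYRILLIC.index(ch) for ch in word.lower()]

words = ["коњ", "контакт", "конјугован"]

by_code_point = sorted(words)                 # raw code point comparison
by_alphabet = sorted(words, key=serbian_key)  # alphabet-aware comparison

print(by_code_point)
print(by_alphabet)
```

Even in Cyrillic, a naive code point sort does not give the alphabet order, because ј sits after т in the code charts but before it in the alphabet. The difference is that a per-letter key is enough to fix it, which is not true for the Latin "nj".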


> I originally laughed at "in five minutes", but even though I do not think the article reads in five minutes, it does a surprisingly good job of covering the basics: so good job!

Slightly off topic, but just to riff on this a bit: maybe books and articles called "$THING in $NUMBER_OF $TIME_PERIODS" or "Learn $THING in $NUMBER_OF $TIME_PERIODS" should be retitled "$NUMBER_OF $TIME_PERIODS with $THING." It would be more accurate, not imply any sort of mastery, and, on top of that, sound a little more dignified. But, maybe it wouldn't sell as many books, so... ¯\_(ツ)_/¯.


Joel Spolsky's 2003 Joel On Software piece: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...


That piece has since fallen behind the times. I would recommend the submitted article rather than Spolsky's.

I'm not so sure about that. Spolsky's article seems way better at introducing someone to Unicode if they don't know anything about it. The OP article goes way deeper and has more interesting insights about Unicode itself, though.

Disclaimer: might be biased because I discovered Unicode through Spolsky's article.


I didn't post it as an alternative but as a "see also."

This is a really great summary of Unicode. I wish it had been available when I first started getting into the complexities of multilingual string searching and normalization. Ultimately, reading the official documentation (unicode.org) was necessary, but a succinct and clearly written introduction like this would have saved me hours (if not days) of effort.

Yeah, this is the kind of cut-to-the-damn-chase I want 90% of the time as an experienced developer touching technology I don't necessarily deep-dive every day, like an actual example of what NFKC does.

Even if it's too topical to be actionable in every case, it gives you the general idea and vocabulary to put together useful search queries when you want to know more.
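In that spirit, a quick Python illustration of what NFC and NFKC do, using the standard `unicodedata` module. The sample strings are my own picks, not taken from the article.

```python
import unicodedata

samples = {
    "e + combining acute": "e\u0301",
    "circled digit one": "\u2460",
    "fullwidth A": "\uff21",
    "superscript two": "\u00b2",
}

for label, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)    # canonical composition only
    nfkc = unicodedata.normalize("NFKC", text)  # also folds compatibility forms
    print(f"{label}: NFC={nfc!r} NFKC={nfkc!r}")
```

NFC only recombines canonically equivalent sequences (e plus accent becomes a single é), while NFKC additionally replaces "compatibility" characters such as circled or fullwidth forms with their plain counterparts.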


Honestly. It's so frustrating when people go on tangents and say in three paragraphs what you could say in two short sentences. An example: the rust book.


Something complementary -- because this article just takes a moment to talk about the different encoding schemes -- a wonderful, terse, very informative video describing how utf-8 encoding works and why (with a little history) by Tom Scott/Computerphile: https://www.youtube.com/watch?v=MijmeoH9LT4
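For those who prefer code to video, here is a small hand-rolled sketch of the UTF-8 bit layout in Python. The function name `utf8_encode` is made up for illustration; in real code you would just call `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand (illustrative only)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in "A\u00e9\u2603\U0001F600":
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(" "))
```

The leading byte says how many bytes follow, and every continuation byte starts with the bits 10, which is what makes the encoding self-synchronizing.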

I always loved the whimsy present in Unicode. For nostalgia, here's a HN post from 2010 pointing to the `Unicode Snowman for You` site (which is still up!)

https://news.ycombinator.com/item?id=2035572

And the site:

http://xn--n3h.net/


According to the HTML source, the original site was:

http://unicodesnowmanforyou.com/

I wish I understood what the keepers of Unicode were thinking by including so much bloat in a character set (or character encoding). I realize that Unicode is going to have a huge number of symbols no matter what, if they're going to represent all the world's languages and math and punctuation, but I'd draw the line at emoticons, emojis, playing card symbols, and snowmen.


One of the major goals of Unicode was to support round trip conversion from all the widely-used character sets into Unicode, and then back out again. In particular, supporting popular Japanese character sets was important for technical and commercial reasons.

There was a lot of weird stuff in the world's character sets.

Emoji were first used by Japanese cell phone carriers. They were encoded as Shift JIS characters, but in incompatible ways. The Unicode Consortium had no real interest in this until Google and Apple basically said, "If we're going to have to support all these character sets, could we please standardize them?"

I think it's just the reality of standardizing the world's character sets. A lot of weird legacy stuff will slip in, and other countries will want to standardize things that seem unnecessary. Personally, I'm very thankful that somebody wants to do all the exhausting political work of coming to a consensus. A few snowmen are a small price to pay.


Playing card symbols are -- like chess symbols -- typeset inline with ordinary text in books that deal with the strategy of those games. So, IMO it makes sense to include them in a character set that a font and typesetting engine will support.

Well right now it's about two percent of unicode, right?

And people use them as text, so there's a reason to add them and not much reason to refuse them.


You might be right, but where are you getting the 2% from? Are you thinking of just emoticons, emojis, playing card symbols, and snowmen? There's more than that I'd question.

I looked up how many emoji there were, added some for wingdings, and rounded up a bit.

What else would you question? Would it be more than 1500 more, which would bump it from 2 to 3 percent?


Hm... what's going on here?

    $ host xn--n3h.net
    host: 'xn--n3h.net.' is not a legal IDNA2008 name (string contains a disallowed character), use +noidnout
Looks like emoji were forbidden in IDNA2008... :'(
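If you want to see what is hiding behind that `xn--` label, Python's built-in `punycode` codec can decode it. This is only a sketch of the Punycode step; it does not perform any IDNA2003 or IDNA2008 validation.

```python
import unicodedata

label = "xn--n3h"
prefix = "xn--"  # ACE prefix that marks a Punycode-encoded label

decoded = label[len(prefix):].encode("ascii").decode("punycode")
category = unicodedata.category(decoded)  # general category of the single character

print(decoded, f"U+{ord(decoded):04X}", unicodedata.name(decoded), category)
```

The label decodes to a single symbol character rather than a letter or digit, which is consistent with `host` rejecting it as disallowed.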

It misses the security considerations for names. Almost nobody knows about them or implements them, e.g. for filenames or variable names.

If you're interested in this... https://en.wikipedia.org/wiki/Homoglyph

Also recently spotted as an avenue for attack in the wild:

Magecart group uses homoglyph attacks to fool you into visiting malicious websites: https://www.zdnet.com/article/magecart-group-uses-homoglyph-...

Homoglyph attacks used in phishing campaign and Magecart attacks https://securityaffairs.co/wordpress/106916/hacking/homoglyp...

https://cisomag.eccouncil.org/homoglyph-attacks/


You can easily avoid homoglyph attacks or similar stuff by following the relevant Unicode security considerations. I'm on the Moderately Restrictive level of the General Security Profile for identifiers. http://perl11.org/blog/unicode-identifiers.html

Forbidding mixed scripts fixes this attack. You also need to normalize names, and a few more minor things.
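A very crude Python sketch of the mixed-script idea. Big assumption: the standard library does not expose the Unicode Script property, so this guesses the script from the first word of each character's name. The helpers `scripts_used` and `looks_mixed` are hypothetical, and a real implementation would follow the restriction levels in the Unicode security documents rather than this heuristic.

```python
import unicodedata

def scripts_used(identifier: str) -> set:
    """Crude script guess: first word of each letter's Unicode name."""
    normalized = unicodedata.normalize("NFKC", identifier)
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in normalized
        if ch.isalpha()
    }

def looks_mixed(identifier: str) -> bool:
    """Flag identifiers whose letters appear to come from several scripts."""
    return len(scripts_used(identifier)) > 1

honest = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(scripts_used(honest), looks_mixed(honest))
print(scripts_used(spoofed), looks_mixed(spoofed))
```

Note that this would also flag legitimate mixed-script names, which is why the actual profiles have several levels with exceptions.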


Unicode is weird...this prints out backwards (including the comma and space) in the python3 repl:

  >>> [chr(0x07c0+i) for i in range(10)]
  ['߀', '߁', '߂', '߃', '߄', '߅', '߆', '߇', '߈', '߉']
0..9 in the N'Ko script BTW...

I don't get what you mean by backwards.

    py3> [chr(0x07c0+i) for i in range(10)]
    ['߀', '߁', '߂', '߃', '߄', '߅', '߆', '߇', '߈', '߉']

    
    js> [...Array(10)].map((_,i)=>String.fromCodePoint(0x07c0+i))
    ['߀', '߁', '߂', '߃', '߄', '߅', '߆', '߇', '߈', '߉']

I believe he's trying to print the 0..9 range by providing the proper start and end points for those characters but instead gets 9..0 (I don't know the script, but I'm basing that on the 0 at the end). So for instance 0x07c0 stands for 0 in the N'Ko script, and this is his starting point, but the entire sequence ends up being reversed. I'm not sure how comparing it to JS helps here, other than I guess pointing out that it's also behaving unexpectedly.

Wait, I just realized the results in my repl (0..9) are reversed from what I pasted into HN (9..0). And if you shrink the width of the browser to force my HN snippets to wrap, it changes the order. And it selects in the reverse order on click and drag.

I spoke way too soon. Unicode is weird. My apologies to our friend UncleEntity.


It looks like it depends on how your terminal (or the browser, or anything that renders it) handles Unicode (which I guess just means that Unicode is hard to get right): https://i.imgur.com/8FPNYMP.png

It's how the directionality (right to left or left to right) is decided that is complicated for mixed texts (and always nothing but a heuristic).

I must admit that I was surprised that the following snippet kept the LTR order in my terminal:

    >>> [(chr(ord('0')+i), chr(0x07c0+i)) for i in range(10)]
    [('0', '߀'), ('1', '߁'), ('2', '߂'), ('3', '߃'), ('4', '߄'), ('5', '߅'), ('6', '߆'), ('7', '߇'), ('8', '߈'), ('9', '߉')]
    >>> [(chr(0x07c0+i), chr(ord('0')+i)) for i in range(10)]
    [('߀', '0'), ('߁', '1'), ('߂', '2'), ('߃', '3'), ('߄', '4'), ('߅', '5'), ('߆', '6'), ('߇', '7'), ('߈', '8'), ('߉', '9')]
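To see what the renderer is working with, you can ask Python for the bidirectional class of each character. This is just a sketch of the inputs to the bidi algorithm, not an implementation of it, and the variable names are mine.

```python
import unicodedata

nko_digits = [chr(0x07C0 + i) for i in range(10)]

# Bidi class of each N'Ko digit: "R" means strong right-to-left.
nko_classes = [unicodedata.bidirectional(ch) for ch in nko_digits]

# Bidi classes of the ASCII characters that surround them in the printed list.
ascii_classes = {ch: unicodedata.bidirectional(ch) for ch in "0', "}

print(nko_classes)
print(ascii_classes)
```

As far as I understand the algorithm, the quotes, commas and spaces between two strong right-to-left characters have no strong direction of their own and get pulled into the right-to-left run, which would explain why the plain list displays reversed while the logical order in memory stays 0..9.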


The variation selector link is dead, but is archived.

https://web.archive.org/web/20160417233039/http://babelstone...


I've worked with Unicode for years and thought I had a good handle on its mechanics until I discovered this feature of the system last year. I was puzzling out why some symbol code points sometimes render in flat character style and other times as more graphic emoji, even when the same font and same code point is used in each case. Turned out it was a matter of applying VS15 or VS16 as a combining character, and which was the default for a given code point. Incredibly detailed stuff that this archived BabelStone article goes into in much greater depth than the bit I wrote about my exploration: https://khephera.net/posts/a-unicode-woe-solved/
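A minimal Python sketch of what those sequences look like at the code point level. I picked the snowman as the base on the assumption that it is one of the characters with registered text/emoji variation sequences; whether the two forms actually render differently depends on your font and platform.

```python
import unicodedata

base = "\u2603"                # SNOWMAN
text_style = base + "\ufe0e"   # base followed by VS15 (text presentation)
emoji_style = base + "\ufe0f"  # base followed by VS16 (emoji presentation)

for s in (base, text_style, emoji_style):
    print([f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s])

selector_category = unicodedata.category("\ufe0f")  # nonspacing mark
```

So the "same code point" can really be two different two-code-point strings that compare unequal, even though they look like one symbol.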

> it gives a (double-story) and a (single-story) the same codepoint.

But they did see fit to have ɑ (LATIN SMALL LETTER ALPHA), which is distinct from α (GREEK SMALL LETTER ALPHA).
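Easy enough to check from Python (a small sketch with the standard `unicodedata` module; the tuple of characters is just my selection):

```python
import unicodedata

letters = ("a", "\u0251", "\u03b1")  # Latin a, Latin alpha, Greek alpha

for ch in letters:
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} NFKC={folded!r}")

latin_alpha, greek_alpha = letters[1], letters[2]
```

Neither alpha has a decomposition, so even compatibility normalization keeps them apart.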



This is the first short intro to Unicode I have seen where the reader does not leave thinking that one user-perceived character must be just one code point.

Although they do mistakenly refer to ffi (U+FB03) as a character. Still better than most intros though.

It is a character (using Unicode's nomenclature).

It's not up to Unicode to decide; "ffi" is three distinct characters, not one.

And for that matter, even [unicode 88] admits that it isn't a character.

unicode 88: http://www.unicode.org/history/Unicode88.pdf search for "A ligature is a glyph"
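For what it's worth, here is how the character database describes U+FB03, as a short Python sketch. It doesn't settle the terminology argument; it only shows that the standard encodes it as a single code point with a compatibility mapping to three letters.

```python
import unicodedata

ligature = "\ufb03"

name = unicodedata.name(ligature)
decomposition = unicodedata.decomposition(ligature)  # compatibility mapping
expanded = unicodedata.normalize("NFKC", ligature)

print(name, repr(decomposition), len(ligature), "->", expanded, len(expanded))
```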


I was just reading an article about the Japanese surname Saito and how the idea of "uni"code (or possibly the dropped idea of Han unification) is problematic in real-life situations. Yes, you may have a code point, but that is only part of the problem, especially when it comes to human names.

This is a good introduction. Unfortunately, Unicode may ultimately be a problem in and of itself.

To start, consider that the term 'character' used in the article, though 'generally correct' ... is definitely not correct in the broadest sense.

Western, Cyrillic and Asian scripts boil down to 'characters' with some complexity maybe with ligatures ('Straße'), but it falls apart quickly for other languages.

Unfortunately, rather than creating rigorously applied definitions for things, and applying them consistently, even Unicode falls into this bureaucratic trap of vagaries with their own definitions.

So Unicode works well for most things, but then it falls off a cliff.

Here is the definitions section [1]

Just have a look at the definitions of 'Character', 'Grapheme' and 'Grapheme Cluster', and you start to see how confusion sets in very quickly.

Consider that in Unicode ... there isn't really such a thing as a 'character' - it's just an unspecific word we use that has no technical application! (When we say 'character' generally what we mean is 'Grapheme Cluster').

Language is itself a rabbit hole of complexity, so any standard trying to manage it will be painful - but it feels as though the true corner cases of Unicode are actually unbounded.

In short, too many pragmatic loose ends. Take any scenario where you think you have an algorithm sorted out, and there are probably holes in it if you cared to look for them in a specific language.

It's not 'bad', but it's not the uber solution, it's frayed at the edges.

[1] https://unicode.org/glossary/


  > Consider that in Unicode ... there isn't really such a thing as a 'character' 
This is a really important consideration, since it helps you realize the immense difficulty of rolling your own logic for character-aware handling - unless you are deliberately limiting your scope, like only handling NFC-normalized text of a limited number of languages.
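To make that concrete, here is a deliberately naive Python sketch of "rolling your own". The `crude_clusters` function is hypothetical and only glues combining marks onto the preceding character; it is nowhere near the real grapheme cluster rules and ignores things like joiner sequences, Hangul jamo and variation selectors.

```python
import unicodedata

def crude_clusters(text: str) -> list:
    """Group each combining mark with the character before it (naive)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # nonzero combining class: attach to the base
        else:
            clusters.append(ch)  # otherwise start a new cluster
    return clusters

word = "cafe\u0301"  # "café" written with a combining acute accent

print(len(word))             # number of code points
print(crude_clusters(word))  # what a reader would call characters
```

It gets the easy case right and silently gets many others wrong, which is pretty much the point being made here.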


