Unicode Stereotype
Unicode is a confusing concept that is rarely taught properly in university computer science programs. At least, my CS program didn’t cover it. I confess that I held the stereotype that it is not especially important or useful.
Recently I had to deal with it at work, and it was even more confusing because I was dealing with Unicode in Python 2. Time to dive deep!
This post collects my highlights and summaries of some great material I found on Unicode and character sets.
Unicode is a theoretical concept.
Unicode is about how to theoretically represent EVERY character in the universe. It’s NOT about how characters are stored in a computer (on disk or in memory).
Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. - Joel Spolsky
In Unicode, a letter maps to something called a code point, which is still just a theoretical concept.
U+0639 is the Arabic letter Ain. The English letter A would be U+0041.
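The mapping between characters and code points can be inspected directly. This is a minimal sketch that assumes Python 3 (where `str` holds Unicode characters); the helper name `u_notation` is made up for illustration.

```python
import unicodedata

# A code point is just a number: ord() gives the number for a character,
# chr() gives the character for a number.
code_point_a = ord("A")
ain = chr(0x0639)


def u_notation(ch):
    """Format one character in the conventional U+XXXX notation (hypothetical helper)."""
    return "U+{:04X}".format(ord(ch))


print(u_notation("A"))        # U+0041
print(u_notation(ain))        # U+0639
print(unicodedata.name(ain))  # ARABIC LETTER AIN
```

Nothing here says how many bytes either character occupies; that only gets decided once an encoding is chosen.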
Encoding
It does not make sense to have a string without knowing what encoding it uses. - Joel Spolsky
There Ain’t No Such Thing As Plain Text. - Joel Spolsky
But how do we store a code point like U+0048 in a computer (memory and storage)? Encoding!
- UTF-16: encodes U+0048 in two bytes, 00 48. (Each hex digit takes 4 bits, so two hex digits make one byte.)
- UTF-8: encodes U+0048 in one byte, 48.
- But what if it's U+7148? UTF-8 then needs three bytes (E7 85 88), while UTF-16 still uses two (71 48). Well, Americans don't care, since English characters are fine with one byte.
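The byte counts above can be checked with a short sketch, again assuming Python 3. The helper `encoded_hex` is hypothetical, and the `utf-16-be` codec is used on purpose: the plain `utf-16` codec prepends a byte order mark, which would hide the two-byte result.

```python
def encoded_hex(code_point, encoding):
    """Return the bytes that `encoding` produces for one code point, as a hex string."""
    return chr(code_point).encode(encoding).hex()


for cp in (0x0048, 0x7148):
    for enc in ("utf-16-be", "utf-8"):
        # Prints the code point, the encoding name, and the encoded bytes in hex.
        print("U+{:04X} {:>9}: {}".format(cp, enc, encoded_hex(cp, enc)))
```

The same code point yields different bytes under different encodings, which is why the bytes alone are meaningless without knowing which encoding produced them.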
Unicode in Python
In Python 2, str contains sequences of 8-bit values, unicode contains sequences of Unicode characters. str and unicode can be used together with operators if the str only contains 7-bit ASCII characters. - Effective Python
In Python 3, bytes contains sequences of 8-bit values, str contains sequences of Unicode characters. bytes and str instances can’t be used together with operators (like > or +). - Effective Python
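The Python 3 rule can be illustrated with a small sketch. It runs on Python 3 only, and `try_concat` is a hypothetical helper that exists just to capture the error instead of crashing.

```python
raw = b"\xe7\x85\x88"       # the UTF-8 bytes for U+7148
text = raw.decode("utf-8")  # decoding turns bytes into a str of Unicode characters


def try_concat(a, b):
    """Return a + b, or None when Python refuses to mix the two types."""
    try:
        return a + b
    except TypeError:
        return None


mixed = try_concat("code point: ", raw)   # str + bytes is rejected in Python 3
fixed = try_concat("code point: ", text)  # str + str works

# Decoding the same bytes with the wrong encoding succeeds silently
# but produces different, meaningless text.
wrong = raw.decode("latin-1")
print(mixed, fixed, len(text), len(wrong))
```

Decoding bytes at the boundary of the program and working with `str` everywhere else avoids both the `TypeError` and the silent mis-decoding shown by `wrong`.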
Reference
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky:
- Effective Python: 59 Specific Ways to Write Better Python
- Cover photo from unicode.org