Unicode and Character Sets in Python

Tags: Software Engineering
Publish Date: Aug 16, 2021

Unicode Stereotype

Unicode is a confusing concept that is rarely taught properly in university computer science programs. At least, my CS program didn’t cover it. I confess that I held the stereotype that it is neither important nor useful.
Recently I had to deal with it at work. This time it is even more confusing, since I am dealing with Unicode in Python 2. Time to dive deep!
This post collects my highlights and summaries of some great material I found on Unicode and character sets.

Unicode is a theoretical concept.

💡
Unicode is about how to represent, in theory, EVERY character in the universe. It’s NOT about how characters are stored in a computer (on disk or in memory).
Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. - Joel Spolsky
In Unicode, a letter maps to something called a code point, which is still just a theoretical concept.
For example, U+0639 is the Arabic letter Ain, and the English letter A is U+0041.
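To make the code point idea concrete, here is a minimal sketch in Python 3 (not the Python 2 I was fighting with at work). The helper name code_point_label is my own invention for illustration; ord and unicodedata.name come from the standard library.

```python
import unicodedata


def code_point_label(ch):
    """Return the conventional U+XXXX label for a single character.

    Hypothetical helper, named here only for illustration.
    """
    return f"U+{ord(ch):04X}"


# Look up the code point and the official Unicode name of each character.
for ch in ["A", "\u0639"]:
    print(code_point_label(ch), unicodedata.name(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+0639 ARABIC LETTER AIN
```

Note that nothing here says how the character is stored. A code point is just a number assigned to a character.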

Encoding

It does not make sense to have a string without knowing what encoding it uses. - Joel Spolsky
There Ain’t No Such Thing As Plain Text. - Joel Spolsky
But how do we store a code point like U+0048 in a computer (in memory or on disk)? With an encoding!
  • UTF-16: encodes U+0048 in two bytes, 00 48 in big-endian order (each hex digit represents 4 bits).
  • UTF-8: encodes U+0048 in one byte, 48.
    • But what if it's U+7148? Then UTF-8 needs more than one byte (three, for this code point). Well, Americans don't care, since English characters are fine with one byte.
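The byte counts above can be checked with a short Python 3 sketch. The helper encoded_hex is a hypothetical name of mine. I use the "utf-16-be" codec so the output has no byte order mark and matches the 00 48 ordering.

```python
def encoded_hex(ch, encoding):
    """Encode one character and return the resulting bytes as a hex string.

    Hypothetical helper, named here only for illustration.
    """
    return ch.encode(encoding).hex()


# Same code points, different encodings, different bytes.
for ch in ["\u0048", "\u7148"]:
    for encoding in ["utf-8", "utf-16-be"]:
        print(f"U+{ord(ch):04X}", encoding, encoded_hex(ch, encoding))
# U+0048 utf-8 48
# U+0048 utf-16-be 0048
# U+7148 utf-8 e78588
# U+7148 utf-16-be 7148
```

So U+0048 fits in one UTF-8 byte, while U+7148 takes three bytes in UTF-8 and two in UTF-16.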

Unicode in Python

In Python 2, str contains sequences of 8-bit values, unicode contains sequences of Unicode characters. str and unicode can be used together with operators if the str only contains 7-bit ASCII characters. - Effective Python
In Python 3, bytes contains sequences of 8-bit values, str contains sequences of Unicode characters. bytes and str instances can’t be used together with operators (like > or +). - Effective Python
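A rough Python 3 sketch of what that separation means in practice. The helpers to_str and to_bytes are illustrative names in the spirit of the book's advice to convert at the boundaries of a program, and the UTF-8 default is my assumption, not a rule.

```python
def to_str(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it first if it is str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


# Mixing the two types with an operator fails in Python 3.
try:
    b"caf" + "\u00e9"
except TypeError as exc:
    print("mixing bytes and str failed:", exc)

# Converting explicitly, with a known encoding, works.
print(to_str(b"caf\xc3\xa9"))
print(to_bytes("caf\u00e9"))
```

The explicit encoding argument is the point: a run of bytes only becomes text once you know which encoding produced it.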
 

Reference