Handling character encodings in Python, or in any other language, can seem painful. New developers are often confused by exceptions such as UnicodeDecodeError and UnicodeEncodeError.

Python’s Unicode support is strong and stable, but it takes some time to master. There are many ways to encode text as binary data. This tutorial is not language-agnostic; it is deliberately Python-centric.

Table of Contents:

  1. Character Encoding
  2. String Module
  3. Addition of Bits
  4. Unicode
  5. UTF-8 vs. Unicode
  6. Python – Unicode

Unicode Coding in Python

Character Encoding

There are tens, if not hundreds, of character encodings. The best way to start understanding what they are is to cover one of the simplest, ASCII. ASCII is a good place to start learning about character encoding because it is a small and contained encoding.

It consists of the following:

  • Lowercase English letters: a to z
  • Uppercase English letters: A to Z
  • Digits: 0 to 9
  • Punctuation and symbols: “$” and “!”, to name a couple.
  • Whitespace characters: an actual space (“ ”), as well as newline, carriage return, horizontal tab, vertical tab, and a few others.
  • Non-printable characters: characters such as backspace, “\b”, that cannot be printed literally in the way that the letter “A” can.

At a very high level, a character encoding is a way of translating characters (such as letters, punctuation, symbols, whitespace, and control characters) into integers and, ultimately, into bits. Each character can be encoded to a unique sequence of bits.

The categories outlined above each represent a group of characters. Each character has a corresponding code point, which you can think of as an integer. Characters are segmented into different ranges within the ASCII table:

Code Point Range    Class
0 to 31             Control/non-printable characters
32 to 64            Punctuation, symbols, numbers, and space
65 to 90            Uppercase English letters, A to Z
91 to 96            Additional symbols, such as “[” and “\”
97 to 122           Lowercase English letters, a to z
123 to 126          Additional symbols, such as “{” and “|”
127                 Control/non-printable character (DEL)

The entire ASCII table contains 128 characters. If a character is not in the table, it cannot be expressed under the ASCII encoding scheme.
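
As a small sketch of that limit, the snippet below uses the built-in ord(), chr(), and str.encode() to look up code points and to show what happens when a character falls outside the table. The helper name is_ascii_char is hypothetical and exists only for this illustration.

```python
def is_ascii_char(ch: str) -> bool:
    """Return True if the single character ch has a code point in 0..127."""
    return ord(ch) < 128


print(ord("A"))   # 65, the code point of uppercase A
print(chr(97))    # a, the character at code point 97
print(is_ascii_char("A"), is_ascii_char("\u00e9"))  # True False

# Encoding a character outside the table with the ASCII codec raises an error.
try:
    "\u00e9".encode("ascii")  # "\u00e9" is e with an acute accent
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```

This is the same UnicodeEncodeError mentioned at the start of the article.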

String Module

Python’s string module is a convenient one-stop shop for string constants that fall within the ASCII character set.

Here is the core of the module:

# From lib/python3.7/string.py
whitespace = ' \t\n\r\v\f'
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + ascii_letters + punctuation + whitespace
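
One way these constants might be used is to check whether a string consists only of printable ASCII characters. The following is a minimal sketch; the function name is_printable_ascii is an assumption made for this example and is not part of the string module.

```python
import string


def is_printable_ascii(text: str) -> bool:
    """Return True if every character of text appears in string.printable."""
    return all(ch in string.printable for ch in text)


print(is_printable_ascii("Hello, world!"))  # True
print(is_printable_ascii("na\u00efve"))     # False: "\u00ef" is not ASCII
print(len(string.printable))                # 100
```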

Here is a way to represent ASCII strings as sequences of bits in Python. Each character of the ASCII string is pseudo-encoded into 8 bits, with spaces between the 8-bit sequences that each represent a single character.

>>> def make_bitseq(s: str) -> str:
...     if not s.isascii():
...         raise ValueError("ASCII only allowed")
...     return " ".join(f"{ord(i):08b}" for i in s)

>>> make_bitseq("bits")
'01100010 01101001 01110100 01110011'

>>> make_bitseq("CAPS")
'01000011 01000001 01010000 01010011'

>>> make_bitseq("$25.43")
'00100100 00110010 00110101 00101110 00110100 00110011'

>>> make_bitseq("~5")
'01111110 00110101'

The f-string f"{ord(i):08b}" uses Python’s Format Specification Mini-Language, which is a way of specifying the formatting of replacement fields in format strings. The left side of the colon, ord(i), is the actual object whose value is formatted and inserted into the output. The built-in ord() function returns the base-10 code point for a single str character.

The right side of the colon is the format specifier. 08 means width 8, padded with zeros, and b tells Python to output the resulting number in base 2 (binary).
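
The same mini-language covers other bases too. As a brief sketch, the variable name code_point below is only illustrative:

```python
code_point = ord("a")  # 97

print(f"{code_point:08b}")        # 01100001  (binary, width 8, zero-padded)
print(f"{code_point:03o}")        # 141       (octal, width 3)
print(f"{code_point:02x}")        # 61        (hexadecimal, width 2)
print(format(code_point, "08b"))  # 01100001  (same specifier via format())
```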

Addition of Bits

There is a critically important formula related to the definition of a bit. Given a number of bits, n, the number of distinct possible values that can be represented in n bits is 2^n:

def n_possible_values(nbits: int) -> int:
    return 2 ** nbits

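1 bit can express 2^1 = 2 possible values.

8 bits can express 2^8 = 256 possible values.

64 bits can express 2^64 = 18,446,744,073,709,551,616 possible values.

The converse question is how many bits are needed to represent a given number of distinct values, x. Solving 2^n = x for n gives n = log2(x), rounded up to a whole number of bits: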

>>> from math import ceil, log
>>> def n_bits_required(nvalues: int) -> int:
...     return ceil(log(nvalues) / log(2))
>>> n_bits_required(256)
8


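The reason for using a ceiling in n_bits_required() is to account for values that are not clean powers of 2. For example, to store a character set of 110 characters, 6 bits would fall short (2^6 = 64), so 7 bits are needed: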

>>> n_bits_required(110)
7

All of this serves to prove one concept: ASCII is, strictly speaking, a 7-bit code. The ASCII table contains 128 code points and characters, 0 through 127 inclusive, which requires 7 bits:

>>> n_bits_required(128)  # 0 through 127
7
>>> n_possible_values(7)
128
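
Floating-point logarithms can round unexpectedly for very large inputs. As an alternative sketch that is not part of the original walkthrough, the same count can be computed with integer arithmetic through int.bit_length(). The function name n_bits_required_int is hypothetical.

```python
def n_bits_required_int(nvalues: int) -> int:
    """Bits needed to represent nvalues distinct values (0 .. nvalues - 1)."""
    if nvalues < 1:
        raise ValueError("nvalues must be at least 1")
    # The largest value to store is nvalues - 1; bit_length() gives its width.
    return (nvalues - 1).bit_length()


print(n_bits_required_int(128))  # 7
print(n_bits_required_int(110))  # 7
print(n_bits_required_int(256))  # 8
```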

The issue with this is that modern computers do not store much of anything in 7-bit slots. They work in units of 8 bits, conventionally known as a byte, so each ASCII character is stored in a full byte whose highest bit is 0.
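
A short sketch of what that looks like in practice: encoding an ASCII string yields one byte per character, and every byte value stays below 128, so the eighth bit is never set. The variable name data is only illustrative.

```python
data = "bits".encode("ascii")  # a bytes object, one byte per character

print(list(data))                        # [98, 105, 116, 115]
print(all(byte < 128 for byte in data))  # True: the top bit is always 0
print([f"{byte:08b}" for byte in data])  # each byte as 8 binary digits
```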

Unicode

Unicode serves the same purpose as ASCII, but it encompasses a much larger set of code points. Several encodings emerged chronologically between ASCII and Unicode, but they are not worth covering here because Unicode and UTF-8 have become so predominant.

Think of Unicode as a massive version of the ASCII table, one with 1,114,112 possible code points. That is 0 through 1,114,111, or 0 through 17 * 2^16 - 1, or 0x10ffff in hexadecimal. ASCII is a perfect subset of Unicode: the first 128 characters of the Unicode table correspond exactly to the ASCII characters.
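
The arithmetic can be checked directly in Python. This is a quick sketch that compares the result with sys.maxunicode, which reports the largest code point the interpreter supports; the variable name max_code_point is only illustrative.

```python
import sys

max_code_point = 17 * 2**16 - 1

print(max_code_point)                    # 1114111
print(hex(max_code_point))               # 0x10ffff
print(max_code_point == sys.maxunicode)  # True on Python 3.3 and later
```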

Strictly speaking, Unicode itself is not an encoding. Rather, Unicode is implemented by different character encodings. It is more accurate to think of Unicode as a map or a two-column database table: it maps characters (like “v”, “@”, or even “†”) to distinct non-negative integers, the code points. A character encoding needs to offer a little more, because it must also say how those integers become bytes. Unicode contains nearly every character imaginable, including additional non-printable ones, such as the right-to-left mark (U+200F) used in text that mixes left-to-right and right-to-left scripts, for example an article containing paragraphs in both English and Arabic.
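
To see this character-to-integer map from Python, the standard library’s unicodedata module can look up the official name for a character, while ord() gives its code point. The sample characters below are chosen only for illustration.

```python
import unicodedata

# A letter, a symbol, a dagger, and the invisible right-to-left mark.
for ch in ["v", "@", "\u2020", "\u200f"]:
    print(f"U+{ord(ch):04X}", ord(ch), unicodedata.name(ch))
```

Note that unicodedata.name() raises ValueError for characters that have no name, such as the ASCII control characters, unless a default is passed as the second argument.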

UTF-8 vs. Unicode

Unicode does not tell you how to get actual bits from text, only code points. It does not say enough about how to convert text to binary data and vice versa. Unicode is an abstract encoding standard, not an encoding. That is where UTF-8 and other encoding schemes come into play. The Unicode standard (a map of characters to code points) defines several different encodings from its single character set.

UTF-8 and its lesser-used cousins, UTF-16 and UTF-32, are encoding formats for representing Unicode characters as binary data. UTF-8 and UTF-16 use a variable number of bytes per character, while UTF-32 always uses four.
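
The difference between the three formats shows up when the same character is encoded with each. The sketch below uses a hypothetical helper, encoded_lengths, and the little-endian codec names "utf-16-le" and "utf-32-le" so that no byte order mark is added to the output.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes ch occupies under each UTF encoding."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return {codec: len(ch.encode(codec)) for codec in codecs}


# An ASCII letter, an accented letter, the euro sign, and an emoji.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The ASCII letter takes 1 byte in UTF-8, while the emoji takes 4 bytes in all three encodings.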

Python – Unicode

Python 3 is all-in on Unicode, and on UTF-8 specifically:

  • Python 3 source code is assumed to be UTF-8 by default. This means that you do not need # -*- coding: UTF-8 -*- at the top of .py files in Python 3.
  • All text (str) is Unicode by default. Encoded Unicode text is represented as binary data (bytes).
  • The default encoding for str.encode() and bytes.decode() is UTF-8.
  • Python 3 accepts many Unicode code points in identifiers, so résumé = "~/Documents/resume.pdf" is valid if you want it.
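
A minimal sketch of the str and bytes round trip with the default encoding follows. The sample text is written with \u escapes so the snippet does not depend on how the file itself is saved.

```python
text = "r\u00e9sum\u00e9"     # the word "resume" with two accented e characters

data = text.encode()          # no argument: UTF-8 is the default
print(type(data).__name__)    # bytes
print(data)                   # b'r\xc3\xa9sum\xc3\xa9'
print(len(text), len(data))   # 6 8: each accented e takes two bytes
print(data.decode() == text)  # True: decoding also defaults to UTF-8
```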

There is one more property that is more nuanced: the default encoding of the built-in open() function is platform-dependent and depends on the value of locale.getpreferredencoding().

>>> # Mac OS X High Sierra
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

>>> # Windows Server 2012; other Windows builds may use UTF-16
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
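
Because of this platform dependence, a common precaution is to pass the encoding explicitly when opening text files. The following sketch writes and reads a temporary file; the file name example.txt and the sample text are placeholders chosen for illustration.

```python
import os
import tempfile

text = "caf\u00e9 \u20ac5"  # contains two non-ASCII characters

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")

    # An explicit encoding makes the result independent of the locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # Reading in binary mode shows the raw UTF-8 bytes on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(round_trip == text)  # True
print(raw)                 # b'caf\xc3\xa9 \xe2\x82\xac5'
```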