My Unicode cheat sheet
python, julia, vim, rust, unicode
I wanted to make a cheat sheet for myself containing the things I reach for when working with Unicode, and how Unicode is handled in Vim, Python, Julia and Rust.
First some basics:
Unicode code points are unique mappings from integers (conventionally written in hexadecimal) to an abstract character, concept or graphical representation. These graphical representations may look visually similar but can represent different ideas. For example, A, Α, А and Ａ are all different Unicode code points (a short snippet for inspecting them follows this list):

- A : U+0041 LATIN CAPITAL LETTER A
- Α : U+0391 GREEK CAPITAL LETTER ALPHA
- А : U+0410 CYRILLIC CAPITAL LETTER A
- Ａ : U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
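A quick way to inspect these is from Python (a minimal sketch using only the standard library; any of the languages covered below can do the same):

```python
import unicodedata

# Four characters that render similarly but are distinct code points.
for ch in ["A", "Α", "А", "Ａ"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# U+0041 LATIN CAPITAL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA
# U+0410 CYRILLIC CAPITAL LETTER A
# U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
```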
The Unicode Consortium defines a grapheme as "what a user thinks of as a character". Multiple code points may be used to represent a single grapheme. For example, my name in Devanagari and Tamil can be written as 3 graphemes, but it consists of 4 and 5 code points respectively in these languages:

- DEVANAGARI: दीपक
- द : U+0926 DEVANAGARI LETTER DA
- ी : U+0940 DEVANAGARI VOWEL SIGN II
- प : U+092A DEVANAGARI LETTER PA
- क : U+0915 DEVANAGARI LETTER KA
- TAMIL: தீபக்
- த : U+0BA4 TAMIL LETTER TA
- ீ : U+0BC0 TAMIL VOWEL SIGN II
- ப : U+0BAA TAMIL LETTER PA
- க : U+0B95 TAMIL LETTER KA
- ் : U+0BCD TAMIL SIGN VIRAMA
Additionally, multiple ideas may be defined as a single code point. For example, the grapheme ﷺ translates to "peace be upon him" and is defined as the single code point U+FDFA:

- ﷺ : U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
And to make matters more complicated, graphemes and visual representations of code points may be more than a single column wide, even in monospaced fonts. See the code point U+FDFD:
- ﷽ : U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM
Code points also fall into different categories: normal, pictographic, spacers, zero-width joiners, controls, etc.
The same idea, i.e. code point, can be encoded into different bits when it needs to be represented on a machine. The bits used to represent the idea depend on the encoding chosen. An encoding is a map or transformation of a code point into bits or bytes. For example, the code point for 🐉 can be encoded into UTF-8, UTF-16 or UTF-32 in Python as follows:

```
Python 3.7.6 (default, Jan  8 2020, 13:42:34)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: s = '🐉'

In [2]: s.encode('utf-8')
Out[2]: b'\xf0\x9f\x90\x89'

In [3]: s.encode()  # Python3 uses 'utf-8' by default
Out[3]: b'\xf0\x9f\x90\x89'

In [4]: s.encode('utf-16')
Out[4]: b'\xff\xfe=\xd8\t\xdc'

In [5]: s.encode('utf-32')
Out[5]: b'\xff\xfe\x00\x00\t\xf4\x01\x00'
```
Python prints the bytes as human-readable characters if they are valid ASCII characters. ASCII defines 128 characters, half of the 256 possible values of a byte on an 8-bit computer system. Valid ASCII byte strings are also valid UTF-8 byte strings.
```
In [6]: s = 'hello world'

In [7]: s.encode('ascii')
Out[7]: b'hello world'

In [8]: s.encode('utf-8')
Out[8]: b'hello world'

In [9]: s.encode('utf-16')
Out[9]: b'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
```
When receiving or reading data, we must know the encoding used in order to interpret it correctly; the bytes themselves are not guaranteed to carry any information about which encoding was used. Different encodings exist for efficiency, performance and backward-compatibility reasons. UTF-8 is a good pick for an encoding in the general case.
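To see why knowing the encoding matters, here is a minimal sketch that decodes the UTF-8 bytes of 🐉 from above with the right and wrong encodings:

```python
data = "🐉".encode("utf-8")  # b'\xf0\x9f\x90\x89'

# Decoding with the encoding that was used to produce the bytes works.
print(data.decode("utf-8"))    # 🐉

# Decoding the same bytes as Latin-1 silently produces mojibake:
# 'ð' followed by three control characters.
print(data.decode("latin-1"))

# Decoding them as ASCII fails outright.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)
```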
Vim
In Vim, in insert mode, we can type Ctrl-V (aside: check out :help i_CTRL-V_digit for more information) followed by either:

- a decimal number [0-255]. Ctrl-v255 will insert ÿ.
- the letter o and then an octal number [0-377]. Ctrl-vo377 will insert ÿ.
- the letter x and then a hex number [00-ff]. Ctrl-vxff will insert ÿ.
- the letter u and then a 4-hexchar Unicode sequence. Ctrl-vu03C0 will insert π.
- the letter U and then an 8-hexchar Unicode sequence. Ctrl-vU0001F409 will insert 🐉.
Using unicode.vim, we can use :UnicodeName to get the Unicode number of the code point under the cursor. With unicode.vim and fzf installed, you can even fuzzy-find Unicode symbols.
Python
Since Python >= 3.3, the Unicode string type supports a flexible string representation. This means that any one of multiple internal representations may be used depending on the largest Unicode ordinal (1, 2, or 4 bytes) in a Unicode string.

For the common case, a string used in the English-speaking world may contain only ASCII characters, and Python will then store the data using a Latin-1 (one byte per code point) representation. If a string contains characters beyond Latin-1, the internal representation becomes UCS-2, and if it contains characters outside the Basic Multilingual Plane, it becomes UCS-4.

In each of these cases, the internal representation uses the same number of bytes for each code point. This allows efficient indexing into a Python Unicode string, but indexing will only ever return a code point, not a grapheme. The length of a Unicode string is defined as the number of code points in the string.
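One way to observe the flexible string representation is to compare the memory used by strings whose largest code point differs (a rough sketch; the exact sizes depend on the CPython version):

```python
import sys

# Each string has 4 code points; the widest code point determines whether
# CPython stores the string with 1, 2 or 4 bytes per code point.
for s in ["aaaa", "aaa\u00e9", "aaa\u20ac", "aaa\U0001F409"]:
    print(repr(s), sys.getsizeof(s))
```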
As an example, let’s take this emoji: 🤦🏼‍♂️. This emoji actually consists of 5 code points (a small Python snippet reproducing this breakdown follows the list):

(aside: we can view this breakdown using uniview; in Vim, we can use :UnicodeName)
- 🤦 : U+1F926 FACE PALM
- 🏼 : U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3
- : U+200D ZERO WIDTH JOINER
- ♂ : U+2642 MALE SIGN
- ️: U+FE0F VARIATION SELECTOR-16
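We can reproduce this breakdown in Python with the standard library (a small sketch; unicodedata.name falls back to the given default for unnamed code points):

```python
import unicodedata

s = '\U0001F926\U0001F3FC\u200D\u2642\uFE0F'

# Print each code point in the string along with its official name.
for ch in s:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```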
In Python, a string that contains just this emoji has length equal to 5.
```
Python 3.7.6 (default, Jan  8 2020, 13:42:34)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: s = "🤦🏼‍♂️"

In [2]: s
Out[2]: '🤦🏼\u200d♂️'

In [3]: print(s)
🤦🏼‍♂️

In [4]: len(s)
Out[4]: 5
```
If we want to keep a Python file pure ASCII but want to use Unicode in string literals, we can use the \U escape sequence.
```
In [5]: s = '\U0001F926\U0001F3FC\u200D\u2642\uFE0F'

In [6]: print(s)
🤦🏼‍♂️
```
As mentioned earlier, indexing into a Python Unicode string gives us the code point at that location.
```
In [6]: s[0]
Out[6]: '🤦'

In [7]: s[1]
Out[7]: '🏼'

In [8]: s[2]
Out[8]: '\u200d'

In [9]: s[3]
Out[9]: '♂'

In [10]: s[4]  # this may look like an empty string but it is not.
Out[10]: '️'

In [11]: len(s[4]), s[4].encode("utf-8")
Out[11]: (1, b'\xef\xb8\x8f')

In [12]: len(''), ''.encode("utf-8")
Out[12]: (0, b'')

In [13]: s[5]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-42-b5dece75d686> in <module>
----> 1 s[5]

IndexError: string index out of range
```
Iterating over a Python string gives us the code points as well.
```
In [14]: [c for c in s]
Out[14]: ['🤦', '🏼', '\u200d', '♂', '️']
```
However, in practice, indexing into a string by code point may not be what we want or may not be useful. More often, we are interested in either:

- indexing into the byte string representation, or
- indexing into the graphemes.

We can use the s.encode('utf-8') method to get the byte string representation of the Python Unicode string s.
```
In [15]: s
Out[15]: '🤦🏼\u200d♂️'

In [16]: len(s)
Out[16]: 5

In [17]: type(s)
Out[17]: str

In [18]: s.encode("utf-8")
Out[18]: b'\xf0\x9f\xa4\xa6\xf0\x9f\x8f\xbc\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'

In [19]: len(s.encode("utf-8"))
Out[19]: 17

In [20]: type(s.encode("utf-8"))
Out[20]: bytes
```
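Note that indexing into the byte string gives individual bytes as integers, not code points. A small sketch, assuming s is the emoji string from above:

```python
s = '\U0001F926\U0001F3FC\u200D\u2642\uFE0F'
b = s.encode("utf-8")

# Indexing a bytes object returns an int, the value of that single byte.
print(b[0])                    # 240, i.e. 0xf0
# Slicing returns bytes; the first four bytes encode U+1F926.
print(b[0:4])                  # b'\xf0\x9f\xa4\xa6'
print(b[0:4].decode("utf-8"))  # 🤦
```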
If we are interested in the number of graphemes, we can use the grapheme package.
```
In [21]: import grapheme

In [22]: grapheme.length(s)
Out[22]: 1

In [23]: s = s + " Why is Unicode so complicated?"

In [24]: grapheme.slice(s, 0, 1)
Out[24]: '🤦🏼\u200d♂️'

In [25]: grapheme.slice(s, 2)
Out[25]: 'Why is Unicode so complicated?'
```
For historical reasons, Unicode allows the same character to be represented by different sequences of code points.
```
In [26]: single_char = 'ê'
    ...: multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

In [27]: single_char
Out[27]: 'ê'

In [28]: multiple_chars
Out[28]: 'ê'

In [29]: len(single_char)
Out[29]: 1

In [30]: len(multiple_chars)
Out[30]: 2
```
We can use the built-in unicodedata module from the standard library to normalize Python Unicode strings.
```
In [31]: import unicodedata

In [32]: len(unicodedata.normalize("NFD", single_char))
Out[32]: 2
```
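Normalizing both strings to the same form makes them compare equal (a minimal sketch using NFC; normalizing both to NFD works just as well for comparison):

```python
import unicodedata

single_char = '\u00ea'  # 'ê' as one precomposed code point
multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

# The raw strings are not equal even though they render identically...
print(single_char == multiple_chars)               # False

# ...but they compare equal once both are normalized to the same form.
nfc = unicodedata.normalize("NFC", multiple_chars)
print(single_char == nfc, len(nfc))                # True 1
```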
It is best practice to add the following lines to the top of any Python file that you expect to run as a script.
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
```
If your Python files are part of a package, adding just the second line is sufficient. I recommend using pre-commit hooks to ensure that the encoding pragma of your Python files is in place before making a git commit.
Julia
Let’s take a look at how Julia handles strings. This is the version of Julia that I’m using:
```
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.0 (2020-08-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> s = "🤦🏼‍♂️"
"🤦🏼\u200d♂️"

julia> println(s)
🤦🏼‍♂️

julia> length(s)
5

julia> ncodeunits(s)
17

julia> codeunit(s)
UInt8
```
Calling length on the string in Julia returns 5. As we saw earlier, this is the number of code points in the Unicode string.

Julia String literals are encoded using the UTF-8 encoding. In Python, indexing into a string returns the code point at that position. In Julia, string indices refer to code units (bytes, for the default String type); indexing at the start of a code point returns that code point as a Char, while indexing into the middle of a multi-byte code point throws an error.
```
julia> s[1]
'🤦': Unicode U+1F926 (category So: Symbol, other)

julia> typeof(s[1])
Char

julia> s[2]
ERROR: StringIndexError("🤦🏼\u200d♂️", 2)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:220
 [3] getindex(::String, ::Int64) at ./strings/string.jl:213
 [4] top-level scope at REPL[12]:1

julia> s[3]
ERROR: StringIndexError("🤦🏼\u200d♂️", 3)
Stacktrace:
 [...]

julia> s[4]
ERROR: StringIndexError("🤦🏼\u200d♂️", 4)
Stacktrace:
 [...]

julia> s[5]
'🏼': Unicode U+1F3FC (category Sk: Symbol, modifier)

julia> s[6]
ERROR: StringIndexError("🤦🏼\u200d♂️", 6)
Stacktrace:
 [...]

julia> s[7]
ERROR: StringIndexError("🤦🏼\u200d♂️", 7)
Stacktrace:
 [...]

julia> s[8]
ERROR: StringIndexError("🤦🏼\u200d♂️", 8)
Stacktrace:
 [...]

julia> s[9]
'\u200d': Unicode U+200D (category Cf: Other, format)

julia> s[10]
ERROR: StringIndexError("🤦🏼\u200d♂️", 10)
Stacktrace:
 [...]

julia> s[11]
ERROR: StringIndexError("🤦🏼\u200d♂️", 11)
Stacktrace:
 [...]

julia> s[12]
'♂': Unicode U+2642 (category So: Symbol, other)

julia> s[13]
ERROR: StringIndexError("🤦🏼\u200d♂️", 13)
Stacktrace:
 [...]

julia> s[14]
ERROR: StringIndexError("🤦🏼\u200d♂️", 14)
Stacktrace:
 [...]

julia> s[15]
'️': Unicode U+FE0F (category Mn: Mark, nonspacing)

julia> s[16]
ERROR: StringIndexError("🤦🏼\u200d♂️", 16)
Stacktrace:
 [...]

julia> s[17]
ERROR: StringIndexError("🤦🏼\u200d♂️", 17)
Stacktrace:
 [...]

julia> s[18]
ERROR: BoundsError: attempt to access String
  at index [18]
Stacktrace:
 [...]
```
If we want each code point in a Julia String, we can use eachindex. (aside: see the Julia manual's strings documentation for more information: https://docs.julialang.org/en/v1/manual/strings/)
```
julia> [s[i] for i in eachindex(s)]
5-element Array{Char,1}:
 '🤦': Unicode U+1F926 (category So: Symbol, other)
 '🏼': Unicode U+1F3FC (category Sk: Symbol, modifier)
 '\u200d': Unicode U+200D (category Cf: Other, format)
 '♂': Unicode U+2642 (category So: Symbol, other)
 '️': Unicode U+FE0F (category Mn: Mark, nonspacing)
```
And finally, we can use the Unicode module that is built into the standard library to get the number of graphemes.
```
julia> using Unicode

julia> graphemes(s)
length-1 GraphemeIterator{String} for "🤦🏼‍♂️"

julia> length(graphemes(s))
1
```
If we wish to encode a Julia string as UTF-8, we can use the following. (aside: as of Julia v1.5.0, only conversion to/from UTF-8 is currently supported: https://docs.julialang.org/en/v1/base/strings/#Base.transcode)
```
julia> transcode(UInt8, s)
17-element Base.CodeUnits{UInt8,String}:
 0xf0
 0x9f
 0xa4
 0xa6
 0xf0
 0x9f
 0x8f
 0xbc
 0xe2
 0x80
 0x8d
 0xe2
 0x99
 0x82
 0xef
 0xb8
 0x8f
```
Rust
Let’s also take a look at Rust. We can create a simple main.rs file:
```rust
// main.rs
fn main() {
    let s = "🤦🏼‍♂️";
    println!("{}", s);
    println!("{:?}", s);
    dbg!(s);
    dbg!(s.len());
    for (i, b) in s.bytes().enumerate() {
        println!("s.bytes()[{}] = {:#x}", i, b);
    }
    dbg!(s.chars().count());
    for (i, c) in s.chars().enumerate() {
        println!("s.chars()[{}] = {:?}", i, c);
    }
}
```
And compile and run it like so:
```
$ rustc main.rs && ./main
🤦🏼‍♂️
"🤦🏼\u{200d}♂\u{fe0f}"
[main.rs:11] s = "🤦🏼\u{200d}♂\u{fe0f}"
[main.rs:13] s.len() = 17
s.bytes()[0] = 0xf0
s.bytes()[1] = 0x9f
s.bytes()[2] = 0xa4
s.bytes()[3] = 0xa6
s.bytes()[4] = 0xf0
s.bytes()[5] = 0x9f
s.bytes()[6] = 0x8f
s.bytes()[7] = 0xbc
s.bytes()[8] = 0xe2
s.bytes()[9] = 0x80
s.bytes()[10] = 0x8d
s.bytes()[11] = 0xe2
s.bytes()[12] = 0x99
s.bytes()[13] = 0x82
s.bytes()[14] = 0xef
s.bytes()[15] = 0xb8
s.bytes()[16] = 0x8f
[main.rs:19] s.chars().count() = 5
s.chars()[0] = '🤦'
s.chars()[1] = '🏼'
s.chars()[2] = '\u{200d}'
s.chars()[3] = '♂'
s.chars()[4] = '\u{fe0f}'
```
There are also additional crates such as unicode-width and unicode-segmentation. unicode-width helps determine how many columns a grapheme will occupy, based on the rules in Unicode Standard Annex #11. For example, abc occupies 3 columns and 写作业 occupies 6 columns, even though both are 3 code points and 3 graphemes each. unicode-segmentation helps with determining the number of graphemes in a string. After adding these crates as dependencies in Cargo.toml, we can use them like so:
```rust
// main.rs
use unicode_width::UnicodeWidthStr;
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "abc";
    dbg!(s);
    dbg!(s.len());
    dbg!(s.width());
    dbg!(s.graphemes(true).count());
    println!("");

    let s = "写作业";
    dbg!(s);
    dbg!(s.len());
    dbg!(s.width());
    dbg!(s.graphemes(true).count());
    println!("");

    let s = "🤦🏼‍♂️";
    dbg!(s);
    dbg!(s.len());
    dbg!(s.width());
    dbg!(s.graphemes(true).count());
}
```
```
$ cargo run
[src/main.rs:6] s = "abc"
[src/main.rs:7] s.len() = 3
[src/main.rs:8] s.width() = 3
[src/main.rs:9] s.graphemes(true).count() = 3

[src/main.rs:14] s = "写作业"
[src/main.rs:15] s.len() = 9
[src/main.rs:16] s.width() = 6
[src/main.rs:17] s.graphemes(true).count() = 3

[src/main.rs:22] s = "🤦🏼‍♂️"
[src/main.rs:23] s.len() = 17
[src/main.rs:24] s.width() = 5
[src/main.rs:25] s.graphemes(true).count() = 1
```