謝宗翰的隨筆 : [機器學習] 關於 Tokenization

最近在看 tokenization 的數學基礎，發現非常有趣，以下簡單介紹其相關的基礎內容，有興趣的讀者可以參考標準計算理論的書比如 M. Sipser, Introduction to the Theory of Computation。

Definition (Alphabet): 一個字母表(Alphabet)，記作 $\Sigma$，是一個有限非空集合。其元素 $s \in \Sigma$ 稱作符元 (symbols)

Example 1: $\Sigma := \{0,1\}$ 收集了兩個符元 $0$ 與 $1$，可用在表達二進制的世界。

Example 2: $\Sigma := \{a,b,c,\dots, z\}$ 為基礎英文字母表，收集了26 個符元 $a,b,c\dots,z$，可用在表達英文文字的世界。接著定義字串(string)。

嚴格來說字串是一組有限序列：

Definition (String): 一個在字母表 $\Sigma$ 上，具有長度為 $n$ 的字串(string)是一個函數 $w: \{1,\dots, n\} \to \Sigma$。一般而言寫我們將此字串記為序列 $$w :=c_1c_2 \dots c_n$$ 其中 $c_i := w(i)$ 為第 $i$ 個位置的符元。

Definition (Empty String) 存在一個唯一長度為 $0$ 的字串，記作 $\varepsilon$，代表空序列。

Definition (Concatenation): 令 $x:=x_1 x_2 \cdots x_n$ 且 $y:= y_1 \cdots y_m$ 分別為長度為 $n$ 與 $m$ 的兩個字串。我們定義串接(concatenation) $x \cdot y$ (或簡記為 $xy$ ) 為一個長度為 $n+m$ 的字串，其定義如下： \begin{align*} (xy)_k = \begin{cases} x_k & \text{if } 1 \le k \le n \\ y_{k-n} & \text{if } n < k \le n+m \end{cases} \end{align*}

Remark (Monoid Structure): 所有字串所成的集合為基於 concatenation 算子之下的么半群 (monoid)，亦即具備有可結合(associative)二元運算和單位元素(identity)的代數結構：

Associativity: 對任何字串 $x,y,z$而言，我們有 $ (xy)z = x(yz). $
Identity: 存在一個字串 $\varepsilon$ 使得 $ x \varepsilon = \varepsilon x = x. $

以下我們先看個例子。

Example: 考慮 $\Sigma :=\{ \texttt{a}, \cdots \texttt{z} \}$。並考慮字串 $x:= \texttt{no}$ (長度為 $n=2$) 與字串 $y:= \texttt{way}$ (長度為 $m=3$)。現在我們計算 concatenation $z:=x\cdot y$ 。新字串 $z$ 的長度為 $n+m = 2+3 = 5$。如果要找出新字串 $z$ 第1個到第5個元素分別為何，我們可以根據前述定義： \begin{align*} (xy)_k = \begin{cases} x_k & \text{if } 1 \le k \le 2 \\ y_{k-n} & \text{if } 2 < k \le 5 \end{cases} \end{align*}

當 $k=1:$ $1\leq 2$，則 $z_1= x_1 = \texttt{n}$
當 $k=2:$ $2\leq 2$，則 $z_2= x_2 = \texttt{o}$
當 $k=3:$ $3 > 2$，則 $z_3= y_{3-2} = y_1 = \texttt{w}$
當 $k=4:$ $4 > 2$，則 $z_4= y_{4-2} = y_2 = \texttt{a}$
當 $k=5:$ $5 > 2$，則 $z_5= y_{5-2} = y_3=\texttt{y}$

故我們得到 $z= x \cdot y = \texttt{noway}$

接著我們透過遞迴建構對所有可能的字串：

以下我們進一步建構形式語言：

Definition (Power of an Alphabet): 令 $\Sigma$ 為一個字母表。我們以歸納法定義 $\Sigma$ 的冪次 (powers) ，記作 $\Sigma^n$ 。定義如下：

$\Sigma^0 =\{\varepsilon\}$。亦即包含空字串的集合
$\Sigma^1 = \Sigma$
對任意 $n \geq 1$而言，定義 $\Sigma^{n+1} :=\{w x: w \in \Sigma^n, x \in \Sigma\}$

Definition (Kleene Star): 我們稱集合 $\Sigma^*$ 為 Kleene star，若該集合為 $\Sigma$ 所有冪次的可數無窮聯集，亦即 $$ \Sigma^* := \bigcup_{n=0}^\infty \Sigma^n = \Sigma^0 \cup (\Sigma \times \Sigma) \cup (\Sigma \times \Sigma \times \Sigma) \cup \dots $$ 換言之，$\Sigma^*$ 為由字母表 $\Sigma$ 所生成的所有有限字串的集合。

Remark: 注意到這邊使用的是 Kleene star $\Sigma^*$ 而非單一的 Cartesian product $\Sigma^n$。這是因為 Cartesian product $\Sigma^n = \Sigma \times \Sigma \times \cdots \Sigma$ 僅定義了固定長度 $n$ 的序列空間(類似固定維度的向量空間)。然而，形式語言必須包含任意有限長度字串，因此必須透過對所有可能的長度 $n$ 取聯集 $\cup_n^\infty \Sigma^n$ 來建構。

=======

The Tokenization Problem. 現在我們引入字彙集 (vocabulary) $\mathcal{V}$ ，並將 tokenization 定義為串接的反問題。

Definition (Vocabulary): 一個字彙集 vocabulary $\mathcal{V}$ 為 $\Sigma^*$ 的有限子集。亦即 $\mathcal{V} \subseteq \Sigma^*$ 且 $|\mathcal{V}| < \infty$

Definition (Valid Tokenization): 令 $s \in \Sigma^*$ 為目標字串。 $s$ 的 tokenization over $\mathcal{V}$ 為字串序列 $(t_1,t_2, \dots, t_k)$ 使得

1. Validity. 對任意 $i$, $t_i \in \mathcal{V}$

2. Reconstruction. $s = t_1 t_2 \cdots t_k$ (串接後還原為 $s$ )

Theorem (Existence of Tokenization): 令 $\Sigma$ 為字母表且 $\mathcal{V} \subseteq \Sigma^*$ 為字彙集。若 $\Sigma \subseteq \mathcal{V}$，則對任意字串 $s \in \Sigma^*$，必存在至少一個 tokenization of $s$ over $\mathcal{V}$。

Proof. 取任意字串 $s \in \Sigma^*$。由 $\Sigma^*$ 定義，存在 $n \in \mathbb{N}_0$ 使得 $s \in \Sigma^n$。以下我們分兩個情況討論：

Case 1. 若 $n=0$，則 $s = \varepsilon$。空序列即為一個 valid tokenization。

Case 2. 若 $n > 0$，根據 Powers of an Alphabet 的定義，字串 $s$ 可表示為序列 $c_1 c_2 \dots c_n$，其中每個 $c_i \in \Sigma$。由於前提可知 $\Sigma \subseteq \mathcal{V}$，故對所有 $i \in \{1, \dots, n\}$，我們有 $c_i \in \mathcal{V}$。現在建構序列 $$ T := (c_1, c_2, \dots, c_n) $$ 由於 $T$ 的每個元素都屬於 $\mathcal{V}$ 且其 concatenation 結果為 $s$，故 $T$ 為 valid tokenization。至此證畢。 $\square$

Corollary: 若 $\Sigma \not\subseteq \mathcal{V}$，則存在至少一個 $s \in \Sigma^*$ 使得其不存在 valid tokenization.

Example: 考慮$\Sigma = \{ \text{a}, \dots, \text{z} \}$ 為基礎英文字母表。設字彙集為 $\mathcal{V}$ 為一組英文單字：$\mathcal{V} := \{ \text{apple}, \text{banana} \}$。注意到此處 $\Sigma \not\subseteq \mathcal{V}$，例如 'c' $\in \Sigma$ 但是 'c' $\notin \mathcal{V}$。若現在輸入字串 $s = \text{cat}$，則 $s$ 沒有辦法 tokenize，因為無法由 $\mathcal{V}$ 中的元素串接而成。

謝宗翰的隨筆

2/05/2026

[機器學習] 關於 Tokenization

沒有留言:

張貼留言

[Claude] 國小數學加減乘除法計算小遊戲：數學怪獸大亂鬥

搜尋此網誌