A Surprise: Why Do Chinese Characters Count as Letters?

2024-03-18 | Coder

Source: 未聞Code

I have recently been using a third-party library called RapidFuzz. It has a utility function, utils.default_process, which the official documentation describes as follows:

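The text in the red box says that this function removes all non-alphanumeric characters. Run the word through translation software and alphanumeric comes back as "letters and digits", as shown in the figure below: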

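So I naturally assumed that this function would keep only the 26 English letters in upper and lower case plus the 10 digits, 62 characters in total, and remove everything else.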

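But when I tested it, it did not filter out Chinese characters at all, as shown in the figure below. Could Chinese characters really count as letters too?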

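So I went to GitHub and opened an issue on the project. The author replied that the function was working as intended, and tested with Python's .isalnum() to show that Python also treats Chinese characters as alphanumeric, as shown in the figure below: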
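The same check is easy to repeat in a plain Python session. This is a minimal sketch; the sample strings are illustrative assumptions and are not taken from the issue thread.

```python
# Check how str.isalnum() treats Chinese text, punctuation, and spaces.
samples = ["abc123", "中文", "中文abc123", "中文，abc", "hello world"]

# Map each sample to the result of isalnum().
results = {s: s.isalnum() for s in samples}

for s, ok in results.items():
    print(f"{s!r}.isalnum() -> {ok}")
```

Strings made only of Chinese characters, ASCII letters, and digits pass; the fullwidth comma and the space make the last two fail.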

That seemed very strange, so I looked up the official Python documentation, which says:

str.isalnum() ^[1]

Return True if all characters in the string are alphanumeric and there is at least one character, False otherwise. A character c is alphanumeric if one of the following returns True : c.isalpha() , c.isdecimal() , c.isdigit() , or c.isnumeric() .
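To see which of the four predicates is responsible for a given character, they can be evaluated one by one. A minimal sketch; alnum_breakdown is a hypothetical helper name introduced here for illustration, not part of Python or RapidFuzz.

```python
def alnum_breakdown(ch: str) -> dict:
    """Return the result of each predicate that str.isalnum() is defined by."""
    return {
        "isalpha": ch.isalpha(),
        "isdecimal": ch.isdecimal(),
        "isdigit": ch.isdigit(),
        "isnumeric": ch.isnumeric(),
    }


# One Chinese character, one ASCII letter, one ASCII digit.
breakdowns = {ch: alnum_breakdown(ch) for ch in "中a5"}

for ch, flags in breakdowns.items():
    print(ch, flags, "-> isalnum:", ch.isalnum())
```

For '中', only isalpha comes back True, and that alone is enough for isalnum() to return True.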

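So '中文'.isalnum() returns True, evidently because '中文'.isalpha() returns True. And .isalpha() returns True because it does not check only for English letters; it accepts every character whose Unicode category is a letter category: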

str.isalpha() ^[2]

Return True if all characters in the string are alphabetic and there is at least one character, False otherwise. Alphabetic characters are those characters defined in the Unicode character database as "Letter", i.e., those with general category property being one of "Lm", "Lt", "Lu", "Ll", or "Lo".

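The Unicode standard's UAX #44: Unicode Character Database ^[3] defines what these categories mean: Lu is an uppercase letter, Ll a lowercase letter, Lt a titlecase letter, Lm a modifier letter, and Lo an "other" letter.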

Using Python's built-in unicodedata module, we can see that the category of Chinese characters is indeed Lo, as shown in the figure below:
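The lookup can be reproduced with unicodedata.category from the standard library. A minimal sketch; the characters chosen are illustrative assumptions.

```python
import unicodedata

# A few characters from different scripts and classes.
chars = ["中", "文", "a", "A", "5", "，"]

# Look up the Unicode general category of each character.
categories = {ch: unicodedata.category(ch) for ch in chars}

for ch, cat in categories.items():
    print(f"{ch!r}: {cat}")
```

The Chinese characters report Lo, the ASCII letters report Ll and Lu, the digit reports Nd, and the fullwidth comma reports Po, which is not a letter category.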

So it is indeed reasonable for '中文'.isalpha() to return True.

From now on, when you see alphanumeric, do not assume it means only 62 characters.
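If the 62-character behaviour is what you actually need, one option is to filter against an explicit ASCII set instead of relying on isalnum(). This is a sketch under that assumption, not something RapidFuzz provides; keep_ascii_alnum is a hypothetical helper name.

```python
import string

# The 62 characters: a-z, A-Z, 0-9.
ASCII_ALNUM = frozenset(string.ascii_letters + string.digits)


def keep_ascii_alnum(text: str) -> str:
    """Drop every character that is not an ASCII letter or ASCII digit."""
    return "".join(ch for ch in text if ch in ASCII_ALNUM)


cleaned = keep_ascii_alnum("中文abc, 123!")
print(len(ASCII_ALNUM), repr(cleaned))
```

Unlike utils.default_process, this sketch does not lowercase the text or turn removed characters into spaces; it only shows the stricter filter.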

References

[1] str.isalnum(): https://docs.python.org/3/library/stdtypes.html#str.isalnum

[2] str.isalpha(): https://docs.python.org/3/library/stdtypes.html#str.isalpha

[3] UAX #44: Unicode Character Database: https://unicode.org/reports/tr44/#General_Category_Values
