2024-06-24 碼農

Learn Python Text Processing in 10 Minutes

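In modern data analysis, text processing is an essential skill. Whether you work in data analysis, machine learning, or natural language processing, it is a fundamental you need to master. With it, you can extract useful information from large volumes of unstructured data and clean and preprocess that data, laying the groundwork for later analysis. This article walks you through efficient text processing in Python in about 10 minutes.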

1. Why Text Processing Matters

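Text data is everywhere, from comments on social media to enterprise log files, and it makes up a large share of unstructured data. By commonly cited estimates, unstructured data accounts for more than 80% of enterprise data, and most of that is text. Text processing has a wide range of applications, for example: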

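• Data analysis: by processing customer feedback and reviews, companies can extract valuable information to improve products and services.

• Natural language processing: tasks such as text classification, sentiment analysis, and named entity recognition.

• Log analysis: analyzing server logs can reveal latent system problems and security vulnerabilities.

• Data cleaning: in data analysis and machine learning, cleaning and preprocessing text data is a key step.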

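2. Basic Text Processing Methods in Python

2.1 String Operations

Python has a rich set of built-in string methods. Here are some commonly used ones with examples:

Basic string operations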

# Define a string
text = "Hello, World!"

# Convert to uppercase
print(text.upper())  # Output: HELLO, WORLD!

# Convert to lowercase
print(text.lower())  # Output: hello, world!

# Strip whitespace from both ends
text_with_spaces = " Hello, World! "
print(text_with_spaces.strip())  # Output: Hello, World!

# Replace a substring
print(text.replace("World", "Python"))  # Output: Hello, Python!

# Split the string
print(text.split(", "))  # Output: ['Hello', 'World!']

String formatting

Python offers several ways to format strings, such as the % operator, the str.format() method, and f-strings.

name = "Alice"
age = 30

# Using the % operator
print("Name: %s, Age: %d" % (name, age))

# Using the str.format() method
print("Name: {}, Age: {}".format(name, age))

# Using f-strings (Python 3.6+)
print(f"Name: {name}, Age: {age}")
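f-strings also accept a format specifier after a colon, which is handy for aligning columns or rounding numbers. The following is a minimal sketch; the variable names and sample values are illustrative assumptions, not part of the example above.

```python
# Illustrative values (assumptions for this sketch)
item = "Widget"
price = 1234.5
ratio = 0.256

# Left-align the name in 10 characters; right-align the price in 12
# characters with a thousands separator and two decimal places
row = f"{item:<10}|{price:>12,.2f}"
print(row)

# Format a fraction as a percentage with one decimal place
percent = f"{ratio:.1%}"
print(percent)
```

The same specifiers work with str.format(), for example "{:>12,.2f}".format(price).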

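2.2 Regular Expressions

Regular expressions are a powerful text processing tool for matching and manipulating strings. Python's re module provides regular expression support.

Matching and searching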

import re

text = "The price is $100. The discount is 20%."

# Match the price
price_pattern = r"\$\d+"
price_match = re.search(price_pattern, text)
if price_match:
    print(price_match.group())  # Output: $100

# Find all percentages
discount_pattern = r"\d+%"
discount_matches = re.findall(discount_pattern, text)
print(discount_matches)  # Output: ['20%']

Substitution and splitting

# Replace every percentage with "XX%" (reuses re and text from the previous example)
text_with_replaced_discounts = re.sub(r"\d+%", "XX%", text)
print(text_with_replaced_discounts)  # Output: The price is $100. The discount is XX%.

# Split the string on runs of whitespace
split_pattern = r"\s+"
split_text = re.split(split_pattern, text)
print(split_text)  # Output: ['The', 'price', 'is', '$100.', 'The', 'discount', 'is', '20%.']
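When the matched values are needed as data rather than as text, capture groups are usually more convenient than matching the whole token. The sketch below uses named groups and a compiled pattern; the group names price and discount and the final_price calculation are assumptions added for illustration.

```python
import re

text = "The price is $100. The discount is 20%."

# Compile once when the same pattern will be reused many times
pattern = re.compile(r"\$(?P<price>\d+).*?(?P<discount>\d+)%")

match = pattern.search(text)
if match:
    # Named groups return only the digits, without "$" or "%"
    price = int(match.group("price"))
    discount = int(match.group("discount"))
    final_price = price * (100 - discount) / 100
    print(price, discount, final_price)
```

Named groups keep the extraction readable when a pattern grows, since each value is looked up by name instead of by position.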

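3. Reading and Writing Files with Python

In real text processing tasks, we often need to read data from a file and write the results back to a file. Python provides simple yet powerful ways to do both.

3.1 Reading Files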

# Read the whole file at once
with open("example.txt", "r") as file:
    content = file.read()
    print(content)

# Read the file line by line
with open("example.txt", "r") as file:
    for line in file:
        print(line.strip())

3.2 Writing Files

# Write to a file (overwrite mode)
with open("output.txt", "w") as file:
    file.write("Hello, World!")

# Write to a file (append mode)
with open("output.txt", "a") as file:
    file.write("\nHello, Python!")
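To see how the "w" and "a" modes interact, here is a small round-trip sketch that writes, appends, and reads back. It uses a temporary directory so that no real files are touched; the use of tempfile and pathlib is a choice made for this sketch rather than something the examples above require.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "output.txt"

    # "w" creates the file, or truncates it if it already exists
    with open(path, "w", encoding="utf-8") as f:
        f.write("Hello, World!")

    # "a" keeps the existing content and adds to the end
    with open(path, "a", encoding="utf-8") as f:
        f.write("\nHello, Python!")

    # Read the result back, dropping the trailing newline of each line
    with open(path, "r", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

print(lines)
```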

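4. Practical Text Processing Examples

4.1 Log Analysis

Suppose we have a server log file, server.log, containing a large number of log entries. We want to extract all error entries and count how often each kind of error occurs.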

import re
from collections import defaultdict

# Dictionary of error counts; missing keys start at 0
error_counts = defaultdict(int)

# Pattern for error entries; the group captures the message after "ERROR: "
error_pattern = r"ERROR: (.+)"

# Read the log file and count the errors
with open("server.log", "r") as file:
    for line in file:
        error_match = re.search(error_pattern, line)
        if error_match:
            error_message = error_match.group(1)
            error_counts[error_message] += 1

# Print the error statistics
for error_message, count in error_counts.items():
    print(f"{error_message}: {count}")
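The same counting logic can be tried without a log file on disk. The sketch below runs on a few hypothetical log lines (the timestamp layout and messages are invented for illustration; a real server.log may look different) and swaps in collections.Counter, whose most_common() method sorts the errors by frequency.

```python
import re
from collections import Counter

# Hypothetical log lines; a real server.log may use a different layout
sample_log = [
    "2024-06-24 10:00:01 INFO: Server started",
    "2024-06-24 10:00:05 ERROR: Database connection failed",
    "2024-06-24 10:00:09 ERROR: Disk full",
    "2024-06-24 10:00:12 ERROR: Database connection failed",
]

error_pattern = re.compile(r"ERROR: (.+)")

error_counts = Counter()
for line in sample_log:
    match = error_pattern.search(line)
    if match:
        # strip() guards against trailing whitespace in the message
        error_counts[match.group(1).strip()] += 1

# most_common() yields (message, count) pairs, highest count first
for message, count in error_counts.most_common():
    print(f"{message}: {count}")
```

Because the loop only needs an iterable of lines, an open file object can replace sample_log without other changes.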

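4.2 Data Cleaning

In data analysis and machine learning, data cleaning is a key step. Suppose we have a data file of user comments, comments.txt, and we want to remove special characters and extra whitespace from each comment.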

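import re

# Function that cleans a single comment
def clean_comment(comment):
    # Remove special characters (anything other than ASCII letters, digits, and whitespace)
    comment = re.sub(r"[^a-zA-Z0-9\s]", "", comment)
    # Collapse runs of whitespace into a single space
    comment = re.sub(r"\s+", " ", comment)
    return comment.strip()

# Read the comment file and clean each line
cleaned_comments = []
with open("comments.txt", "r") as file:
    for line in file:
        cleaned_comment = clean_comment(line)
        cleaned_comments.append(cleaned_comment)

# Print the cleaned comments
for comment in cleaned_comments:
    print(comment)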
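One caveat with the pattern [^a-zA-Z0-9\s] is that it removes every non-ASCII character, so comments written in Chinese or containing accented letters lose most of their content. A possible variant, sketched below, uses \w, which matches Unicode letters and digits (and the underscore) for str patterns in Python 3. The function name clean_comment_unicode and the sample comment are assumptions made for this illustration.

```python
import re

def clean_comment_ascii(comment):
    # Same logic as the ASCII-only pattern: keeps ASCII letters, digits, whitespace
    comment = re.sub(r"[^a-zA-Z0-9\s]", "", comment)
    return re.sub(r"\s+", " ", comment).strip()

def clean_comment_unicode(comment):
    # \w also keeps non-ASCII letters and digits (and the underscore)
    comment = re.sub(r"[^\w\s]", "", comment)
    return re.sub(r"\s+", " ", comment).strip()

sample = "Great 產品!!!   很好用"
ascii_result = clean_comment_ascii(sample)
unicode_result = clean_comment_unicode(sample)
print(repr(ascii_result))
print(repr(unicode_result))
```

Which variant is appropriate depends on the data: the ASCII-only pattern is fine for English-only text, while the \w version preserves other scripts.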

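5. Common Text Processing Problems and Solutions

5.1 Encoding Problems

When working with text files, we often run into encoding problems, especially with files that are not UTF-8 encoded. Python's open function lets us specify a file's encoding through its encoding parameter.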

# Read a file that is not UTF-8 encoded
with open("example.txt", "r", encoding="latin-1") as file:
    content = file.read()
    print(content)
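What goes wrong when the encoding is guessed incorrectly can be reproduced in memory with bytes.decode(), which follows the same codec rules as open(). This is a minimal sketch with an illustrative sample string; the errors="replace" option shown at the end is also accepted by open().

```python
# "café", written with an escape so the accented letter is unambiguous
text = "caf\u00e9"

# The same text produces different bytes under different encodings
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'

# Decoding with the wrong codec can silently garble the text...
garbled = utf8_bytes.decode("latin-1")

# ...or raise UnicodeDecodeError
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
replaced = latin1_bytes.decode("utf-8", errors="replace")

print(repr(garbled), failed, repr(replaced))
```

Replacing bad bytes loses information, so it is best reserved for cases where the true encoding cannot be determined.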

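5.2 Handling Large Files

When processing large files, we need to keep memory usage in mind. Reading the file line by line is an effective approach, because iterating over a file object yields one line at a time instead of loading the whole file into memory.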

# Read a large file line by line
with open("large_file.txt", "r") as file:
    for line in file:
        # Process each line
        pass
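As a concrete instance of line-by-line processing, the sketch below counts words while holding only one line in memory at a time. The function name count_words is hypothetical, and io.StringIO stands in for a large file so the example runs without any file on disk.

```python
import io
from collections import Counter

def count_words(lines):
    """Count words from any iterable of lines, one line at a time."""
    counts = Counter()
    for line in lines:
        # split() with no arguments splits on any run of whitespace
        counts.update(line.lower().split())
    return counts

# io.StringIO stands in for a large file here; an open file object
# can be passed to count_words() in exactly the same way
fake_file = io.StringIO("error warning error\ninfo error\n")
word_counts = count_words(fake_file)
print(word_counts.most_common(1))
```

Only the running totals are kept, so memory use grows with the number of distinct words rather than with the file size.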

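5.3 Multithreaded Processing

For some complex text processing tasks, we can split the work across multiple threads. Note that in standard CPython the global interpreter lock (GIL) lets only one thread run Python bytecode at a time, so threads mainly speed up I/O-bound work such as reading many files or making network requests; CPU-bound processing usually benefits more from the multiprocessing module.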

import threading

# Worker function run by each thread
def process_lines(lines):
    for line in lines:
        # Process each line
        pass

# Read the file (note: readlines() loads every line into memory)
with open("large_file.txt", "r") as file:
    lines = file.readlines()

# Split the lines into 4 chunks; the last chunk takes any remainder
chunk_size = len(lines) // 4
threads = []
for i in range(4):
    start = i * chunk_size
    end = (i + 1) * chunk_size if i != 3 else len(lines)
    thread = threading.Thread(target=process_lines, args=(lines[start:end],))
    threads.append(thread)
    thread.start()

# Wait for all threads to finish
for thread in threads:
    thread.join()
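A plain threading.Thread has no built-in way to hand a result back to the caller. One alternative, sketched below, is concurrent.futures.ThreadPoolExecutor, whose map() method collects each chunk's return value in input order. The helper names count_chars and split_into_chunks and the sample lines are assumptions for this illustration. For CPU-bound work in standard CPython, threads are limited by the GIL, and ProcessPoolExecutor offers the same interface with separate processes.

```python
from concurrent.futures import ThreadPoolExecutor

def count_chars(lines):
    # Hypothetical per-chunk task: total characters, ignoring surrounding whitespace
    return sum(len(line.strip()) for line in lines)

def split_into_chunks(items, n_chunks):
    # Ceiling division so that every item lands in some chunk
    size = max(1, -(-len(items) // n_chunks))
    return [items[i:i + size] for i in range(0, len(items), size)]

lines = ["alpha\n", "beta\n", "gamma\n", "delta\n", "epsilon\n"]
chunks = split_into_chunks(lines, 4)

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() returns results in the same order as the input chunks
    partial_totals = list(executor.map(count_chars, chunks))

total = sum(partial_totals)
print(partial_totals, total)
```

Leaving the with block waits for all submitted work to finish, so no explicit join() calls are needed.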

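5.4 Text Classification and Sentiment Analysis

In natural language processing (NLP), text classification and sentiment analysis are two common tasks. We can use the scikit-learn and nltk libraries to implement them.

Text classification example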

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample text data (far too small for a meaningful accuracy; for illustration only)
texts = ["I love this product", "This is the worst service ever", "Amazing experience", "Not good at all"]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative review

# Vectorize the text into word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

# Train a naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Evaluate the classifier
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Classification accuracy: {accuracy}")
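To make the vectorization step less of a black box, here is a rough pure-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation; it only mimics, under simplifying assumptions, CountVectorizer's default behavior of lowercasing, keeping tokens of two or more word characters, and ordering the vocabulary alphabetically. The helper names tokenize and to_counts are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

texts = ["I love this product", "This is the worst service ever"]

# Vocabulary: every distinct token, sorted, mapped to a column index
all_tokens = sorted({tok for doc in texts for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def to_counts(text):
    # One count per vocabulary entry, in column order
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

matrix = [to_counts(doc) for doc in texts]
print(list(vocabulary))
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary word, which is the shape of input a classifier such as MultinomialNB is trained on. Note that the single-character token "I" is dropped by the two-character minimum.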

Sentiment analysis example

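import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon; this downloads it once if it is not already present
nltk.download("vader_lexicon")

# Initialize the sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Sample texts
texts = ["I love this product", "This is the worst service ever", "Amazing experience", "Not good at all"]

# Run sentiment analysis on each text
for text in texts:
    sentiment = sia.polarity_scores(text)
    print(f"Text: {text}")
    print(f"Sentiment scores: {sentiment}")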

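Summary

In this article, we looked at why text processing matters and covered the basic ways to do it in Python: string operations and regular expressions, reading and writing files, practical examples, and solutions to common problems. Hopefully this helps you get started with Python text processing quickly.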
