【RAG入門教程04】Langchian的文件切分

2024-06-09碼農

在 Langchain 中，文件轉換器是一種在將文件提供給其他 Langchain 元件之前對其進行處理的工具。透過清理、處理和轉換文件，這些工具可確保 LLM 和其他 Langchain 元件以最佳化其效能的格式接收數據。

上一章我們了解了文件載入器，載入完文件之後還需要對文件進行轉換。

文本分割器

整合

unset unset Text Splitters unset unset

文本分割器專門用於將文本文件分割成更小、更易於管理的單元。

理想情況下，這些塊應該是句子或段落，以便理解文本中的上下文和關系。

分割器考慮了 LLM 處理能力的局限性。透過建立更小的塊，LLM 可以在其上下文視窗內更有效地分析資訊。

CharacterTextSplitter

RecursiveCharacterTextSplitter

Split by tokens

Semantic Chunking

HTMLHeaderTextSplitter

MarkdownHeaderTextSplitter

RecursiveJsonSplitter

Split Cod

unset unset CharacterTextSplitter unset unset

from langchain_text_splitters import CharacterTextSplitter text_splitter = CharacterTextSplitter( separator="\n\n", chunk_size=1000, chunk_overlap=200, length_function=len, is_separator_regex=False, )

separator ：這是用於標識文本中自然斷點的分隔符。在本例中，它被設定為「\n\n」，這意味著分割器將尋找雙換行符作為潛在的分割點。

chunk_size ：此參數指定每個文本塊的目標大小，以字元數表示。在這裏，它被設定為 1000，這意味著分割器將旨在建立大約 1000 個字元長的文本塊。

chunk_overlap ：此參數允許連續塊之間重疊字元。它被設定為 200，這意味著每個塊將包含前一個塊末尾的 200 個字元。這種重疊可以幫助確保在塊之間的邊界上不會遺失任何重要資訊。

length_function ：這是一個用於測量文本塊長度的函式。在本例中，它被設定為內建的 len 函式，該函式計算字串中的字元數。

is_separator_regex ：此參數指定分隔符是否為正規表式。它被設定為 False，表示分隔符是一個純字串，而不是正規表式模式。

CharacterTextSplitter 根據指定的分隔符拆分文本，預設情況下分隔符設定為 '\n\n'。 chunk_size 參數確定每個塊的最大大小，並且只有在可行的情況下才會進行拆分。如果字串以 n 個字元開頭，後跟一個分隔符，然後在下一個分隔符之前有 m 個字元，則如果 chunk_size 小於 n + m + len(separator)，則第一個塊的大小將為 n。

from langchain_community.document_loaders import PyPDFLoader loader = PyPDFLoader("book.pdf") pages = loader.load_and_split() from langchain_text_splitters import CharacterTextSplitter text_splitter = CharacterTextSplitter( separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len, is_separator_regex=False, ) texts = text_splitter.split_text(pages[0].page_content) print(len(texts)) # 4 texts[0] """ 'Our goal with this book is to provide the guidance and framework for you, the reader, to grow on \nthe path to being a truly excellent database reliability engineer (DBRE). When naming the book we \nchose to use the words reliability engineer , rather than administrator. \nBen Treynor, VP of Engineering at Google, says the following about reliability engi‐ neering: \nfundamentally doing work that has historically been done by an operations team, but using engineers with software \nexpertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, \nsubstitute automation for human labor. \nToday’s database professionals must be engineers, not administrators. We build things. We create \nthings. As engineers practicing devops, we are all in this together, and nothing is someone else’s \nproblem. As engineers, we apply repeatable processes, establ ished knowledge, and expert judgment' """ texts[1] """ 'things. As engineers practicing devops, we are all in this together, and nothing is someone else’s \nproblem. As engineers, we apply repeatable processes, establ ished knowledge, and expert judgment \nto design, build, and operate production data stores and the data structures within. As database \nreliability engineers, we must take the operational principles and the depth of database expertise \nthat we possess one ste p further. \nIf you look at the non -storage components of today’s infrastructures, you will see sys‐ tems that are \neasily built, run, and destroyed via programmatic and often automatic means. The lifetimes of these \ncomponents can be measured in days, and sometimes even hours or minutes. When one goes away, \nthere is any number of others to step in and keep the quality of service at expected levels. \nOur next goal is that you gain a framework of principles and practices for the design, building, and' """

unset unset RecursiveCharacterTextSplitter unset unset

關鍵區別在於，如果結果塊仍然大於所需的 chunk_size，它將繼續分割結果塊，以確保所有最終塊都在指定的大小限制內。它由字元列表參數化。

from langchain_text_splitters import RecursiveCharacterTextSplitter text_splitter = RecursiveCharacterTextSplitter( # Set a really small chunk size, just to show. separators=["\n\n", "\n", " ", ""], chunk_size=50, chunk_overlap=40, length_function=len, is_separator_regex=False, ) texts = text_splitter.split_text(pages[0].page_content) print(len(texts)) texts[2] """ 'book is to provide the guidance and framework for' """ texts[3] """ 'provide the guidance and framework for you, the' """

在文本拆分的上下文中，「遞迴」意味著拆分器將重復將其拆分邏輯套用於生成的塊，直到它們滿足某些標準，例如小於指定的最大長度。這在處理需要分解成更小、更易於管理的片段（可能在不同的粒度級別）的非常長的文本時特別有用。

unset unset Split By Tokens unset unset

原文：「The quick brown fox jumps over the lazy dog。」

標記：[「The」、「quick」、「brown」、「fox」、「jumps」、「over」、「the」、「lazy」、「dog」]

在此範例中，文本根據空格和標點符號拆分為標記。每個單詞都成為單獨的標記。在實踐中，標記化可能更復雜，尤其是對於具有不同書寫系統的語言或處理特殊情況（例如，「don’t」可能拆分為「do」和「n’t」）。

有各種標記器。

TokenTextSplitter 來自 tiktoken 庫。

from langchain_text_splitters import TokenTextSplitter text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=1) texts = text_splitter.split_text(pages[0].page_content) texts[0] """ 'Our goal with this book is to provide the guidance' """ texts[1] """ ' guidance and framework for you, the reader, to' """

SpacyTextSplitter 來自 spacy 庫。

from langchain_text_splitters import SpacyTextSplitter text_splitter = SpacyTextSplitter(chunk_size=1000) texts = text_splitter.split_text(pages[0].page_content)

NLTKTextSplitter 來自 nltk 庫。

from langchain_text_splitters import NLTKTextSplitter text_splitter = NLTKTextSplitter(chunk_size=1000) texts = text_splitter.split_text(pages[0].page_content)

我們甚至可以利用 Hugging Face 標記器。

from transformers import GPT2TokenizerFast tokenizer = GPT2TokenizerFast.from_pretrained("gpt2") text_splitter = CharacterTextSplitter.from_huggingface_tokenizer( tokenizer, chunk_size=100, chunk_overlap=10 ) texts = text_splitter.split_text(pages[0].page_content)

unset unset HTMLHeaderTextSplitter unset unset

HTMLHeaderTextSplitter是一個網頁代分碼塊器，它根據 HTML 元素拆分文本，並將相關後設資料分配給分塊內的每個檔頭。它可以返回單個分塊或將具有相同後設資料的元素組合在一起，以保持語意分組並保留文件的結構上下文。此拆分器可與分塊管道中的其他文本拆分器結合使用。

from langchain_text_splitters import HTMLHeaderTextSplitter html_string = """ <!DOCTYPE html> <html> <body> <div> <h1>Foo</h1> <p>Some intro text about Foo.</p> <div> <h2>Bar main p</h2> <p>Some intro text about Bar.</p> <h3>Bar subp 1</h3> <p>Some text about the first subtopic of Bar.</p> <h3>Bar subp 2</h3> <p>Some text about the second subtopic of Bar.</p> </div> <div> <h2>Baz</h2> <p>Some text about Baz</p> </div> <br> <p>Some concluding text about Foo</p> </div> </body> </html> """ headers_to_split_on = [ ("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3"), ] html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on) html_header_splits = html_splitter.split_text(html_string) html_header_splits """ [Document(page_content='Foo'), Document(page_content='Some intro text about Foo. \nBar main p Bar subp 1 Bar subp 2', metadata={'Header 1': 'Foo'}), Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main p'}), Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main p', 'Header 3': 'Bar subp 1'}), Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main p', 'Header 3': 'Bar subp 2'}), Document(page_content='Baz', metadata={'Header 1': 'Foo'}), Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}), Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})] """

unset unset MarkdownHeaderTextSplitter unset unset

類似於 HTMLHeaderTextSplitter ，專用於 markdown 檔。

from langchain_text_splitters import MarkdownHeaderTextSplitter markdown_document = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly" headers_to_split_on = [ ("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3"), ] markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on) md_header_splits = markdown_splitter.split_text(markdown_document) md_header_splits """ [Document(page_content='Hi this is Jim \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}), Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}), Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})] """

unset unset RecursiveJsonSplitter unset unset

import requests # This is a large nested json object and will be loaded as a python dict json_data = requests.get("https://api.smith.langchain.com/openapi.json").json() from langchain_text_splitters import RecursiveJsonSplitter splitter = RecursiveJsonSplitter(max_chunk_size=300) # Recursively split json data - If you need to access/manipulate the smaller json chunks json_chunks = splitter.split_json(json_data=json_data) json_chunks """ {'openapi': '3.0.2', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}, {'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}}, {'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'required': True, 'schema': {'title': 'Session Id', 'type': 'string', 'format': 'uuid'}, 'name': 'session_id', 'in': 'path'}, {'required': False, 'schema': {'title': 'Include Stats', 'type': 'boolean', 'default': False}, 'name': 'include_stats', 'in': 'query'}, {'required': False, 'schema': {'title': 'Accept', 'type': 'string'}, 'name': 'accept', 'in': 'header'}]}}}}, {'paths': {'/api/v1/sessions/{session_id}': {'get': {'responses': {'200': {'description': 'Successful Response', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/TracerSession'}}}}}}}}}, {'paths': {'/api/v1/sessions/{session_id}': {'get': {'responses': {'422': {'description': 'Validation Error', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/HTTPValidationError'}}}}}, 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}, ... {'components': {'securitySchemes': {'API Key': {'type': 'apiKey', 'in': 'header', 'name': 'X-API-Key'}, 'Tenant ID': {'type': 'apiKey', 'in': 'header', 'name': 'X-Tenant-Id'}, 'Bearer Auth': {'type': 'http', 'scheme': 'bearer'}}}}] """

unset unset Split Code unset unset

Langchain 中的「Split Code」概念是指將程式碼劃分為更小、更易於管理的段或塊的過程。

from langchain_text_splitters import Language [e.value for e in Language] """ ['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'swift', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl'] """

from langchain_text_splitters import ( Language, RecursiveCharacterTextSplitter, ) PYTHON_CODE = """ def hello_world(): print("Hello, World!") # Call the function hello_world() " "" python_splitter = RecursiveCharacterTextSplitter.from_language( language=Language.PYTHON, chunk_size=50, chunk_overlap=0 ) python_docs = python_splitter.create_documents([PYTHON_CODE]) python_docs """ [Document(page_content='def hello_world():\n print("Hello, World!")'), Document(page_content='# Call the function\nhello_world()')] """ JS_CODE = """ function helloWorld() { console.log("Hello, World!"); } // Call the function helloWorld(); " ""

js_splitter = RecursiveCharacterTextSplitter.from_language( language=Language.JS, chunk_size=60, chunk_overlap=0 ) js_docs = js_splitter.create_documents([JS_CODE]) js_docs """ [Document(page_content='function helloWorld() {\n console.log("Hello, World!");\n}'), Document(page_content='// Call the function\nhelloWorld();')] """