
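Xiaobei's ByteDance Youth Training Camp and LangChain Hands-On Course: A Deep Dive into Output Parsers and Refactoring with the Pydantic Parser (continuously updated)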

news 2025/4/26 18:37:50

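Preface

ByteDance's Youth Training Camp recently set sail again. As a second-time participant, I (Xiaobei) am honored to use this chance to introduce the camp in some detail to readers who have not heard of it yet. The camp is a place to learn and grow technically, and also a bridge between the future and our dreams.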

Xiaobei's Youth Training Camp X MarsCode Technical Training Camp: AI Boost, Answers to the ByteDance Youth Training Camp Entrance Assessment (continuously updated) - CSDN Blog: https://blog.csdn.net/Zhiyilang/article/details/143384787?sharetype=blogdetail&sharerId=143384787&sharerefer=PC&sharesource=Zhiyilang&spm=1011.2480.3001.8118

Xiaobei's ByteDance Youth Training Camp and LangChain Hands-On Course: Exploring the New Frontier of AI Technology (continuously updated) - CSDN Blog: https://blog.csdn.net/Zhiyilang/article/details/143454165
Xiaobei's ByteDance Youth Training Camp: LangChain Installation and Quick Start (continuously updated) - CSDN Blog: https://blog.csdn.net/Zhiyilang/article/details/143455380

Xiaobei's ByteDance Youth Training Camp: Building the “易速鲜花” Internal Employee Knowledge-Base Q&A System with LangChain (continuously updated) - CSDN Blog: https://blog.csdn.net/Zhiyilang/article/details/143456544
Xiaobei's ByteDance Youth Training Camp and Prompt Engineering (Part 1): Writing Fitting Copy with FewShotTemplate and ExampleSelector (continuously updated) - CSDN Blog: https://blog.csdn.net/Zhiyilang/article/details/143468624?sharetype=blogdetail&sharerId=143468624&sharerefer=PC&sharesource=Zhiyilang&spm=1011.2480.3001.8118
Xiaobei's ByteDance Youth Training Camp and Prompt Engineering (Part 2): Improving the Model's Reasoning with Chain of Thought and Tree of Thoughts (continuously updated) - CSDN Blog: https://blog.csdn.net/Zhiyilang/article/details/143494432
Xiaobei's ByteDance Youth Training Camp and Calling Models: OpenAI API vs. Fine-Tuning Open-Source Llama2/ChatGLM (continuously updated) - CSDN Blog: https://blog.csdn.net/Zhiyilang/article/details/143495077

Hello, hello, this is zyll (Beizhuo). Welcome to Xiaobei's study notes for the LangChain hands-on course!

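In this era of rapid change, every technical advance pushes the world forward a little faster. In the previous lesson, we learned how to generate attractive descriptions for a set of flowers and prices, and how to store those descriptions and the reasons behind them in a CSV file. To do that, we called an OpenAI model and used the structured output parser, along with some data processing and storage tools. Today we take a closer look at the output parsers in LangChain and use a new one, the Pydantic parser, to refactor the program from Lesson 5. This is also the last lecture on the Model I/O framework.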

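Output Parsers in LangChain

A language model outputs text, which is meant for humans to read. Often, though, what you actually want is structured information that a program can process. This is where output parsers come in.

An output parser is a class dedicated to processing and structuring a language model's response. A basic output parser class usually implements the following core methods:

get_format_instructions: returns a string that tells the language model how to format its output, that is, how to organize and structure its answer.
parse: takes a string (the model's output) and parses it into a specific data structure or format. This step is usually where we make sure the output matches our expectations and can be processed further in the form we need.

parse_with_prompt (optional): takes a string (the model's output) and a prompt (the one that produced that output) and parses them into a specific data structure. This lets you use the original prompt to correct or re-parse the output so that the result is more accurate and closer to what was asked for.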

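Here is a simple pseudocode example based on the description above:

class OutputParser:
    def __init__(self):
        pass

    def get_format_instructions(self):
        # Return a string that tells the model how to format its output
        pass

    def parse(self, model_output):
        # Parse the model's output into some data structure or format
        pass

    def parse_with_prompt(self, model_output, prompt):
        # Parse the model's output into some data structure or format,
        # taking the original prompt into account
        pass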
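To make the interface concrete, here is a minimal sketch of a parser that implements the two required methods in plain Python. The class name CommaSeparatedListParser and the instruction wording are assumptions made for illustration; this is not LangChain's own list parser.

```python
from typing import List


class CommaSeparatedListParser:
    """Hypothetical parser: asks for a comma-separated list and splits it."""

    def get_format_instructions(self) -> str:
        # Text meant to be appended to the prompt sent to the model.
        return "Your response should be a list of comma separated values, e.g. `foo, bar, baz`"

    def parse(self, model_output: str) -> List[str]:
        # Split on commas, trim whitespace, and drop empty items.
        return [item.strip() for item in model_output.split(",") if item.strip()]


parser = CommaSeparatedListParser()
print(parser.get_format_instructions())
print(parser.parse("rose, lily , carnation"))
```

The point of the pairing is that the instructions and the parsing logic describe the same format: whatever get_format_instructions asks for, parse must be able to read.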

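LangChain provides a variety of output parsers, including:

List parser: handles list output generated by the model.
Datetime parser: handles output related to dates and times.
Enum parser: handles output restricted to a predefined set of values.
Structured output parser: handles complex, structured output.
Pydantic (JSON) parser: handles JSON object output that conforms to a specific format.
Auto-fixing parser: automatically fixes certain common errors in model output.
Retry parser: when the model's first output does not meet expectations, tries to repair it or regenerate a new output.

Of these, the first three are easy to understand, and you have already used the structured output parser. So next we focus on the Pydantic (JSON) parser, the auto-fixing parser, and the retry parser.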
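As a rough illustration of why the enum and datetime parsers are easy to understand, the sketch below does the same kind of parsing with the standard library only. The Color enum, the function names, and the date format are assumptions chosen for this example, not LangChain's implementation.

```python
from datetime import datetime
from enum import Enum


class Color(Enum):
    RED = "red"
    PINK = "pink"
    WHITE = "white"


def parse_enum(model_output: str) -> Color:
    # Lookup by value raises ValueError for anything outside the predefined set.
    return Color(model_output.strip().lower())


def parse_datetime(model_output: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> datetime:
    # strptime raises ValueError if the text does not match the expected format.
    return datetime.strptime(model_output.strip(), fmt)


print(parse_enum(" Pink\n"))
print(parse_datetime("2025-04-26 18:37:50"))
```

In both cases the parser either returns a typed value or raises an error, which is the behavior the rest of a program can rely on.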

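Hands-On with the Pydantic (JSON) Parser

The Pydantic (JSON) parser is probably the most commonly used and most important parser.

Pydantic is a Python library for data validation and settings management, built mainly on Python type hints. Although it is not designed specifically for JSON, JSON is a common data format in modern web applications and API interactions, so Pydantic is especially useful for processing and validating JSON data.

Next, we will use the Pydantic parser to refactor the flower copywriting program.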

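Step 1: Create the model instance

First set the OpenAI API key through an environment variable, then use the LangChain library to create an OpenAI model instance, choosing gpt-3.5-turbo-instruct as the large language model.

# ------Part 1
# Set the OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = 'Your OpenAI API Key'

# Create the model instance
# (legacy import path from LangChain 0.0.x; newer releases expose this class in the langchain_openai package)
from langchain import OpenAI
model = OpenAI(model_name='gpt-3.5-turbo-instruct')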

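Step 2: Define the format of the output data

Create an empty DataFrame to store the descriptions generated by the model. Then define a Pydantic BaseModel class named FlowerDescription to specify the expected data format (that is, the structure of the data).

# ------Part 2
# Create an empty DataFrame to store the results
import pandas as pd
df = pd.DataFrame(columns=["flower_type", "price", "description", "reason"])

# Prepare the data
flowers = ["玫瑰", "百合", "康乃馨"]  # rose, lily, carnation
prices = ["50", "30", "20"]

# Define the data format we want to receive
from pydantic import BaseModel, Field

class FlowerDescription(BaseModel):
    flower_type: str = Field(description="鲜花的种类")  # the kind of flower
    price: int = Field(description="鲜花的价格")  # the price of the flower
    description: str = Field(description="鲜花的描述文案")  # the descriptive copy
    reason: str = Field(description="为什么要这样写这个文案")  # why the copy is written this way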

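Pydantic's features include:

Data validation: automatically checks that input data matches the declared types and any other validation conditions.
Data conversion: can convert data automatically, for example turning a string into an integer.
Ease of use: you only need Python type annotations in the class definition to declare the type of each field.
JSON support: you can easily create Pydantic class instances from JSON data and convert an instance's data back to JSON.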
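A short sketch of the validation and conversion features, assuming the pydantic package is installed. The model below is a trimmed, two-field variant of FlowerDescription with English descriptions, written only for this illustration; the calls used here behave the same way in Pydantic v1 and v2.

```python
import json

from pydantic import BaseModel, Field, ValidationError


class FlowerDescription(BaseModel):
    flower_type: str = Field(description="kind of flower")
    price: int = Field(description="price of the flower")


# Conversion: the string "50" in the JSON data is coerced to the integer 50.
ok = FlowerDescription(**json.loads('{"flower_type": "rose", "price": "50"}'))
print(ok.price, type(ok.price).__name__)

# Validation: a value that cannot be read as an integer is rejected.
try:
    FlowerDescription(flower_type="lily", price="cheap")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

This is why the prices list can hold strings such as "50" while the parsed records end up with integer prices.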

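Step 3: Create the output parser

Use PydanticOutputParser from the LangChain library to create the output parser. It will parse the model's output and make sure it conforms to the FlowerDescription format. Then call the parser's get_format_instructions method to obtain the output format instructions.

# ------Part 3
# Create the output parser
from langchain.output_parsers import PydanticOutputParser
output_parser = PydanticOutputParser(pydantic_object=FlowerDescription)

# Get the output format instructions
format_instructions = output_parser.get_format_instructions()
# Print them (the label "输出格式：" means "Output format:")
print("输出格式：", format_instructions)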

The program prints:

输出格式： The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
{"properties": {"flower_type": {"title": "Flower Type", "description": "\u9c9c\u82b1\u7684\u79cd\u7c7b", "type": "string"}, "price": {"title": "Price", "description": "\u9c9c\u82b1\u7684\u4ef7\u683c", "type": "integer"}, "description": {"title": "Description", "description": "\u9c9c\u82b1\u7684\u63cf\u8ff0\u6587\u6848", "type": "string"}, "reason": {"title": "Reason", "description": "\u4e3a\u4ec0\u4e48\u8981\u8fd9\u6837\u5199\u8fd9\u4e2a\u6587\u6848", "type": "string"}}, "required": ["flower_type", "price", "description", "reason"]}

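Next, we pass this text into the model's prompt as well, so that the prompt given to the model and the requirements of the output parser match each other.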
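The sketch below shows, in simplified form, how format instructions of this kind can be assembled by embedding a JSON schema in a fixed piece of text. The function build_format_instructions and the hand-written schema are assumptions for illustration, not LangChain's code. It also shows why the Chinese field descriptions appear as \uXXXX sequences in the printed instructions: json.dumps escapes non-ASCII characters by default.

```python
import json

# Hypothetical, hand-written schema mirroring two FlowerDescription fields.
schema = {
    "properties": {
        "flower_type": {"description": "鲜花的种类", "type": "string"},
        "price": {"description": "鲜花的价格", "type": "integer"},
    },
    "required": ["flower_type", "price"],
}


def build_format_instructions(schema: dict) -> str:
    # Embed the schema as JSON text so the model sees the exact field names and types.
    return (
        "The output should be formatted as a JSON instance that conforms to "
        "the JSON schema below.\n\nHere is the output schema:\n" + json.dumps(schema)
    )


instructions = build_format_instructions(schema)
print(instructions)
```

Because the same schema drives both the instructions and the later validation, the model is told exactly what the parser will check.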

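Step 4: Create the prompt template

Define a prompt template that will be used to generate the input prompts for the model. The template contains the variables to be filled in (such as the price and the kind of flower) as well as the output format instructions obtained earlier.

# ------Part 4
# Create the prompt template
from langchain import PromptTemplate
# The template text is in Chinese. It says: "You are a professional copywriter for a flower shop.
# For {flower} priced at {price} yuan, can you provide an attractive short description in Chinese?"
prompt_template = """您是一位专业的鲜花店文案撰写员。
对于售价为 {price} 元的 {flower} ，您能提供一个吸引人的简短中文描述吗？
{format_instructions}"""

# Create the prompt from the template and add the output parser's instructions to it
prompt = PromptTemplate.from_template(
    prompt_template,
    partial_variables={"format_instructions": format_instructions},
)

# Print the prompt (the label "提示：" means "Prompt:")
print("提示：", prompt)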

Output:

提示： 
input_variables=['flower', 'price'] output_parser=None partial_variables={'format_instructions': 'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\n
As an example, for the schema {
"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, 
"required": ["foo"]}}\n
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. 
The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\n
Here is the output schema:\n```\n
{"properties": {
"flower_type": {"title": "Flower Type", "description": "\\u9c9c\\u82b1\\u7684\\u79cd\\u7c7b", "type": "string"}, 
"price": {"title": "Price", "description": "\\u9c9c\\u82b1\\u7684\\u4ef7\\u683c", "type": "integer"}, 
"description": {"title": "Description", "description": "\\u9c9c\\u82b1\\u7684\\u63cf\\u8ff0\\u6587\\u6848", "type": "string"}, 
"reason": {"title": "Reason", "description": "\\u4e3a\\u4ec0\\u4e48\\u8981\\u8fd9\\u6837\\u5199\\u8fd9\\u4e2a\\u6587\\u6848", "type": "string"}}, 
"required": ["flower_type", "price", "description", "reason"]}\n```'} template='您是一位专业的鲜花店文案撰写员。
\n对于售价为 {price} 元的 {flower} ，您能提供一个吸引人的简短中文描述吗？\n
{format_instructions}' template_format='f-string' validate_template=True

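In short, the prompt template is a tool for generating model input. You define the input variables you need and the format and structure of the template string, then use the template to generate a description request for each flower.

Later, we will feed the actual data into the template in a loop to produce one concrete prompt after another. Let's continue.

Step 5: Generate the prompts, pass them to the model, and parse the output

Loop over all the flowers and their prices. For each flower, build the input from the prompt template and get the model's output. Parse that output with the parser created earlier and append the parsed result to the DataFrame.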
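The idea of a partial variable can be sketched with the standard library alone: one placeholder is fixed up front, and the remaining ones are filled per item. This is an analogy using str.format and functools.partial, not how PromptTemplate is implemented; the English template text and the short instruction string are stand-ins invented for the example.

```python
from functools import partial

template = (
    "You are a professional copywriter for a flower shop.\n"
    "For {flower} priced at {price} yuan, can you write a short, attractive description?\n"
    "{format_instructions}"
)

# Fix format_instructions once, similar in spirit to partial_variables.
fill = partial(
    template.format,
    format_instructions="Return JSON with keys flower_type and price.",
)

for flower, price in zip(["rose", "lily"], ["50", "30"]):
    # Only the remaining variables need to be supplied for each item.
    print(fill(flower=flower, price=price))
    print("---")
```

Fixing the instructions once keeps the per-flower call short and guarantees every prompt carries identical format requirements.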

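# ------Part 5
for flower, price in zip(flowers, prices):
    # Build the model input from the prompt template
    prompt_text = prompt.format(flower=flower, price=price)
    # Print the prompt (the label "提示：" means "Prompt:")
    print("提示：", prompt_text)
    # Get the model's output
    output = model(prompt_text)
    # Parse the model's output into a FlowerDescription instance
    parsed_output = output_parser.parse(output)
    # Convert the Pydantic object to a dictionary
    parsed_output_dict = parsed_output.dict()
    # Append the parsed output to the DataFrame
    df.loc[len(df)] = parsed_output_dict

# Print the collected records (the label "输出的数据：" means "Output data:")
print("输出的数据：", df.to_dict(orient='records'))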

Concretely, one of the printed prompts looks like this:

提示：您是一位专业的鲜花店文案撰写员。对于售价为 20 元的康乃馨，您能提供一个吸引人的简短中文描述吗？

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}

the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:

{"properties": {"flower_type": {"title": "Flower Type", "description": "\u9c9c\u82b1\u7684\u79cd\u7c7b", "type": "string"}, "price": {"title": "Price", "description": "\u9c9c\u82b1\u7684\u4ef7\u683c", "type": "integer"}, "description": {"title": "Description", "description": "\u9c9c\u82b1\u7684\u63cf\u8ff0\u6587\u6848", "type": "string"}, "reason": {"title": "Reason", "description": "\u4e3a\u4ec0\u4e48\u8981\u8fd9\u6837\u5199\u8fd9\u4e2a\u6587\u6848", "type": "string"}}, "required": ["flower_type", "price", "description", "reason"]}

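Next, the program parses the model's output. In this step, the output parser defined earlier (output_parser) turns the model's output into a FlowerDescription instance. FlowerDescription is the Pydantic class defined earlier; it holds the flower's type, price, description, and the reason for the description.

Then the parsed output is added to the DataFrame. In this step, the parsed output (the FlowerDescription instance) is converted to a dictionary, and the dictionary is appended to the DataFrame that stores all the flower descriptions.

Finally, print all the results; optionally, you can also save them to a CSV file.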
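What the parsing step has to accomplish can be sketched without any library: locate the JSON object in the model's text, load it, and check it against the expected fields. The function parse_flower below is a simplified stand-in written for this illustration; the real PydanticOutputParser delegates the validation to the Pydantic class.

```python
import json
import re


def parse_flower(text: str) -> dict:
    # Grab the outermost {...} span so any surrounding chatter is ignored.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group())
    # Check the required fields, then coerce price to int as the schema demands.
    for key in ("flower_type", "price", "description", "reason"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    data["price"] = int(data["price"])
    return data


raw = 'Sure! {"flower_type": "lily", "price": "30", "description": "d", "reason": "r"}'
record = parse_flower(raw)
print(record)
```

The returned dictionary has the same keys as the DataFrame columns, which is what lets each record be appended as a row.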
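For the optional CSV step, a minimal sketch with the standard csv module is shown below. The two records and the in-memory buffer are made up for the example; with the DataFrame from this lesson, the usual route would be df.to_csv with index=False and a file name of your choice.

```python
import csv
import io

# Hypothetical records in the same shape as df.to_dict(orient='records').
records = [
    {"flower_type": "lily", "price": 30, "description": "d1", "reason": "r1"},
    {"flower_type": "carnation", "price": 20, "description": "d2", "reason": "r2"},
]

# StringIO stands in for a real file opened with newline="" and encoding="utf-8".
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["flower_type", "price", "description", "reason"]
)
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```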

输出的数据： 
[{'flower_type': 'Rose', 'price': 50, 'description': '玫瑰是最浪漫的花，它具有柔和的粉红色，有着浓浓的爱意，价格实惠，50元就可以拥有一束玫瑰。', 'reason': '玫瑰代表着爱情，是最浪漫的礼物，以实惠的价格，可以让您尽情体验爱的浪漫。'}, 
{'flower_type': '百合', 'price': 30, 'description': '这支百合，柔美的花蕾，在你的手中摇曳，仿佛在与你深情的交谈', 'reason': '营造浪漫氛围'}, 
{'flower_type': 'Carnation', 'price': 20, 'description': '艳丽缤纷的康乃馨，带给你温馨、浪漫的气氛，是最佳的礼物选择！', 'reason': '康乃馨是一种颜色鲜艳、芬芳淡雅、具有浪漫寓意的鲜花，非常适合作为礼物，而且20元的价格比较实惠。'}]

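Hands-On with the Auto-Fixing Parser (OutputFixingParser)

The auto-fixing parser is mainly used to correct small format errors. When the output is not formatted correctly, it tries to repair the format instead of regenerating the output.

First, let's deliberately set up a parsing error.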

# Import the required libraries and modules
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List

# Use Pydantic to define a data format that represents a flower
class Flower(BaseModel):
    name: str = Field(description="name of a flower")
    colors: List[str] = Field(description="the colors of this flower")

# Define a query for getting the list of colors of some flower
flower_query = "Generate the characters for a random flower."

# Define a misformatted output (single quotes are not valid JSON)
misformatted = "{'name': '康乃馨', 'colors': ['粉红色','白色','红色','紫色','黄色']}"

# Create a Pydantic parser for the output; we want to parse it into the Flower format
parser = PydanticOutputParser(pydantic_object=Flower)
# Parse the misformatted output with the Pydantic parser
parser.parse(misformatted)

Running this code raises an error.

langchain.schema.output_parser.OutputParserException: Failed to parse Flower from completion {'name': '康乃馨', 'colors': ['粉红色','白色']}. Got: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
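The underlying JSON error can be reproduced with the standard library alone. The sketch below uses a shortened English variant of the misformatted string (an assumption for readability) and also shows that the same text is a valid Python literal, which is why a single-quoted dictionary is an easy mistake for a model to make.

```python
import ast
import json

misformatted = "{'name': 'carnation', 'colors': ['pink', 'white']}"

try:
    json.loads(misformatted)
    error_message = None
except json.JSONDecodeError as exc:
    # JSON requires double quotes around property names and strings.
    error_message = str(exc)
print(error_message)

# The same text is a valid Python literal, so ast.literal_eval can read it.
as_python = ast.literal_eval(misformatted)
print(json.dumps(as_python))
```

Re-serializing with json.dumps produces double-quoted, valid JSON, which is one way to repair this particular mistake by hand.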

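The cause is that the string uses single quotes, whereas JSON requires double quotes around property names and string values. We could simply correct the quotes by hand, but that is not how I want to solve the problem here. Instead, let's try OutputFixingParser and let it resolve this kind of format error for us automatically.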

# Import the required modules from the langchain library
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import OutputFixingParser

# Set the OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = 'Your OpenAI API Key'

# Use OutputFixingParser to create a new parser that can correct misformatted output
new_parser = OutputFixingParser.from_llm(parser=parser, llm=ChatOpenAI())

# Parse the misformatted output with the new parser
result = new_parser.parse(misformatted)  # the error is fixed automatically
print(result)  # print the parsed result

Parse with new_parser instead of parser and the JSON format error goes away; the program no longer fails.

The output is:

name='Rose' colors=['red', 'pink', 'white']

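The secret is that OutputFixingParser first calls the wrapped PydanticOutputParser. If parsing succeeds, it returns the result. If parsing fails, it passes the misformatted output together with the format instructions to the large language model and asks the LLM to fix it.

Pretty neat: the large model not only supplies knowledge, it can also help analyze and fix the errors our programs run into.

To recap this example: we had a misformatted JSON string that raised an error when parsed with PydanticOutputParser, and we used OutputFixingParser to try to repair the format automatically.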
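The try-then-fix flow described above can be sketched in a few lines. Everything here is a stand-in written for illustration: parse_flower plays the role of the wrapped parser, fake_llm replaces a real model call, and the wording of the fix prompt is an assumption rather than LangChain's actual prompt.

```python
import json
from typing import Callable


def parse_flower(text: str) -> dict:
    # Base parser: strict JSON that must contain the two required fields.
    data = json.loads(text)
    if not {"name", "colors"} <= set(data):
        raise ValueError("missing required fields")
    return data


def parse_with_fix(text: str, instructions: str, llm: Callable[[str], str]) -> dict:
    try:
        return parse_flower(text)  # success: return immediately, no model call
    except ValueError as exc:
        # Failure: hand the bad output, the instructions, and the error to the model.
        fix_prompt = (
            f"Instructions:\n{instructions}\n\nCompletion:\n{text}\n\n"
            f"Error:\n{exc}\n\nPlease return a corrected completion."
        )
        return parse_flower(llm(fix_prompt))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; it always returns valid JSON.
    return '{"name": "carnation", "colors": ["pink", "white"]}'


result = parse_with_fix(
    "{'name': 'carnation', 'colors': ['pink', 'white']}", "Return JSON.", fake_llm
)
print(result)
```

Note that the model is only consulted on failure, so well-formed output costs no extra call.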

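Hands-On with the Retry Parser (RetryWithErrorOutputParser)

When the model's first output does not meet expectations, the retry parser tries to generate a new output. It interacts with the model again and relies on the model's reasoning ability to recover the relevant information, so that the output is more complete and closer to what was expected.

Again, we start by setting up an error in the parsing process.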

# Define a template string for generating the question
# (kept from the original listing; the PromptTemplate below uses its own template string)
template = """Based on the user question, provide an Action and Action Input for what step should be taken.
{format_instructions}
Question: {query}
Response:"""

# Define a Pydantic data format that describes an "action" class and its attributes
from pydantic import BaseModel, Field

class Action(BaseModel):
    action: str = Field(description="action to take")
    action_input: str = Field(description="input to the action")

# Initialize an output parser with the Pydantic format Action
from langchain.output_parsers import PydanticOutputParser
parser = PydanticOutputParser(pydantic_object=Action)

# Define a prompt template that will be used to query the model
from langchain.prompts import PromptTemplate
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
prompt_value = prompt.format_prompt(query="What are the colors of Orchid?")

# Define a badly formed response (the action_input field is missing)
bad_response = '{"action": "search"}'
parser.parse(bad_response)  # parsing it directly raises an error

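Because bad_response provides only the action field and not the action_input field, it does not match what the Action data format expects, so parsing fails.

We first try to resolve this error with OutputFixingParser.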

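from langchain.output_parsers import OutputFixingParser
from langchain.chat_models import ChatOpenAI

# Wrap the Pydantic parser so that failed parses are sent to the model for repair
fix_parser = OutputFixingParser.from_llm(parser=parser, llm=ChatOpenAI())
parse_result = fix_parser.parse(bad_response)
# The label "OutputFixingParser的parse结果:" means "OutputFixingParser parse result:"
print('OutputFixingParser的parse结果:', parse_result)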

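Let's look at what this attempt solved and what it did not.

Problems solved:

Incomplete data: the original bad_response provided only the action field, with no action_input field. OutputFixingParser filled the gap and gave action_input the value 'query'.

Problems not solved:

Specificity: although OutputFixingParser supplied the default value 'query' for action_input, that value is not descriptive. The real query is "What are the colors of Orchid?". The fix therefore only provides a generic value and does not actually answer the user's question.
Possible confusion: 'query' could be misread as an instruction to go and query something further, rather than as the actual query input.

There is a more robust option, so finally we try the RetryWithErrorOutputParser.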

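# Initialize RetryWithErrorOutputParser, which asks the model again to get a correct output
from langchain.output_parsers import RetryWithErrorOutputParser
from langchain.llms import OpenAI

retry_parser = RetryWithErrorOutputParser.from_llm(
    parser=parser, llm=OpenAI(temperature=0)
)
# Pass both the bad response and the original prompt value
parse_result = retry_parser.parse_with_prompt(bad_response, prompt_value)
# The label "RetryWithErrorOutputParser的parse结果:" means "RetryWithErrorOutputParser parse result:"
print('RetryWithErrorOutputParser的parse结果:', parse_result)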

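To recap this example: we had an incomplete response that OutputFixingParser could not fully repair, so we used RetryWithErrorOutputParser to regenerate a complete output. This parser did not let us down. It restored the format and, using the original prompt that we passed in, it also recovered the content of the action_input field.

RetryWithErrorOutputParser的parse结果: action='search' action_input='colors of Orchid'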
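The difference from a plain format fix is that the original prompt travels back to the model together with the bad completion and the error. Below is a minimal sketch of that loop; parse_action, parse_with_retry, fake_llm, the retry prompt wording, and the max_retries parameter are all hypothetical names chosen for this illustration, not LangChain's implementation.

```python
import json
from typing import Callable

REQUIRED = ("action", "action_input")


def parse_action(text: str) -> dict:
    # Base parser: valid JSON that contains every required field.
    data = json.loads(text)
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


def parse_with_retry(
    completion: str, prompt: str, llm: Callable[[str], str], max_retries: int = 1
) -> dict:
    for _ in range(max_retries):
        try:
            return parse_action(completion)
        except ValueError as exc:
            # The original prompt is sent back along with the completion and the error.
            completion = llm(
                f"Prompt:\n{prompt}\n\nCompletion:\n{completion}\n\n"
                f"Error:\n{exc}\n\nPlease try again."
            )
    return parse_action(completion)


def fake_llm(retry_prompt: str) -> str:
    # Stand-in model: it can recover the topic only because the prompt is included.
    topic = "colors of Orchid" if "Orchid" in retry_prompt else "query"
    return json.dumps({"action": "search", "action_input": topic})


result = parse_with_retry(
    '{"action": "search"}', "What are the colors of Orchid?", fake_llm
)
print(result)
```

With the prompt available, the stand-in model can fill action_input with something specific; without it, the best it could do is a generic placeholder.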

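Summary

The structured output parser and the Pydantic parser both aim to obtain formatted output from a large language model. The structured output parser is better suited to simple text responses, while the Pydantic parser supports complex data structures and types. Which one to choose depends on the specific needs of the application and the complexity of the output.

The auto-fixing parser is mainly suited to correcting small format errors, while the retry parser can handle more complex problems, including both format errors and missing content.

When choosing a parser, consider the concrete scenario. If you face only format problems, the auto-fixing parser may be enough; if the completeness and accuracy of the output are critical, the retry parser is probably the better choice.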

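Questions to Think About

Which LangChain output parsers have we used so far? Describe how each is used and how they are similar or different.
Try other types of output parsers and share your code with everyone.
Why is a large model able to return data in JSON format? What "magic" does the output parser use to make the model do this?
How exactly is the "fixing" in the auto-fixing parser implemented? Step through it in a debugger and study how LangChain designs the prompt before it calls the large model.
What is the principle behind the retry parser? Which optional method of the parser class does it mainly implement?

There are quite a few questions, so feel free to pick the ones that interest you. I look forward to seeing what you share in the comments. If you found this post helpful, you are also welcome to share it with friends who might need it!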