Building a Large Language Model from Scratch with PyTorch (Part 6): Fine-Tuning for Classification

news 2025/3/25 13:44:58

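6 Fine-tuning for classification
- 6.1 Different categories of fine-tuning
- 6.2 Preparing the dataset
- 6.3 Creating data loaders
- 6.4 Initializing a model with pretrained weights
- 6.5 Adding a classification head
- 6.6 Calculating the classification loss and accuracy
- 6.7 Fine-tuning the model on supervised data
- 6.8 Using the LLM as a spam classifier
Summary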

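6 Fine-tuning for classification

This chapter covers

- Introducing different LLM fine-tuning approaches
- Preparing a dataset for text classification
- Modifying a pretrained LLM for fine-tuning
- Fine-tuning an LLM to identify spam messages
- Evaluating the accuracy of a fine-tuned LLM classifier
- Using a fine-tuned LLM to classify new data

So far, we have coded the LLM architecture, pretrained it, and learned how to import pretrained weights from an external source, such as OpenAI, into our model. Now we will reap the fruits of our labor by fine-tuning the LLM on a specific target task, such as classifying text. The concrete example we examine in this chapter is classifying text messages as "spam" or "not spam." Figure 6.1 highlights the two main ways of fine-tuning an LLM: fine-tuning for classification (step 8) and fine-tuning to follow instructions (step 9).

Figure 6.1 The three main stages of coding an LLM. This chapter focuses on stage 3 (step 8): fine-tuning a pretrained LLM as a classifier.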

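6.1 Different categories of fine-tuning

The most common ways to fine-tune language models are instruction fine-tuning and classification fine-tuning. Instruction fine-tuning involves training a language model on a set of tasks using specific instructions to improve its ability to understand and execute tasks described in natural language prompts, as illustrated in figure 6.2.

Figure 6.2 Two different instruction fine-tuning scenarios. At the top, the model is tasked with determining whether a given text is spam. At the bottom, the model is given an instruction to translate an English sentence into German.

In classification fine-tuning, a concept you might already be familiar with if you have a background in machine learning, the model is trained to recognize a specific set of class labels, such as "spam" and "not spam." Examples of classification tasks extend beyond LLMs and email filtering: they include identifying different species of plants from images; categorizing news articles into topics like sports, politics, and technology; and distinguishing between benign and malignant tumors in medical imaging.

The key point is that a classification fine-tuned model is restricted to predicting classes it has encountered during its training. For instance, it can determine whether something is "spam" or "not spam," as illustrated in figure 6.3, but it cannot say anything else about the input text.

Figure 6.3 A text classification scenario using an LLM. A model fine-tuned for spam classification does not require further instruction alongside the input. In contrast to an instruction fine-tuned model, it can only respond with "spam" or "not spam."

In contrast to the classification fine-tuned model depicted in figure 6.3, an instruction fine-tuned model typically can undertake a broader range of tasks. We can view a classification fine-tuned model as highly specialized, and generally, it is easier to develop a specialized model than a generalist model that works well across various tasks.

Choosing the right approach
Instruction fine-tuning improves a model's ability to understand and generate responses based on specific user instructions. It is best suited for models that need to handle a variety of tasks based on complex user instructions, improving flexibility and interaction quality. Classification fine-tuning is ideal for projects requiring precise categorization of data into predefined classes, such as sentiment analysis or spam detection.

While instruction fine-tuning is more versatile, it demands larger datasets and greater computational resources to develop models proficient in various tasks. In contrast, classification fine-tuning requires less data and compute power, but its use is confined to the specific classes on which the model has been trained.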

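6.2 Preparing the dataset

We will modify and classification fine-tune the GPT model we previously implemented and pretrained. We begin by downloading and preparing the dataset, as highlighted in figure 6.4. To provide an intuitive and useful example of classification fine-tuning, we will work with a text message dataset that consists of spam and non-spam messages.

Figure 6.4 The three-stage process for classification fine-tuning an LLM. Stage 1 involves dataset preparation, stage 2 focuses on model setup, and stage 3 covers fine-tuning and evaluating the model.

Note: Text messages are typically sent via phone, not email. However, the same steps also apply to email classification, and interested readers can find links to email spam classification datasets in appendix B.

The first step is to download the dataset.

Listing 6.1 Downloading and unzipping the dataset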

import urllib.request
import zipfile
import os
from pathlib import Path

url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extracted_path = "sms_spam_collection"
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"

def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
    if data_file_path.exists():
        print(f"{data_file_path} already exists. Skipping download and extraction.")
        return

    # Downloading the file
    with urllib.request.urlopen(url) as response:
        with open(zip_path, "wb") as out_file:
            out_file.write(response.read())

    # Unzipping the file
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extracted_path)

    # Add .tsv file extension
    original_file_path = Path(extracted_path) / "SMSSpamCollection"
    os.rename(original_file_path, data_file_path)
    print(f"File downloaded and saved as {data_file_path}")

try:
    download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError) as e:
    print(f"Primary URL failed: {e}. Trying backup URL...")
    url = "https://f001.backblazeb2.com/file/LLMs-from-scratch/sms%2Bspam%2Bcollection.zip"
    download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)

File downloaded and saved as sms_spam_collection\SMSSpamCollection.tsv

After executing the preceding code, the dataset is saved as a tab-separated text file, SMSSpamCollection.tsv, in the sms_spam_collection folder. We can load it into a pandas DataFrame as follows:

import pandas as pd

df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])
print(df)

     Label                                               Text
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
...    ...                                                ...
5567  spam  This is the 2nd time we have tried 2 contact u...
5568   ham               Will ü b going to esplanade fr home?
5569   ham  Pity, * was in mood for that. So...any other s...
5570   ham  The guy did some bitching but I acted like i'd...
5571   ham                         Rofl. Its true to its name

[5572 rows x 2 columns]

Let's examine the class label distribution:

print(df["Label"].value_counts())

Executing the previous code, we find that the data contains "ham" (i.e., not spam) far more frequently than "spam":

Label
ham     4825
spam     747
Name: count, dtype: int64

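For simplicity, and because we prefer a small dataset (which will facilitate faster fine-tuning of the LLM), we choose to undersample the dataset to include 747 instances from each class.

Note: There are several other methods to handle class imbalances, but these are beyond the scope of this book. Readers interested in exploring methods for dealing with imbalanced data can find additional information in appendix B.

We can use the code in the following listing to undersample and create a balanced dataset.

Listing 6.2 Creating a balanced dataset

def create_balanced_dataset(df):
    # Count the instances of "spam"
    num_spam = df[df["Label"] == "spam"].shape[0]

    # Randomly sample "ham" instances to match the number of "spam" instances
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)

    # Combine ham "subset" with "spam"
    balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]])

    return balanced_df

balanced_df = create_balanced_dataset(df)
print(balanced_df["Label"].value_counts())

After executing the previous code to balance the dataset, we can see that we now have equal amounts of spam and non-spam messages: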

Label
ham     747
spam    747
Name: count, dtype: int64

Next, we convert the string class labels "ham" and "spam" into integer class labels 0 and 1, respectively:

balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})

This process is similar to converting text into token IDs. However, instead of using the GPT vocabulary, which consists of more than 50,000 words, we are dealing with just two token IDs: 0 and 1.

      Label                                               Text
4307      0  Awww dat is sweet! We can think of something t...
4138      0                             Just got to  &lt;#&gt;
4831      0  The word "Checkmate" in chess comes from the P...
4461      0  This is wishing you a great day. Moji told me ...
5440      0      Thank you. do you generally date the brothas?
...     ...                                                ...
5537      1  Want explicit SEX in 30 secs? Ring 02073162414...
5540      1  ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...
5547      1  Had your contract mobile 11 Mnths? Latest Moto...
5566      1  REMINDER FROM O2: To get 2.50 pounds free call...
5567      1  This is the 2nd time we have tried 2 contact u...

[1494 rows x 2 columns]

Next, we create a random_split function to split the dataset into three parts: 70% for training, 10% for validation, and 20% for testing. (These ratios are common in machine learning to train, adjust, and evaluate models.)

Listing 6.3 Splitting the dataset

def random_split(df, train_frac, validation_frac):
    # Shuffle the entire DataFrame
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)

    # Calculate split indices
    train_end = int(len(df) * train_frac)
    validation_end = train_end + int(len(df) * validation_frac)

    # Split the DataFrame
    train_df = df[:train_end]
    validation_df = df[train_end:validation_end]
    test_df = df[validation_end:]

    return train_df, validation_df, test_df

train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)
# Test size is implied to be 0.2 as the remainder
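
As a quick sanity check on the split logic, the index arithmetic in random_split can be reproduced without pandas. The sketch below assumes the balanced dataset has 1,494 rows (747 per class, as printed earlier); the helper name split_sizes is hypothetical and only mirrors the int() truncation used in the listing.

```python
def split_sizes(num_rows, train_frac, validation_frac):
    # Mirrors the index arithmetic of random_split: int() truncates toward zero,
    # and the test set receives whatever rows remain.
    train_end = int(num_rows * train_frac)
    validation_end = train_end + int(num_rows * validation_frac)
    return train_end, validation_end - train_end, num_rows - validation_end


# 1494 rows = 747 "ham" + 747 "spam" messages after undersampling
train_size, validation_size, test_size = split_sizes(1494, 0.7, 0.1)
print(train_size, validation_size, test_size)
```

Because both split indices are truncated, the test set ends up with slightly more than 20% of the rows.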

Let's save the dataset as CSV (comma-separated value) files so we can reuse it later:

train_df.to_csv("train.csv", index=None)
validation_df.to_csv("validation.csv", index=None)
test_df.to_csv("test.csv", index=None)

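So far, we have downloaded the dataset, balanced it, and split it into training and evaluation subsets. Now we will set up the PyTorch data loaders that will be used to train the model.

6.3 Creating data loaders

We will develop PyTorch data loaders conceptually similar to those we implemented while working with text data. Previously, we used a sliding window technique to generate uniformly sized text chunks, which we then grouped into batches for more efficient model training. Each chunk functioned as an individual training instance. However, we are now working with a spam dataset that contains text messages of varying lengths. To batch these messages as we did with the text chunks, we have two primary options:

- Truncate all messages to the length of the shortest message in the dataset or batch.
- Pad all messages to the length of the longest message in the dataset or batch.

The first option is computationally cheaper, but it may result in significant information loss if shorter messages are much shorter than the average or longest messages, potentially reducing model performance. So, we opt for the second option, which preserves the entire content of all messages.

To implement batching, where all messages are padded to the length of the longest message in the dataset, we add padding tokens to all shorter messages. For this purpose, we use "<|endoftext|>" as a padding token. The following code looks up its token ID with the GPT-2 tokenizer from tiktoken:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))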

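Figure 6.6 The input text preparation process. First, each input text message is converted into a sequence of token IDs. Then, to ensure uniform sequence lengths, shorter sequences are padded with a padding token (in this case, token ID 50256) to match the length of the longest sequence.

Indeed, executing the preceding code returns: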

[50256]
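
The padding idea can be sketched with plain Python lists before wrapping it in a Dataset class. This is only an illustrative sketch: the helper name pad_or_truncate and the token ID sequences are made up for the example, and only the padding ID 50256 comes from the tokenizer output above.

```python
PAD_TOKEN_ID = 50256  # token ID of "<|endoftext|>" in the GPT-2 vocabulary


def pad_or_truncate(token_ids, max_length, pad_token_id=PAD_TOKEN_ID):
    # Cut off anything beyond max_length, then fill the remainder with padding.
    clipped = token_ids[:max_length]
    return clipped + [pad_token_id] * (max_length - len(clipped))


# Made-up token ID sequences of different lengths (not real encoded messages)
sequences = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]
longest = max(len(seq) for seq in sequences)
padded = [pad_or_truncate(seq, longest) for seq in sequences]

for row in padded:
    print(row)
```

After padding, every row has the same length, which is what allows the rows to be stacked into a single batch tensor.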

We first need to implement a PyTorch Dataset, which specifies how the data is loaded and processed, before we can instantiate the data loaders. For this purpose, we define the SpamDataset class, which implements the concepts in figure 6.6. This SpamDataset class handles several key tasks: it identifies the longest sequence in the training dataset, encodes the text messages, and ensures that all other sequences are padded with a padding token to match the length of the longest sequence.

Listing 6.4 Setting up a PyTorch Dataset class

import torch
from torch.utils.data import Dataset

class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)

        # Pre-tokenize texts
        self.encoded_texts = [
            tokenizer.encode(text) for text in self.data["Text"]
        ]

        if max_length is None:
            self.max_length = self._longest_encoded_length()
        else:
            self.max_length = max_length
            # Truncate sequences if they are longer than max_length
            self.encoded_texts = [
                encoded_text[:self.max_length]
                for encoded_text in self.encoded_texts
            ]

        # Pad sequences to the longest sequence
        self.encoded_texts = [
            encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))
            for encoded_text in self.encoded_texts
        ]

    def __getitem__(self, index):
        encoded = self.encoded_texts[index]
        label = self.data.iloc[index]["Label"]
        return (
            torch.tensor(encoded, dtype=torch.long),
            torch.tensor(label, dtype=torch.long)
        )

    def __len__(self):
        return len(self.data)

    def _longest_encoded_length(self):
        max_length = 0
        for encoded_text in self.encoded_texts:
            encoded_length = len(encoded_text)
            if encoded_length > max_length:
                max_length = encoded_length
        return max_length
        # Note: A more pythonic version to implement this method
        # is the following, which is also used in the next chapter:
        # return max(len(encoded_text) for encoded_text in self.encoded_texts)

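The SpamDataset class loads data from the CSV files we created earlier, tokenizes the text using the GPT-2 tokenizer from tiktoken, and allows us to pad or truncate the sequences to a uniform length determined by either the longest sequence or a predefined maximum length. This ensures each input tensor is of the same size, which is necessary to create the batches in the training data loader we implement next.

train_dataset = SpamDataset(
    csv_file="train.csv",
    max_length=None,
    tokenizer=tokenizer
)

The longest sequence length is stored in the dataset's max_length attribute. If you are curious to see the number of tokens in the longest sequence, you can use the following code: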

print(train_dataset.max_length)

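The code outputs 120, showing that the longest sequence contains no more than 120 tokens, a common length for text messages. The model can handle sequences of up to 1,024 tokens, given its context length limit. If your dataset includes longer texts, you can pass max_length=1024 when creating the training dataset to ensure that the data does not exceed the model's supported input (context) length.

Next, we pad the validation and test sets to match the length of the longest training sequence. Importantly, any validation and test set samples exceeding the length of the longest training example are truncated using encoded_text[:self.max_length] in the SpamDataset code we defined earlier. This truncation is optional; you can set max_length=None for both validation and test sets, provided there are no sequences exceeding 1,024 tokens in these sets.

val_dataset = SpamDataset(
    csv_file="validation.csv",
    max_length=train_dataset.max_length,
    tokenizer=tokenizer
)
test_dataset = SpamDataset(
    csv_file="test.csv",
    max_length=train_dataset.max_length,
    tokenizer=tokenizer
)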

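Exercise 6.1 Increasing the context length
Pad the inputs to the maximum number of tokens the model supports and observe how it affects the predictive performance.

We can pad the inputs to the maximum number of tokens the model supports by setting the maximum length to 1,024:

from pathlib import Path

# Assumption: base_path is not defined in this article; Path(".") assumes
# the CSV files are in the current working directory
base_path = Path(".")
max_length = 1024

train_dataset = SpamDataset(base_path / "train.csv", max_length=max_length, tokenizer=tokenizer)
val_dataset = SpamDataset(base_path / "validation.csv", max_length=max_length, tokenizer=tokenizer)
test_dataset = SpamDataset(base_path / "test.csv", max_length=max_length, tokenizer=tokenizer)

Alternatively, and equivalently, we can define max_length via (model and BASE_CONFIG are set up in section 6.4):

max_length = model.pos_emb.weight.shape[0]

or

max_length = BASE_CONFIG["context_length"]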

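Using the datasets as inputs, we can instantiate the data loaders similarly to when we were working with text data. However, in this case, the targets represent class labels rather than the next tokens in the text. For instance, if we choose a batch size of 8, each batch will consist of eight training examples of length 120 and the corresponding class label of each example, as illustrated in figure 6.7.

Figure 6.7 A single training batch consisting of eight text messages represented as token IDs. Each text message consists of 120 token IDs. A class label array stores the eight class labels corresponding to the text messages, which can be either 0 ("not spam") or 1 ("spam").

The code in the following listing creates the training, validation, and test set data loaders that load the text messages and labels in batches of size 8.

Listing 6.5 Creating PyTorch data loaders

from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    drop_last=True,
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False,
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False,
)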

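To ensure that the data loaders are working and are, indeed, returning batches of the expected size, we iterate over the training loader and then print the tensor dimensions of the last batch:

print("Train loader:")
for input_batch, target_batch in train_loader:
    pass

print("Input batch dimensions:", input_batch.shape)
print("Label batch dimensions", target_batch.shape)

The output is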

Train loader:
Input batch dimensions: torch.Size([8, 120])
Label batch dimensions torch.Size([8])

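As we can see, the input batches consist of eight training examples with 120 tokens each, as expected. The label tensor stores the class labels corresponding to the eight training examples.

Lastly, to get an idea of the dataset size, let's print the total number of batches in each dataset:

print(f"{len(train_loader)} training batches")
print(f"{len(val_loader)} validation batches")
print(f"{len(test_loader)} test batches")

The number of batches in each dataset is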

130 training batches
19 validation batches
38 test batches
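
These batch counts follow from the split sizes and the drop_last settings. The sketch below reproduces the arithmetic in plain Python; it assumes split sizes of 1,045, 149, and 300 rows (what a 70% / 10% / remainder split of 1,494 rows yields with int() truncation), and the helper name num_batches is hypothetical.

```python
import math


def num_batches(num_examples, batch_size, drop_last):
    # With drop_last=True an incomplete final batch is discarded;
    # otherwise it is kept as a smaller batch.
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


# Split sizes assumed from 1,494 rows split 70% / 10% / remainder
train_batches = num_batches(1045, batch_size=8, drop_last=True)
val_batches = num_batches(149, batch_size=8, drop_last=False)
test_batches = num_batches(300, batch_size=8, drop_last=False)
print(train_batches, val_batches, test_batches)
```

Note that drop_last=True on the training loader discards the five leftover training examples in each epoch, while the validation and test loaders keep their smaller final batch.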

Now that we have prepared the data, we need to prepare the model for fine-tuning.

6.4 Initializing a model with pretrained weights

We must prepare the model for classification fine-tuning to identify spam messages. We start by initializing our pretrained model, as highlighted in figure 6.8.

Figure 6.8 The three-stage process for classification fine-tuning the LLM. Having completed stage 1, preparing the dataset, we now must initialize the LLM, which we will then fine-tune to classify spam messages.

To begin the model preparation process, we employ the same configurations we used to pretrain on unlabeled data:

CHOOSE_MODEL = "gpt2-small (124M)"
INPUT_PROMPT = "Every effort moves"

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

assert train_dataset.max_length <= BASE_CONFIG["context_length"], (
    f"Dataset length {train_dataset.max_length} exceeds model's context "
    f"length {BASE_CONFIG['context_length']}. Reinitialize data sets with "
    f"`max_length={BASE_CONFIG['context_length']}`"
)

Next, we import the download_and_load_gpt2 function from the gpt_download.py file and reuse the GPTModel class and load_weights_into_gpt function from pretraining (chapter 5) to load the downloaded weights into the GPT model.

Listing 6.6 Loading a pretrained GPT model

from gpt_download import download_and_load_gpt2
from previous_chapters import GPTModel, load_weights_into_gpt

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()

gpt_download.py, previous_chapters.py

After loading the model weights into the GPTModel, we reuse the text generation utility functions from chapters 4 and 5 to ensure that the model generates coherent text:

from previous_chapters import (
    generate_text_simple,
    text_to_token_ids,
    token_ids_to_text
)

text_1 = "Every effort moves you"

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_1, tokenizer),
    max_new_tokens=15,
    context_size=BASE_CONFIG["context_length"]
)

print(token_ids_to_text(token_ids, tokenizer))

The following output shows that the model generates coherent text, which indicates that the model weights have been loaded correctly:

Every effort moves you forward.
The first step is to understand the importance of your work

Before we start fine-tuning the model as a spam classifier, let's see whether the model already classifies spam messages by prompting it with instructions:

text_2 = (
    "Is the following text 'spam'? Answer with 'yes' or 'no':"
    " 'You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award.'"
)

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_2, tokenizer),
    max_new_tokens=23,
    context_size=BASE_CONFIG["context_length"]
)

print(token_ids_to_text(token_ids, tokenizer))

The model output is

Is the following text 'spam'? Answer with 'yes' or 'no': 'You are a winner you have been specially selected to receive $1000 cash or a $2000 award.'
The following text 'spam'? Answer with 'yes' or 'no': 'You are a winner

Based on the output, it is apparent that the model is struggling to follow instructions. This result is expected, as it has only undergone pretraining and lacks instruction fine-tuning. So, let's prepare the model for classification fine-tuning.

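6.5 Adding a classification head

We must modify the pretrained LLM to prepare it for classification fine-tuning. To do so, we replace the original output layer, which maps the hidden representation to a vocabulary of 50,257 tokens, with a smaller output layer that maps to two classes: 0 ("not spam") and 1 ("spam"), as shown in figure 6.9. We use the same model as before, except we replace the output layer.

Output layer nodes
We could technically use a single output node since we are dealing with a binary classification task. However, it would require modifying the loss function, as I discuss in "Losses Learned—Optimizing Negative Log-Likelihood and Cross-Entropy in PyTorch" (https://mng.bz/NRZ2). Therefore, we choose a more general approach, where the number of output nodes matches the number of classes. For example, for a three-class problem, such as classifying news articles as "Technology," "Sports," or "Politics," we would use three output nodes, and so forth.

Figure 6.9 Adapting a GPT model for spam classification by altering its architecture. Initially, the model's linear output layer mapped 768 hidden units to a vocabulary of 50,257 tokens. To detect spam, we replace this layer with a new output layer that maps the same 768 hidden units to just two classes, representing "spam" and "not spam."

Before we attempt the modification shown in figure 6.9, let's print the model architecture via print(model):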

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(1024, 768)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    ...  (blocks (0) through (10) are identical to block (11) and omitted here)
    (11): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=True)
        (W_key): Linear(in_features=768, out_features=768, bias=True)
        (W_value): Linear(in_features=768, out_features=768, bias=True)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_resid): Dropout(p=0.0, inplace=False)
    )
  )
  (final_norm): LayerNorm()
  (out_head): Linear(in_features=768, out_features=50257, bias=False)
)

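This output neatly lays out the architecture we built in chapter 4. As previously discussed, the GPTModel consists of embedding layers followed by 12 identical transformer blocks (only the last block is shown for brevity), followed by a final LayerNorm and the output layer, out_head.

Next, we replace the out_head with a new output layer (see figure 6.9) that we will fine-tune.

Fine-tuning selected layers vs. all layers
Since we start with a pretrained model, it is not necessary to fine-tune all model layers. In neural network-based language models, the lower layers generally capture basic language structures and semantics applicable across a wide range of tasks and datasets. So, fine-tuning only the last layers (i.e., layers near the output), which are more specific to nuanced linguistic patterns and task-specific features, is often sufficient to adapt the model to new tasks. A nice side effect is that it is computationally more efficient to fine-tune only a small number of layers. Interested readers can find more information, including experiments, on which layers to fine-tune in appendix B.

To get the model ready for classification fine-tuning, we first freeze the model, meaning that we make all layers nontrainable:

for param in model.parameters():
    param.requires_grad = False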

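Then, we replace the output layer (model.out_head), which originally maps the layer inputs to 50,257 dimensions, the size of the vocabulary (see figure 6.9).

Listing 6.7 Adding a classification layer

torch.manual_seed(123)

num_classes = 2
model.out_head = torch.nn.Linear(in_features=BASE_CONFIG["emb_dim"], out_features=num_classes)

To keep the code more general, we use BASE_CONFIG["emb_dim"], which is equal to 768 in the "gpt2-small (124M)" model. Thus, we can also use the same code to work with the larger GPT-2 model variants.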
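
To get a feel for how much smaller the new head is, the parameter counts of the two output layers can be worked out by hand. The sketch below is plain arithmetic based on the layer shapes printed above; it relies on torch.nn.Linear adding a bias vector by default, whereas the original out_head was created with bias=False.

```python
emb_dim = 768        # hidden size of "gpt2-small (124M)"
vocab_size = 50257   # size of the GPT-2 vocabulary
num_classes = 2      # "not spam" and "spam"

# Original out_head: Linear(768 -> 50257, bias=False), so weights only
original_head_params = emb_dim * vocab_size

# New out_head: torch.nn.Linear uses a bias by default, so weights plus one bias per class
new_head_params = emb_dim * num_classes + num_classes

print(f"Original head: {original_head_params:,} parameters")
print(f"New head:      {new_head_params:,} parameters")
```

The new classification head has only a tiny fraction of the parameters of the original output layer, which is one reason training it is cheap.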

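This new model.out_head output layer has its requires_grad attribute set to True by default, which means that it is the only layer in the model that will be updated during training. Technically, training the output layer we just added is sufficient. However, as I found in experiments, fine-tuning additional layers can noticeably improve the predictive performance of the model. (For more details, refer to appendix B.)

We also configure the last transformer block (model.trf_blocks[-1]) and the final LayerNorm module (model.final_norm), which connects this block to the output layer, to be trainable, as depicted in figure 6.10.

To make the final LayerNorm and last transformer block trainable, we set their respective requires_grad to True:

for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True

for param in model.final_norm.parameters():
    param.requires_grad = True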

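Exercise 6.2 Fine-tuning the whole model
Instead of fine-tuning just the final transformer block, fine-tune the entire model and assess the effect on predictive performance.

Instead of fine-tuning only the final transformer block, we can fine-tune the entire model by removing the following lines from the code:

for param in model.parameters():
    param.requires_grad = False

Figure 6.10 The GPT model includes 12 repeated transformer blocks. Alongside the output layer, we set the final LayerNorm and the last transformer block as trainable. The remaining 11 transformer blocks and the embedding layers are kept nontrainable.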
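
The freeze, replace, and unfreeze sequence can be checked on a small stand-in model by counting the parameters that remain trainable. The following is a minimal sketch, not the GPTModel from this series: the ModuleDict, its tiny layer sizes, and the use of Linear layers as "blocks" are all hypothetical and chosen only so that the example runs quickly.

```python
import torch

torch.manual_seed(123)

# Hypothetical stand-in for the GPT model: three "blocks", a final norm, and a head
emb_dim, vocab_size, num_classes = 8, 50, 2
model = torch.nn.ModuleDict({
    "trf_blocks": torch.nn.Sequential(
        *[torch.nn.Linear(emb_dim, emb_dim) for _ in range(3)]
    ),
    "final_norm": torch.nn.LayerNorm(emb_dim),
    "out_head": torch.nn.Linear(emb_dim, vocab_size, bias=False),
})

# Step 1: freeze every parameter
for param in model.parameters():
    param.requires_grad = False

# Step 2: replace the head; a freshly created layer is trainable by default
model["out_head"] = torch.nn.Linear(emb_dim, num_classes)

# Step 3: unfreeze the last block and the final norm
for param in model["trf_blocks"][-1].parameters():
    param.requires_grad = True
for param in model["final_norm"].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable} of {total}")
```

The same two sum expressions can be applied to the real model to see how many of its parameters are updated during fine-tuning.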

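Even though we added a new output layer and marked certain layers as trainable or nontrainable, we can still use this model similarly to how we have previously. For instance, we can feed it an example text identical to our previously used example text: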

inputs = tokenizer.encode("Do you have time")
inputs = torch.tensor(inputs).unsqueeze(0)
print("Inputs:", inputs)
print("Inputs dimensions:", inputs.shape) # shape: (batch_size, num_tokens)

The print output shows that the preceding code encodes the inputs into a tensor consisting of four input tokens:

Inputs: tensor([[5211,  345,  423,  640]])
Inputs dimensions: torch.Size([1, 4])

Then, we can pass the encoded token IDs to the model as usual:

with torch.no_grad():
    outputs = model(inputs)

print("Outputs:\n", outputs)
print("Outputs dimensions:", outputs.shape) # shape: (batch_size, num_tokens, num_classes)

The output tensor looks like the following:

Outputs:
 tensor([[[-1.5854,  0.9904],
          [-3.7235,  7.4548],
          [-2.2661,  6.6049],
          [-3.5983,  3.9902]]])
Outputs dimensions: torch.Size([1, 4, 2])

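Figure 6.11 The GPT model with a four-token example input and output. The output tensor consists of two columns due to the modified output layer. We are only interested in the last row, corresponding to the last token, when fine-tuning the model for spam classification.

A similar input would have previously produced an output tensor of shape [1, 4, 50257], where 50257 represents the vocabulary size. The number of output rows corresponds to the number of input tokens (in this case, four). However, each output's embedding dimension (the number of columns) is now 2 instead of 50,257 since we replaced the output layer of the model.

Remember that we are interested in fine-tuning this model to return a class label indicating whether a model input is "spam" or "not spam." We do not need to fine-tune all four output rows; instead, we can focus on a single output token. In particular, we will focus on the last row corresponding to the last output token, as shown in figure 6.11.

To extract the last output token from the output tensor, we use the following code: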

print("Last output token:", outputs[:, -1, :])

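This prints

Last output token: tensor([[-3.5983,  3.9902]])

We still need to convert the values into a class label prediction. But first, let's understand why we are particularly interested in the last output token only.

We have already explored the attention mechanism, which establishes a relationship between each input token and every other input token, and the concept of a causal attention mask, commonly used in GPT-like models (see chapter 3). This mask restricts a token's focus to its current position and those before it, ensuring that each token can only be influenced by itself and the preceding tokens, as illustrated in figure 6.12.

Figure 6.12 The causal attention mechanism, where the attention scores between input tokens are displayed in a matrix format. The empty cells indicate masked positions due to the causal attention mask, preventing tokens from attending to future tokens. The values in the cells represent attention scores; the last token, "time," is the only one that computes attention scores for all preceding tokens.

Given the causal attention mask setup in figure 6.12, the last token in a sequence accumulates the most information since it is the only token with access to data from all the previous tokens. Therefore, in our spam classification task, we focus on this last token during the fine-tuning process.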
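
The effect of the causal mask can be illustrated without any attention computation by listing which positions each token is allowed to see. This sketch uses the four-token example input and, as a simplification, treats each word as one token; the variable names are made up for the example.

```python
tokens = ["Do", "you", "have", "time"]
num_tokens = len(tokens)

# mask[i][j] is True when token i may attend to token j (j is at or before i)
mask = [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]

for token, row in zip(tokens, mask):
    visible = [tokens[j] for j, allowed in enumerate(row) if allowed]
    print(f"{token!r:>7} can attend to {visible}")

# Count how many positions each token can see
visible_counts = [sum(row) for row in mask]
```

Only the last row of the mask is fully enabled, which is why the output at the last position is the natural choice for a summary of the whole message.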

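We are now ready to transform the last token into class label predictions and calculate the model's initial prediction accuracy. Subsequently, we will fine-tune the model for the spam classification task.

Exercise 6.3 Fine-tuning the first vs. last token
Try fine-tuning the first output token. Notice the changes in predictive performance compared to fine-tuning the last output token.

Rather than fine-tuning the last output token, we can fine-tune the first output token by changing

model(input_batch)[:, -1, :]

to

model(input_batch)[:, 0, :]

everywhere in the code.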

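6.6 Calculating the classification loss and accuracy

Only one small task remains before we fine-tune the model: we must implement the model evaluation functions used during fine-tuning, as illustrated in figure 6.13.

Before implementing the evaluation utilities, let's briefly discuss how we convert the model outputs into class label predictions. We previously computed the token ID of the next token generated by the LLM by converting the 50,257 outputs into probabilities via the softmax function and then returning the position of the highest probability via the argmax function. We take the same approach here to calculate whether the model outputs a "spam" or "not spam" prediction for a given input, as shown in figure 6.14. The only difference is that we work with 2-dimensional instead of 50,257-dimensional outputs.

Figure 6.13 The three-stage process for classification fine-tuning the LLM. We have completed the first six steps. We are now ready to undertake the last step of stage 2: implementing the functions to evaluate the model's performance at classifying spam messages before, during, and after the fine-tuning.

Figure 6.14 The model outputs corresponding to the last token are converted into probability scores for each input text. The class labels are obtained by looking up the index position of the highest probability score. The model predicts the spam labels incorrectly because it has not yet been trained.

Let's consider the last token output using a concrete example: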

print("Last output token:\n", outputs[:, -1, :])

The values of the tensor corresponding to the last token are

Last output token:
 tensor([[-3.5983,  3.9902]])

We can obtain the class label as follows:

probas = torch.softmax(outputs[:, -1, :], dim=-1)
label = torch.argmax(probas)
print("Class label:", label.item())

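In this case, the code returns Class label: 1, meaning the model predicts that the input text is "spam." Using the softmax function here is optional because the largest outputs directly correspond to the highest probability scores. Hence, we can simplify the code without using softmax: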

logits = outputs[:, -1, :]
label = torch.argmax(logits)
print("Class label:", label.item())
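
Why the softmax step can be skipped is easy to verify by hand: softmax is monotonic, so it never changes which entry is largest. The sketch below recomputes the example in plain Python using the two last-token values printed above; the softmax helper is written out here only for illustration.

```python
import math

logits = [-3.5983, 3.9902]  # last-token outputs from the example above


def softmax(values):
    # Subtract the maximum for numerical stability before exponentiating
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probas = softmax(logits)
label_from_probas = probas.index(max(probas))
label_from_logits = logits.index(max(logits))
print(probas)
print(label_from_probas, label_from_logits)
```

Both lookups return the same index, so argmax on the raw logits is sufficient when only the predicted label is needed.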

This concept can be used to compute the classification accuracy, which measures the percentage of correct predictions across a dataset.

To determine the classification accuracy, we apply the argmax-based prediction code to all examples in the dataset and calculate the proportion of correct predictions by defining a calc_accuracy_loader function.

Listing 6.8 Calculating the classification accuracy

def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    model.eval()
    correct_predictions, num_examples = 0, 0

    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            input_batch, target_batch = input_batch.to(device), target_batch.to(device)

            with torch.no_grad():
                logits = model(input_batch)[:, -1, :]  # Logits of last output token
            predicted_labels = torch.argmax(logits, dim=-1)

            num_examples += predicted_labels.shape[0]
            correct_predictions += (predicted_labels == target_batch).sum().item()
        else:
            break
    return correct_predictions / num_examples

Let's use the function to determine the classification accuracies across various datasets, estimated from 10 batches for efficiency:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Note:
# Uncommenting the following lines will allow the code to run on Apple Silicon chips, if applicable,
# which is approximately 2x faster than on an Apple CPU (as measured on an M3 MacBook Air).
# As of this writing, in PyTorch 2.4, the results obtained via CPU and MPS were identical.
# However, in earlier versions of PyTorch, you may observe different results when using MPS.

#if torch.cuda.is_available():
#    device = torch.device("cuda")
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
#else:
#    device = torch.device("cpu")
#print(f"Running on {device} device.")

model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes

torch.manual_seed(123) # For reproducibility due to the shuffling in the training data loader

train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)
val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)
test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)

print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")

Via the device setting, the model automatically runs on a GPU if a GPU with Nvidia CUDA support is available and otherwise runs on a CPU. The output is

Training accuracy: 46.25%
Validation accuracy: 45.00%
Test accuracy: 48.75%

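As we can see, the prediction accuracies are near a random prediction, which would be 50% in this case. To improve the prediction accuracies, we need to fine-tune the model.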
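
It helps to translate these percentages back into counts. With 10 batches of 8 examples, each estimate rests on only 80 examples, so the small differences between the three numbers amount to a handful of messages. The sketch below does this conversion; it assumes all 10 evaluated batches are full, which holds here because each loader has more than 10 batches.

```python
batch_size = 8
num_batches = 10
num_examples = batch_size * num_batches  # 80 examples per estimate

reported = {"training": 0.4625, "validation": 0.4500, "test": 0.4875}

# Convert each reported accuracy back into a count of correct predictions
correct_counts = {
    name: round(accuracy * num_examples) for name, accuracy in reported.items()
}

for name, correct in correct_counts.items():
    print(f"{name}: {correct} of {num_examples} correct")
```

Each additional correct prediction shifts the estimate by 1.25 percentage points, so the spread between the three values should not be over-interpreted.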

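However, before we begin fine-tuning the model, we must define the loss function we will optimize during training. Our objective is to maximize the spam classification accuracy of the model, which means that the preceding code should output the correct class labels: 0 for non-spam and 1 for spam.

Because classification accuracy is not a differentiable function, we use cross-entropy loss as a proxy to maximize accuracy. Accordingly, the calc_loss_batch function remains the same, with one adjustment: we focus on optimizing only the last token, model(input_batch)[:, -1, :], rather than all tokens, model(input_batch).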

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)[:, -1, :]  # Logits of last output token
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss
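
To make the loss values more tangible, the cross-entropy of a single two-class prediction can be computed by hand as the negative log-probability of the true class. The sketch below is a plain-Python illustration for one example, not a replacement for torch.nn.functional.cross_entropy, which additionally averages over the batch; the helper name cross_entropy and the variable names are chosen for this example.

```python
import math


def cross_entropy(logits, target):
    # Negative log-probability of the target class, computed via log-sum-exp
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return log_sum_exp - logits[target]


logits = [-3.5983, 3.9902]  # last-token outputs from the earlier example

loss_if_spam = cross_entropy(logits, target=1)         # prediction agrees with the label
loss_if_not_spam = cross_entropy(logits, target=0)     # prediction disagrees with the label
loss_uninformed = cross_entropy([0.0, 0.0], target=0)  # equal scores for both classes

print(f"{loss_if_spam:.4f} {loss_if_not_spam:.4f} {loss_uninformed:.4f}")
```

A confident wrong prediction is penalized far more heavily than an uninformed one, whose loss is ln(2), roughly 0.693. This is how an untrained head can show a loss well above 0.693 even though its accuracy is close to 50%.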

We use the calc_loss_batch function to compute the loss for a single batch obtained from the previously defined data loaders. To calculate the loss for all batches in a data loader, we define the calc_loss_loader function as before.

Listing 6.9 Calculating the classification loss

# Same as in chapter 5
def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches

Similar to calculating the training accuracy, we now compute the initial loss for each dataset:

with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
    test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)

print(f"Training loss: {train_loss:.3f}")
print(f"Validation loss: {val_loss:.3f}")
print(f"Test loss: {test_loss:.3f}")

The initial loss values are

Training loss: 2.453
Validation loss: 2.583
Test loss: 2.322

Next, we will implement a training function to fine-tune the model, which means adjusting the model to minimize the training set loss. Minimizing the training set loss will help increase the classification accuracy, which is our overall goal.

6.7 Fine-tuning the model on supervised data

We must define and use the training function to fine-tune the pretrained LLM and improve its spam classification accuracy. The training loop, illustrated in figure 6.15, is the same overall training loop we used for pretraining; the only difference is that we calculate the classification accuracy instead of generating a sample text to evaluate the model.

Figure 6.15 A typical training loop for training deep neural networks in PyTorch consists of several steps, iterating over the batches in the training set for several epochs. In each loop, we calculate the loss for each training set batch to determine loss gradients, which we use to update the model weights so that the training set loss is minimized.

The training function implementing the concepts shown in figure 6.15 also closely mirrors the train_model_simple function used for pretraining the model. The only two distinctions are that we now track the number of training examples seen (examples_seen) instead of the number of tokens, and we calculate the accuracy after each epoch instead of printing a sample text.

Listing 6.10 Fine-tuning the model to classify spam

# Overall the same as `train_model_simple` in chapter 5
def train_classifier_simple(model, train_loader, val_loader, optimizer, device,
                            num_epochs, eval_freq, eval_iter):
    # Initialize lists to track losses and examples seen
    train_losses, val_losses, train_accs, val_accs = [], [], [], []
    examples_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            examples_seen += input_batch.shape[0] # New: track examples instead of tokens
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Calculate accuracy after each epoch
        train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=eval_iter)
        val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=eval_iter)
        print(f"Training accuracy: {train_accuracy*100:.2f}% | ", end="")
        print(f"Validation accuracy: {val_accuracy*100:.2f}%")
        train_accs.append(train_accuracy)
        val_accs.append(val_accuracy)

    return train_losses, val_losses, train_accs, val_accs, examples_seen

The evaluate_model function is identical to the one we used for pretraining:

# Same as chapter 5
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss
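The loop structure can be tried end to end without the GPT model. The sketch below is a stand-in, not the chapter's code: it trains a tiny linear classifier on synthetic two-class data and only mirrors the zero_grad, backward, step pattern and the examples_seen counter. The data, the toy model, and the learning rate are all illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(123)

# Synthetic, linearly separable two-class data (illustrative assumption)
features = torch.randn(64, 4)
labels = (features[:, 0] > 0).long()
loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=True)

toy_model = torch.nn.Linear(4, 2)  # stand-in for the GPT model with a 2-class head
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=0.1)


def mean_loss():
    # Average cross-entropy loss over the whole toy dataset
    toy_model.eval()
    with torch.no_grad():
        loss = torch.nn.functional.cross_entropy(toy_model(features), labels)
    toy_model.train()
    return loss.item()


loss_before = mean_loss()
examples_seen, global_step = 0, -1
for epoch in range(5):
    for input_batch, target_batch in loader:
        optimizer.zero_grad()                  # reset gradients from the previous batch
        loss = torch.nn.functional.cross_entropy(toy_model(input_batch), target_batch)
        loss.backward()                        # compute loss gradients
        optimizer.step()                       # update weights
        examples_seen += input_batch.shape[0]  # count examples, not tokens
        global_step += 1
loss_after = mean_loss()

print(f"examples seen: {examples_seen}, steps: {global_step + 1}")
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Because global_step starts at -1, the very first batch gets step 0, which is why the chapter's training function evaluates once right at the start of training.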

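Next, we initialize the optimizer, set the number of training epochs, and start the training with the train_classifier_simple function. Training takes about 6 minutes on an M3 MacBook Air laptop and less than half a minute on a V100 or A100 GPU: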

import time

start_time = time.time()

torch.manual_seed(123)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)

num_epochs = 5
train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=50, eval_iter=5,
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")

During training, we see the following output:

Ep 1 (Step 000000): Train loss 2.153, Val loss 2.392
Ep 1 (Step 000050): Train loss 0.617, Val loss 0.637
Ep 1 (Step 000100): Train loss 0.523, Val loss 0.557
Training accuracy: 70.00% | Validation accuracy: 72.50%
Ep 2 (Step 000150): Train loss 0.561, Val loss 0.489
Ep 2 (Step 000200): Train loss 0.419, Val loss 0.397
Ep 2 (Step 000250): Train loss 0.409, Val loss 0.353
Training accuracy: 82.50% | Validation accuracy: 85.00%
Ep 3 (Step 000300): Train loss 0.333, Val loss 0.320
Ep 3 (Step 000350): Train loss 0.340, Val loss 0.306
Training accuracy: 90.00% | Validation accuracy: 90.00%
Ep 4 (Step 000400): Train loss 0.136, Val loss 0.200
Ep 4 (Step 000450): Train loss 0.153, Val loss 0.132
Ep 4 (Step 000500): Train loss 0.222, Val loss 0.137
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 5 (Step 000550): Train loss 0.207, Val loss 0.143
Ep 5 (Step 000600): Train loss 0.083, Val loss 0.074
Training accuracy: 100.00% | Validation accuracy: 97.50%
Training completed in 0.93 minutes.
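The shape of this log can be reproduced with simple counting. The sketch below assumes 130 batches per epoch and a batch size of 8; neither value is stated in this section, but both are consistent with the steps printed above. It lists the steps at which the loss is evaluated in each epoch.

```python
# Assumed values (not stated in this section): 130 batches per epoch, batch size 8
batches_per_epoch = 130
batch_size = 8
num_epochs = 5
eval_freq = 50

eval_steps_per_epoch = []
global_step = -1
for epoch in range(num_epochs):
    steps_this_epoch = []
    for _ in range(batches_per_epoch):
        global_step += 1
        if global_step % eval_freq == 0:  # same condition as in the training function
            steps_this_epoch.append(global_step)
    eval_steps_per_epoch.append(steps_this_epoch)

total_steps = global_step + 1
examples_seen = total_steps * batch_size

for epoch, steps in enumerate(eval_steps_per_epoch, start=1):
    print(f"Ep {epoch}: evaluations at steps {steps}")
print(f"total steps: {total_steps}, examples seen: {examples_seen}")
```

Under these assumptions the loss is recorded 13 times, matching the 13 loss lines in the log, and epochs 3 and 5 contain only two evaluations each because no third multiple of 50 falls inside them.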

We then use Matplotlib to plot the loss function for the training and validation sets.

Listing 6.11 Plotting the classification loss

import matplotlib.pyplot as plt

def plot_values(epochs_seen, examples_seen, train_values, val_values, label="loss"):
    fig, ax1 = plt.subplots(figsize=(5, 3))

    # Plot training and validation loss against epochs
    ax1.plot(epochs_seen, train_values, label=f"Training {label}")
    ax1.plot(epochs_seen, val_values, linestyle="-.", label=f"Validation {label}")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel(label.capitalize())
    ax1.legend()

    # Create a second x-axis for examples seen
    ax2 = ax1.twiny()  # Create a second x-axis that shares the same y-axis
    ax2.plot(examples_seen, train_values, alpha=0)  # Invisible plot for aligning ticks
    ax2.set_xlabel("Examples seen")

    fig.tight_layout()  # Adjust layout to make room
    plt.savefig(f"{label}-plot.pdf")
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))

plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses)

Figure 6.16 plots the resulting loss curves.

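Figure 6.16 The model's training and validation loss over the five training epochs. The training loss is drawn as a solid line and the validation loss as a dashed line. Both drop sharply in the first epoch and gradually level off around the fifth epoch. This pattern indicates good learning progress and suggests that the model learned from the training data while generalizing well to the unseen validation data.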

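As the sharp downward slope in figure 6.16 shows, the model learns well from the training data, and there is little to no sign of overfitting; that is, there is no noticeable gap between the training and validation set losses.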

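Choosing the number of epochs
Earlier, when we started the training, we set the number of epochs to five. The number of epochs depends on the dataset and the difficulty of the task, and there is no universal solution or recommendation, although five epochs is usually a good starting point. If a loss plot such as figure 6.16 shows the model overfitting after the first few epochs, you may need to reduce the number of epochs. Conversely, if the trend line suggests that the validation loss could improve with further training, you should increase the number of epochs. In this concrete case, five epochs is a reasonable choice because there are no signs of early overfitting and the validation loss is close to 0.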
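One way to make this judgment less subjective is to look at the recorded validation losses directly. The helper below is a hypothetical sketch, not part of the chapter's code: it reports the evaluation with the lowest validation loss and flags whether the validation loss has risen noticeably since then. The tolerance value is an arbitrary assumption.

```python
def summarize_val_losses(val_losses, tolerance=0.05):
    """Return (index of best evaluation, whether the final loss rose past tolerance)."""
    best_idx = min(range(len(val_losses)), key=lambda i: val_losses[i])
    # Overfitting hint: the final loss is clearly above the best loss seen so far
    rising = val_losses[-1] - val_losses[best_idx] > tolerance
    return best_idx, rising


# Validation losses copied from the training log above
logged_val_losses = [2.392, 0.637, 0.557, 0.489, 0.397, 0.353, 0.320,
                     0.306, 0.200, 0.132, 0.137, 0.143, 0.074]

best_idx, rising = summarize_val_losses(logged_val_losses)
print(best_idx, rising)
```

For the logged run, the lowest validation loss is the last recorded one and nothing is flagged, which agrees with the reading that five epochs were not too many.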

Next, we use the same plot_values function to plot the classification accuracies:

epochs_tensor = torch.linspace(0, num_epochs, len(train_accs))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_accs))

plot_values(epochs_tensor, examples_seen_tensor, train_accs, val_accs, label="accuracy")

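Figure 6.17 graphs the resulting accuracy. The model reaches a relatively high training and validation accuracy after epochs 4 and 5. Importantly, we previously set eval_iter=5 when calling the train_classifier_simple function, which means that our estimates of the training and validation performance are based on only five batches, for efficiency during training.

Figure 6.17 Both the training accuracy (solid line) and the validation accuracy (dashed line) increase substantially in the early epochs and then plateau, reaching almost perfect accuracy scores of 1.0. The two lines stay close together throughout the epochs, which suggests that the model does not overfit the training data much.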
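Because the estimates during training use only five batches, they are coarse. The sketch below uses made-up predictions and assumes a batch size of 8 (set earlier in the chapter's code, not in this section) to compare a five-batch estimate with the accuracy over all batches. The helper accuracy_on_batches is hypothetical and only mimics the role of the num_batches argument.

```python
def accuracy_on_batches(batches, num_batches=None):
    """Fraction of correct predictions over the first num_batches batches.

    Each batch is a list of (predicted_label, true_label) pairs.
    """
    if num_batches is not None:
        batches = batches[:num_batches]
    correct = sum(pred == true for batch in batches for pred, true in batch)
    total = sum(len(batch) for batch in batches)
    return correct / total


# Made-up predictions: 10 batches of 8 examples (batch size 8 is an assumption)
batches = [[(1, 1)] * 8 for _ in range(10)]
batches[0][0] = (0, 1)  # one mistake inside the first five batches
batches[7][3] = (1, 0)  # one mistake beyond the first five batches
batches[8][5] = (0, 1)  # another mistake beyond the first five batches

estimate = accuracy_on_batches(batches, num_batches=5)  # 39 of 40 correct
full = accuracy_on_batches(batches)                     # 77 of 80 correct
print(f"5-batch estimate: {estimate*100:.2f}% | all batches: {full*100:.2f}%")
```

With 40 examples, each prediction moves the estimate by 2.5 percentage points, which would explain values such as 72.50% and 97.50% in the training log if the batch size is indeed 8.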

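Now we must calculate the performance metrics for the training, validation, and test sets across the entire dataset by running the following code, this time without defining the eval_iter value: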

train_accuracy = calc_accuracy_loader(train_loader, model, device)
val_accuracy = calc_accuracy_loader(val_loader, model, device)
test_accuracy = calc_accuracy_loader(test_loader, model, device)

print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")

The resulting accuracy values are:

Training accuracy: 97.21%
Validation accuracy: 97.32%
Test accuracy: 95.67%

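The training and test set performances are almost identical. The slight gap between the training and test set accuracies suggests minimal overfitting of the training data. Typically, the validation set accuracy is somewhat higher than the test set accuracy because model development often involves tuning hyperparameters to perform well on the validation set, and those choices may not generalize as effectively to the test set. This situation is common, but the gap can potentially be reduced by adjusting the model's settings, such as increasing the dropout rate (drop_rate) or the weight_decay parameter in the optimizer configuration.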

6.8 Using the LLM as a spam classifier

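Having fine-tuned and evaluated the model, we are now ready to classify spam messages (see figure 6.18). Let's use our fine-tuned GPT-based spam classification model. The following classify_review function follows data preprocessing steps similar to those in the SpamDataset implemented earlier. Then, after processing the text into token IDs, the function uses the model to predict an integer class label, similar to what we implemented in section 6.6, and returns the corresponding class name.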

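Figure 6.18 The three-stage process for classification fine-tuning of the LLM. Step 10 is the final step of stage 3: using the fine-tuned model to classify new spam messages.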

Listing 6.12 Using the model to classify new texts

def classify_review(text, model, tokenizer, device, max_length=None, pad_token_id=50256):
    model.eval()

    # Fail early with a clear message; min(None, ...) below would raise a TypeError
    if max_length is None:
        raise ValueError("max_length must be specified, e.g., max_length=train_dataset.max_length")

    # Prepare inputs to the model
    input_ids = tokenizer.encode(text)
    supported_context_length = model.pos_emb.weight.shape[0]
    # Note: In the book, this was originally written as pos_emb.weight.shape[1] by mistake
    # It didn't break the code but would have caused unnecessary truncation (to 768 instead of 1024)

    # Truncate sequences if they are too long
    input_ids = input_ids[:min(max_length, supported_context_length)]

    # Pad sequences to the longest sequence
    input_ids += [pad_token_id] * (max_length - len(input_ids))
    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)  # add batch dimension

    # Model inference
    with torch.no_grad():
        logits = model(input_tensor)[:, -1, :]  # Logits of the last output token
    predicted_label = torch.argmax(logits, dim=-1).item()

    # Return the classified result
    return "spam" if predicted_label == 1 else "not spam"
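The truncation and padding logic can be checked in isolation, without a model or tokenizer. The helper name prepare_ids below is hypothetical and the token IDs are made up; only the pad token ID 50256 comes from the listing.

```python
def prepare_ids(input_ids, max_length, context_length, pad_token_id=50256):
    """Truncate to the allowed length, then right-pad to max_length."""
    limit = min(max_length, context_length)
    ids = list(input_ids[:limit])                    # truncate if too long
    ids += [pad_token_id] * (max_length - len(ids))  # pad if too short
    return ids


short = prepare_ids([11, 22, 33], max_length=5, context_length=1024)
long = prepare_ids(list(range(10)), max_length=5, context_length=1024)
print(short)
print(long)
```

A short input is filled up with the pad token on the right, and a long input is cut to max_length, so every input reaches the model with the same length that was used during training.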

Let's try this classify_review function on an example text:

text_1 = (
    "You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award."
)

print(classify_review(
    text_1, model, tokenizer, device, max_length=train_dataset.max_length
))

The resulting model correctly predicts "spam". Let's try another example:

text_2 = (
    "Hey, just wanted to check if we're still on"
    " for dinner tonight? Let me know!"
)

print(classify_review(
    text_2, model, tokenizer, device, max_length=train_dataset.max_length
))

The model again makes a correct prediction and returns the label "not spam".

Finally, let's save the model in case we want to reuse it later without having to train it again. We can use the torch.save method:

torch.save(model.state_dict(), "review_classifier.pth")

Once saved, the model can be loaded:

model_state_dict = torch.load("review_classifier.pth", map_location=device, weights_only=True)
model.load_state_dict(model_state_dict)
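The same save-and-restore round trip can be tried on a small stand-in module. The sketch below uses a torch.nn.Linear layer and a temporary file, both illustrative choices, and checks that the restored weights equal the saved ones.

```python
import os
import tempfile

import torch

torch.manual_seed(123)
original = torch.nn.Linear(4, 2)  # stand-in for the fine-tuned model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_classifier.pth")
    torch.save(original.state_dict(), path)  # save only the weights
    state_dict = torch.load(path, map_location="cpu", weights_only=True)

restored = torch.nn.Linear(4, 2)  # must have the same architecture
restored.load_state_dict(state_dict)

weights_match = all(
    torch.equal(p, q)
    for p, q in zip(original.state_dict().values(), restored.state_dict().values())
)
print(weights_match)
```

Note that load_state_dict expects a model with the same architecture, so in a fresh session the classification head has to be in place before the saved weights are loaded.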

Summary

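There are different strategies for fine-tuning LLMs, including classification fine-tuning and instruction fine-tuning.
Classification fine-tuning involves replacing the output layer of an LLM with a small classification layer.
In the case of classifying text messages as "spam" or "not spam," the new classification layer consists of only two output nodes. Previously, the number of output nodes equaled the number of unique tokens in the vocabulary (that is, 50,257).
Instead of predicting the next token in the text as in pretraining, classification fine-tuning trains the model to output a correct class label, for example, "spam" or "not spam."
The model input for fine-tuning is text converted into token IDs, similar to pretraining.
Before fine-tuning an LLM, we load the pretrained model as a base model.
Evaluating a classification model involves calculating the classification accuracy (the fraction or percentage of correct predictions).
Fine-tuning a classification model uses the same cross entropy loss function as when pretraining the LLM.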
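The last two points can be made concrete with a small calculation. The sketch below uses plain Python rather than PyTorch and made-up logits for a two-node output layer; it computes the cross entropy loss and the accuracy for three examples.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


# Made-up last-token logits: [score for "not spam", score for "spam"]
batch_logits = [[2.0, -1.0], [0.5, 1.5], [1.0, 0.0]]
batch_targets = [0, 1, 1]  # 0 = not spam, 1 = spam

losses = [cross_entropy(l, t) for l, t in zip(batch_logits, batch_targets)]
predictions = [max(range(len(l)), key=lambda i: l[i]) for l in batch_logits]
accuracy = sum(p == t for p, t in zip(predictions, batch_targets)) / len(batch_targets)

print(f"mean loss: {sum(losses) / len(losses):.3f}, accuracy: {accuracy:.2f}")
```

The third example is misclassified, so it contributes the largest loss and lowers the accuracy to two out of three; the loss is what training minimizes, while the accuracy is what we report.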