
Introduction to Natural Language Processing (NLP) -- Part 2: Preprocessing Text Data

news 2025/12/15 10:15:45

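In natural language processing (NLP), data quality directly affects model performance. The goal of text preprocessing is to clean and standardize text so that it is suitable for machine learning or deep learning models. This section introduces several common text preprocessing methods, each illustrated with Python code.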

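2.1 Text Cleaning

Raw text often contains noise such as HTML tags, special characters, extra whitespace, and digits. Removing this noise can improve model accuracy.

Common cleaning steps

Remove HTML tags
Remove special characters (such as @#%$&)
Remove digits
Normalize case (usually to lowercase)
Collapse extra whitespace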

Python example

import re  # regular expression library, used for matching and replacing text

text = "Hello, <b>world</b>! Visit us at https://example.com or call 123-456-7890."

# 1. Remove HTML tags
text = re.sub(r'<.*?>', '', text)

# 2. Remove special characters and digits (keep only letters and whitespace)
text = re.sub(r'[^a-zA-Z\s]', '', text)

# 3. Convert to lowercase
text = text.lower()

# 4. Collapse extra whitespace
text = " ".join(text.split())

print(text)

Output:

hello world visit us at httpsexamplecom or call
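
The output above contains the fused token httpsexamplecom, because the URL's punctuation is stripped while its letters are kept. One possible way to avoid this is to remove URLs before stripping other characters. The following is a minimal sketch; the function name clean_text and the URL pattern are assumptions made for illustration and are not part of the original example.

```python
import re

# Illustrative pattern (an assumption): an http(s) URL runs until the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, and non-letter characters, then lowercase."""
    text = re.sub(r"<.*?>", "", text)        # 1. remove HTML tags
    text = URL_PATTERN.sub(" ", text)        # 2. remove URLs before punctuation is stripped
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # 3. keep only letters and whitespace
    text = text.lower()                      # 4. lowercase
    return " ".join(text.split())            # 5. collapse extra whitespace


raw = "Hello, <b>world</b>! Visit us at https://example.com or call 123-456-7890."
print(clean_text(raw))  # hello world visit us at or call
```

Whether URLs and phone numbers should be dropped or replaced by a placeholder depends on the downstream task; this sketch simply drops them.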

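2.2 Tokenization

Tokenization splits text into individual words or subwords and is the foundation of most NLP tasks.

Common tokenization methods

Whitespace splitting (workable for English)
NLTK tokenization (more accurate)
spaCy tokenization (efficient on large-scale data)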

Python example

import nltk  # NLP library providing tokenization, POS tagging, stopwords, and more
from nltk.tokenize import word_tokenize
import spacy  # modern NLP library with optimized tokenization, POS tagging, and more

nltk.download('punkt_tab')  # Punkt tokenizer data required by word_tokenize

text = "Hello world! This is an NLP tutorial."

# 1. Basic whitespace splitting
tokens_space = text.split()
print("Whitespace split:", tokens_space)

# 2. Tokenize with NLTK
tokens_nltk = word_tokenize(text)
print("NLTK tokens:", tokens_nltk)

# 3. Tokenize with spaCy
# The model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")  # load the small pretrained English pipeline
doc = nlp(text)
tokens_spacy = [token.text for token in doc]
print("spaCy tokens:", tokens_spacy)

Output:

Whitespace split: ['Hello', 'world!', 'This', 'is', 'an', 'NLP', 'tutorial.']
NLTK tokens: ['Hello', 'world', '!', 'This', 'is', 'an', 'NLP', 'tutorial', '.']
spaCy tokens: ['Hello', 'world', '!', 'This', 'is', 'an', 'NLP', 'tutorial', '.']

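Notes:

Whitespace splitting is simple but error-prone: punctuation stays attached to words, as in 'world!' and 'tutorial.'.
NLTK and spaCy are more accurate here because they split punctuation into separate tokens.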
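
To show what separating punctuation involves, here is a minimal regex-based sketch that needs only the standard library. The function name regex_tokenize is hypothetical, and this is not how NLTK or spaCy work internally; both apply many more rules (contractions, abbreviations, URLs, and so on).

```python
import re


def regex_tokenize(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks."""
    # \w+     matches one or more letters, digits, or underscores
    # [^\w\s] matches a single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


tokens = regex_tokenize("Hello world! This is an NLP tutorial.")
print(tokens)
# ['Hello', 'world', '!', 'This', 'is', 'an', 'NLP', 'tutorial', '.']
```

On this sentence the result matches the NLTK and spaCy output shown above, but the simple pattern breaks down quickly: a contraction such as don't is split into three tokens (don, ', t), which is one reason to prefer a dedicated tokenizer for real data.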

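2.3 Stemming and Lemmatization

In NLP tasks, different forms of a word often carry the same core meaning, for example:

running and run
better and good

Stemming and lemmatization normalize such forms, which can improve a model's ability to generalize.

Stemming

Stemming is a rule-based normalization method that crudely strips suffixes from words.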

from nltk.stem import PorterStemmer  # stemming tool

stemmer = PorterStemmer()  # the Porter stemmer is a widely used stemming algorithm
words = ["running", "flies", "easily", "studies"]

stemmed_words = [stemmer.stem(word) for word in words]
print("Porter Stemmer:", stemmed_words)

Output:

Porter Stemmer: ['run', 'fli', 'easili', 'studi']

Drawbacks:

flies becomes fli
easily becomes easili
The resulting stems are not always real words, so meaning can be lost
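
To make the "rule-based suffix stripping" idea concrete, here is a deliberately tiny toy stemmer. It is not the Porter algorithm (which applies several ordered rule phases with conditions on the stem); the suffix list, the minimum stem length, and the name toy_stem are all assumptions chosen for illustration.

```python
# Hypothetical suffix rules, tried in order; the first one that applies wins.
SUFFIXES = ("ing", "ies", "ly", "es", "s")
MIN_STEM_LEN = 3  # refuse to strip if the remaining stem would be shorter than this


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LEN:
            return stem
    return word


words = ["running", "flies", "easily", "studies"]
print([toy_stem(w) for w in words])
# ['runn', 'fli', 'easi', 'stud']
```

Like a real stemmer, this toy version knows nothing about vocabulary, so it readily produces non-words such as runn and stud. That is the trade-off of stemming: it is fast and needs no dictionary, but the output is only a rough key for grouping word forms.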

Lemmatization

Lemmatization looks each word up in a dictionary and converts it to its base form (lemma), which is more accurate.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # download the WordNet corpus

lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "easily", "studies", "better"]

# pos="v" looks every word up as a verb
lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in words]
print("Lemmatization:", lemmatized_words)

Output:

Lemmatization: ['run', 'fly', 'easily', 'study', 'better']

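Advantages:

flies is correctly reduced to fly
studies is correctly reduced to study
better is left unchanged here because every word is looked up as a verb (pos="v"); looked up as an adjective (pos="a"), WordNet maps it to good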
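
The role of the part-of-speech argument can be illustrated with a toy dictionary lookup. This sketch does not use WordNet; the table LEMMA_TABLE and the function toy_lemmatize are hypothetical, and the single-letter tags n, v, a, r simply mirror the convention used by the pos parameter above (noun, verb, adjective, adverb).

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("flies", "v"): "fly",
    ("flies", "n"): "fly",
    ("mice", "n"): "mouse",
    ("better", "a"): "good",
    ("better", "r"): "well",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("better", pos="v"))  # better (no verb entry, so unchanged)
print(toy_lemmatize("mice"))             # mouse
```

The same surface form can have different lemmas depending on its part of speech, which is why a lemmatizer generally gives better results when each word is tagged with its actual part of speech instead of one fixed value for the whole list.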

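2.4 Stopword Removal

Stopwords are high-frequency words that carry little information for many text-processing tasks, such as is, the, and and. Removing them reduces the amount of data the model has to process.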

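Python example

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords  # stopword lists provided by NLTK

nltk.download('stopwords')  # download the stopword lists
nltk.download('punkt_tab')  # tokenizer data required by word_tokenize

text = "This is a simple NLP example demonstrating stopwords removal."
words = word_tokenize(text)

# Build the set once so that each membership test is fast
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("After stopword removal:", filtered_words)

Output:

After stopword removal: ['simple', 'NLP', 'example', 'demonstrating', 'stopwords', 'removal', '.']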

Notes:

This, is, and a are removed
Keywords such as NLP are kept
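
In the output above the punctuation token '.' survives, because a stopword list contains only words. If punctuation should be dropped as well, one option is to keep only alphabetic tokens. The sketch below uses a small hand-written stopword set as a stand-in (an assumption; NLTK's English list is much longer), and the name remove_stopwords is hypothetical.

```python
# Small illustrative stopword set (an assumption, not NLTK's list).
STOPWORDS = {"this", "is", "a", "an", "the", "and", "of", "to", "in"}


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Keep alphabetic tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.isalpha() and t.lower() not in STOPWORDS]


tokens = ["This", "is", "a", "simple", "NLP", "example",
          "demonstrating", "stopwords", "removal", "."]
print(remove_stopwords(tokens))
# ['simple', 'NLP', 'example', 'demonstrating', 'stopwords', 'removal']
```

Note that str.isalpha() also discards tokens containing digits or hyphens, so a looser test may be preferable when such tokens matter.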

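2.5 Key Points and Pitfalls

Tokenization methods: whitespace splitting, NLTK, and spaCy each suit different scenarios.
Stemming vs. lemmatization: stemming can produce incorrect forms, while lemmatization is more accurate but needs part-of-speech information.
Stopword handling: some NLP tasks (such as sentiment analysis) may need to keep stopwords.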
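
The sentiment analysis caveat can be sketched as follows: general-purpose stopword lists often include negation words such as not and no, and removing them can flip the apparent meaning of a sentence. The sets and the function name below are hypothetical and chosen only for illustration.

```python
# Illustrative general-purpose stopword set (an assumption).
STOPWORDS = {"this", "is", "a", "the", "and", "not", "no"}
# Words that carry sentiment-relevant meaning and should survive filtering.
NEGATIONS = {"not", "no", "never"}
# Task-specific list: the general list minus the negation words.
SENTIMENT_STOPWORDS = STOPWORDS - NEGATIONS


def filter_tokens(tokens: list[str], stopwords: set[str]) -> list[str]:
    """Drop tokens whose lowercase form appears in the given stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "this movie is not good".split()
print(filter_tokens(tokens, STOPWORDS))            # ['movie', 'good']
print(filter_tokens(tokens, SENTIMENT_STOPWORDS))  # ['movie', 'not', 'good']
```

With the general list, the negative review is reduced to movie good; keeping the negation preserves the signal a sentiment model needs.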

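2.6 Exercises

Exercise 1: Text cleaning

Clean the following text by removing HTML tags, special characters, and digits, then convert it to lowercase:

text = "Visit our <b>website</b>: https://example.com!!! Call us at 987-654-3210."

Exercise 2: Tokenization with spaCy

Use spaCy to tokenize the following text:

text = "Natural Language Processing is fun and useful!"

Exercise 3: Lemmatization

Apply lemmatization to the following words:

words = ["running", "mice", "better", "studying"]

Exercise 4: Stopword removal

Remove the stopwords from the following text:

text = "This is an example sentence demonstrating stopwords removal."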
