23 Examples of Using Python for Natural Language Processing

news 2025/2/19 6:48:03

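1. Text cleaning

Text cleaning is usually the first step of an NLP project. It removes unwanted information such as punctuation, digits, and special characters.

Code example:

import re

def clean_text(text):
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Collapse the gaps left behind by the removed characters
    text = re.sub(r'\s+', ' ', text).strip()
    # Convert all letters to lowercase
    return text.lower()

# Sample text
text = "Hello, World! This is an example text with numbers 123 and symbols #@$."
cleaned_text = clean_text(text)
print(cleaned_text)  # Output: hello world this is an example text with numbers and symbols

Explanation:

re.sub() removes punctuation and digits, then collapses the leftover runs of whitespace.
lower() converts all letters to lowercase.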

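2. Tokenization

Tokenization splits text into individual tokens (words and punctuation marks). It is the basis for later steps such as word frequency counting and sentiment analysis.

Code example:

from nltk.tokenize import word_tokenize

# Requires the NLTK tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases ask for 'punkt_tab' instead)

# Sample text
text = "Hello, World! This is an example text."

# Tokenize
tokens = word_tokenize(text)
print(tokens)  # Output: ['Hello', ',', 'World', '!', 'This', 'is', 'an', 'example', 'text', '.']

Explanation:

word_tokenize() from the nltk library splits the text into tokens; punctuation marks become tokens of their own.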

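3. Stop word removal

Stop words are words that occur frequently but contribute little to meaning, such as "the" and "is".

Code example:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the stop word lists: nltk.download('stopwords')

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Tokenize
tokens = word_tokenize(text)

# Remove stop words (compare in lowercase, because the list is lowercase)
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)  # Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']

Explanation:

nltk.corpus.stopwords provides the English stop word list; note that it contains "over" as well as "the".
A list comprehension filters out the stop words. Punctuation is not a stop word, so the final "." stays unless it is filtered separately (for example with token.isalpha()).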

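4. Stemming

Stemming reduces a word to its stem by stripping suffixes with fixed rules, which shrinks the vocabulary. The stem is not always a real word ("loudly" becomes "loudli" below).

Code example:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Sample text
text = "running dogs are barking loudly."

# Tokenize
tokens = word_tokenize(text)

# Stem each token
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)  # Output: ['run', 'dog', 'are', 'bark', 'loudli', '.']

Explanation:

PorterStemmer applies the Porter stemming rules to each token.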

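5. Lemmatization

Lemmatization is similar to stemming, but it looks words up in a dictionary (WordNet) to find their base form, so the result is always a real word.

Code example:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires the WordNet corpus: nltk.download('wordnet')

# Sample text
text = "running dogs are barking loudly."

# Tokenize
tokens = word_tokenize(text)

# Lemmatize each token (the default part of speech is noun)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmatized_tokens)  # Output: ['running', 'dog', 'are', 'barking', 'loudly', '.']

Explanation:

WordNetLemmatizer performs the lemmatization.
Because lemmatize() treats every token as a noun unless a pos argument is given, "running" and "barking" are left unchanged here.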

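6. Word frequency counting

Word frequency counts show which words are most common in a text.

Code example:

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import matplotlib.pyplot as plt

# Sample text
text = "This is a sample text. This text contains some words that are repeated several times."

# Tokenize
tokens = word_tokenize(text)

# Count token frequencies
fdist = FreqDist(tokens)

# Plot the 10 most frequent tokens
plt.figure(figsize=(10, 5))
fdist.plot(10)
plt.show()

Explanation:

FreqDist counts how often each token occurs (punctuation included, and case-sensitively).
matplotlib draws the frequency plot.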
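
When a plot is not needed, the same kind of count can be produced with the standard library alone. The sketch below is an alternative under stated assumptions, not the method used above: it tokenizes with a simple regular expression instead of word_tokenize, and it lowercases the text so that "This" and "this" are counted together. The helper name top_words is hypothetical.

```python
import re
from collections import Counter

def top_words(text, n=3):
    """Return the n most common lowercase word tokens with their counts."""
    # Assumption: a word is a run of ASCII letters or apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(n)

text = "This is a sample text. This text contains some words that are repeated several times."
print(top_words(text, 2))  # [('this', 2), ('text', 2)]
```

Counter.most_common() keeps first-seen order for ties, which is why "this" is listed before "text".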

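7. Sentiment analysis

Sentiment analysis determines whether a text is positive, negative, or neutral.

Code example:

from nltk.sentiment import SentimentIntensityAnalyzer

# Requires the VADER lexicon: nltk.download('vader_lexicon')

# Sample text
text = "I love this movie. It's amazing!"

# Score the text
sia = SentimentIntensityAnalyzer()
sentiment_scores = sia.polarity_scores(text)
print(sentiment_scores)  # Example output (values may vary by version): {'neg': 0.0, 'neu': 0.429, 'pos': 0.571, 'compound': 0.8159}

Explanation:

SentimentIntensityAnalyzer (the VADER model) returns the negative, neutral, and positive proportions plus a compound score between -1 and 1.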
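
The scores above still have to be turned into a positive, negative, or neutral label. A minimal sketch, assuming the cut-offs of +0.05 and -0.05 on the compound score that the VADER authors suggest; the function name label_from_compound and the sample score dictionary are made up for illustration.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical score dictionary in the shape returned by polarity_scores()
scores = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.82}
print(label_from_compound(scores["compound"]))  # positive
```

The threshold is a parameter because the right cut-off depends on the data; a wider band sends more borderline texts to "neutral".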

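8. Word vectors

Word vectorization represents each word as a numeric vector that a program can compute with.

Code example:

import re
import gensim.downloader as api

# Load pretrained GloVe vectors (25 dimensions, trained on Twitter data)
model = api.load("glove-twitter-25")

# Sample text
text = "This is a sample sentence."

# Tokenize: lowercase and keep letters only, because the vocabulary is lowercase
tokens = re.findall(r"[a-z]+", text.lower())

# Look up the vector of every token that is in the vocabulary
vectorized_tokens = [model[token] for token in tokens if token in model.key_to_index]
print(vectorized_tokens)

Explanation:

gensim downloads and loads the pretrained GloVe vectors as a KeyedVectors object.
Each in-vocabulary word is converted to its 25-dimensional vector; unknown words are skipped.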
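
Once words are vectors, their relatedness is usually measured with cosine similarity. The sketch below shows the computation in plain Python on made-up 3-dimensional vectors; the values in toy_vectors are invented for illustration and are not taken from the GloVe model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; return 0 by convention
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only
toy_vectors = {
    "king": [0.9, 0.1, 0.4],
    "queen": [0.8, 0.2, 0.5],
    "apple": [0.1, 0.9, 0.0],
}

print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))
print(cosine_similarity(toy_vectors["king"], toy_vectors["apple"]))
```

With a loaded gensim model the same measure is available directly as model.similarity("king", "queen").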

9. Topic modeling

Topic modeling identifies the topics that run through a collection of documents.

Code example:

from gensim import corpora, models

# Sample documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Tokenize (lowercase, split on whitespace)
texts = [document.lower().split() for document in documents]

# Build the dictionary (word <-> id mapping)
dictionary = corpora.Dictionary(texts)

# Convert each document to a bag-of-words vector
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the LDA model
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Print the topics
for topic in lda.print_topics(num_topics=2, num_words=5):
    print(topic)

Explanation:

gensim builds the dictionary and the bag-of-words corpus.
The LDA model identifies two topics and prints the five highest-weighted words of each.

10. Text classification

Text classification assigns a text to one of a set of predefined categories.

Code example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

# Vectorize the documents as bag-of-words counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train the model
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predict
y_pred = classifier.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Explanation:

sklearn's CountVectorizer turns each document into a vector of word counts.
A multinomial naive Bayes classifier is trained on those vectors and used for prediction.
With only nine documents the test set holds two samples, so the accuracy figure is illustrative only.
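
To make the vectorization step concrete, the sketch below builds a bag-of-words matrix by hand. It is a simplified stand-in for CountVectorizer, not its implementation: it assumes tokens are runs of at least two lowercase letters or digits (similar to CountVectorizer's default of dropping one-character tokens), and the helper name bag_of_words is hypothetical.

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in documents[i]."""
    # Assumption: a token is a run of 2+ lowercase letters or digits
    tokenized = [re.findall(r"[a-z0-9]{2,}", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["Graph minors A survey", "A survey of user opinion"]
vocab, X = bag_of_words(docs)
print(vocab)  # ['graph', 'minors', 'of', 'opinion', 'survey', 'user']
print(X)      # [[1, 1, 0, 0, 1, 0], [0, 0, 1, 1, 1, 1]]
```

Each row is one document and each column one vocabulary word, which is the shape of input a naive Bayes classifier is trained on.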

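11. Named entity recognition (NER)

Named entity recognition finds specific entities in text, such as names of people, organizations, and places.

Code example:

import spacy

# Load the pretrained model (install it first: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Apple is looking at buying U.K. startup for $1 billion."

# Process the text
doc = nlp(text)

# Extract the entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Output:
# Apple ORG
# U.K. GPE
# $1 billion MONEY

Explanation:

spacy runs its pretrained pipeline over the text.
doc.ents holds the recognized entities; each has its text and a type label.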

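12. Machine translation

Machine translation converts text from one language into another.

Code example:

from googletrans import Translator

# Create the translator object
translator = Translator()

# Sample text
text = "Hello, how are you?"

# Translate from English to French
# (some recent googletrans releases expose translate() as a coroutine, in which case it must be awaited)
translated_text = translator.translate(text, src='en', dest='fr')
print(translated_text.text)  # Example output: Bonjour, comment ça va ?

Explanation:

googletrans, an unofficial client for Google Translate, performs the translation.
The English text is translated into French.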

13. Text summarization

Text summarization produces a shorter version of a text that keeps its main information.

Code example:

from transformers import pipeline

# Create the summarizer (downloads a default summarization model on first use)
summarizer = pipeline("summarization")

# Sample text
text = """
Natural language processing (NLP) is a subfield of linguistics, computer science,
and artificial intelligence concerned with the interactions between computers and
human (natural) languages. As such, NLP is related to the area of human–computer interaction.
Many challenges in NLP involve natural language understanding, that is, enabling computers
to derive meaning from human or natural language input, and others involve natural language
generation.
"""

# Generate the summary
summary = summarizer(text, max_length=100, min_length=30, do_sample=False)
print(summary[0]['summary_text'])

14. Word cloud generation

A word cloud is a visualization that shows the most frequent words of a text at a glance.

Code example:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Sample text
text = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human (natural) languages."

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Explanation:

The wordcloud library generates the word cloud.
width, height, and background_color set the size and background of the image.
matplotlib displays the resulting image.

15. Question answering

A question answering system answers questions posed by the user. The model below is extractive: it selects the answer as a span of the supplied context.

Code example:

from transformers import pipeline

# Create the question answering model
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

# Sample question and context
context = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human (natural) languages."
question = "What is NLP?"

# Find the answer
answer = qa_pipeline(question=question, context=context)
print(answer['answer'])  # Prints the span of the context that the model selects as the answer

Explanation:

transformers creates the question answering pipeline.
The question and the context text are passed in.
The result is a dictionary with the keys 'score', 'start', 'end', and 'answer'; the answer text is printed.

16. Information extraction

Information extraction pulls useful, structured information out of unstructured text.

Code example:

from transformers import pipeline

# Create the extraction model (a BERT model fine-tuned for English NER on CoNLL-2003)
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Sample text
text = "Sargon was a king of Akkad."

# Extract the entities
entities = ner_pipeline(text)
print(entities)
# Prints a list of dictionaries, one per entity token, with the keys
# 'entity', 'score', 'index', 'word', 'start', and 'end'

Explanation:

transformers creates a token classification (NER) pipeline.
The pipeline extracts the entities in the text together with their types.
The extraction result is printed.

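17. Relation extraction

Relation extraction identifies the relationships between entities in a text.

Code example:

from transformers import pipeline

# transformers has no built-in relation extraction pipeline. As an approximation,
# candidate relations are scored here with a zero-shot (NLI) classifier.
classifier = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")

# Sample text
text = "Sargon was a king of Akkad."

# Candidate relations between "Sargon" and "Akkad" (illustrative labels)
candidate_relations = ["ruler of", "born in", "enemy of"]

# Score each hypothesis "Sargon was <relation> Akkad." against the text
result = classifier(
    text,
    candidate_labels=candidate_relations,
    hypothesis_template="Sargon was {} Akkad.",
)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")

Explanation:

transformers creates a zero-shot classification pipeline based on a natural language inference model.
The entity pair is fixed in the hypothesis template and the candidate relations are supplied as labels.
The model scores how well each candidate relation is supported by the text.
The relations are printed with their scores, highest first.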

18. Text clustering

Text clustering groups similar documents together without using labels.

Code example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sample documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# K-Means clustering
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)

# Evaluate cluster quality
silhouette_avg = silhouette_score(X, kmeans.labels_)
print(f"Silhouette Score: {silhouette_avg:.2f}")

# Print the cluster of each document
for i, doc in enumerate(documents):
    print(f"{doc} -> Cluster {kmeans.labels_[i]}")

Explanation:

TfidfVectorizer converts the documents to TF-IDF vectors.
KMeans groups the vectors into two clusters.
The silhouette score (between -1 and 1, higher is better) measures cluster quality.
The cluster assigned to each document is printed.

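19. Event detection

Event detection identifies specific events that are described in a text.

Code example:

from transformers import pipeline

# transformers has no built-in event extraction pipeline. As an approximation,
# the event type is chosen here with a zero-shot classifier.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Sample text
text = "The company announced a new product launch on Monday."

# Candidate event types (illustrative labels)
candidate_events = ["product launch", "merger or acquisition", "layoffs", "earnings report"]

# Detect the most likely event type
result = classifier(text, candidate_labels=candidate_events)
print(result["labels"][0], round(result["scores"][0], 3))

Explanation:

transformers creates a zero-shot classification pipeline.
The model scores each candidate event type against the text; this approach does not extract trigger words or event arguments.
The highest-scoring event type is printed with its score.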

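20. Part-of-speech tagging

Part-of-speech (POS) tagging labels each word in a text with its part of speech.

Code example:

from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Requires the tagger data: nltk.download('averaged_perceptron_tagger')

# Sample text
text = "John likes to watch movies. Mary likes movies too."

# Tokenize
tokens = word_tokenize(text)

# Tag each token
tagged_tokens = pos_tag(tokens)
print(tagged_tokens)
# Output:
# [('John', 'NNP'), ('likes', 'VBZ'), ('to', 'TO'), ('watch', 'VB'), ('movies', 'NNS'), ('.', '.'), ('Mary', 'NNP'), ('likes', 'VBZ'), ('movies', 'NNS'), ('too', 'RB'), ('.', '.')]

Explanation:

nltk tokenizes the text.
pos_tag assigns a Penn Treebank tag to each token.
The tagged result is printed.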

21. Dependency parsing

Dependency parsing analyzes the head-dependent relations between the words of a sentence.

Code example:

import spacy

# Load the pretrained model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "John likes to watch movies. Mary likes movies too."

# Process the text
doc = nlp(text)

# Print the dependency information of each token
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

# Each printed line has the form:
# <token> <dependency label> <head token> <head POS> <list of child tokens>
# For example, "John" is the nsubj of the head "likes" (VERB), and each "likes" is the ROOT of its sentence.

Explanation:

spacy performs the dependency parse.
For each word, the dependency label, the head (parent) node, and the child nodes are printed.

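22. Building a syntax tree

Building a syntax tree represents the grammatical structure of a sentence as a tree. The example below does shallow parsing (chunking): it groups tagged tokens into noun phrases.

Code example:

import nltk

# Sample text
text = "John likes to watch movies. Mary likes movies too."

# Tokenize
tokens = nltk.word_tokenize(text)

# POS tagging
tagged_tokens = nltk.pos_tag(tokens)

# Chunk grammar: optional determiner, any adjectives, then one or more nouns.
# <NN.*> also matches NNP and NNS, the noun tags that occur in this text.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged_tokens)

# Display the tree (opens a window; use print(result) in a headless environment)
result.draw()

Explanation:

nltk tokenizes and POS-tags the text.
RegexpParser builds the tree by matching a regular expression over the POS tags.
The draw method displays the tree.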

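23. Part-of-speech conversion

Part-of-speech conversion here means reducing a word to its base form according to its part of speech, for example turning the verb form "running" into "run".

Code example:

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Sample text
text = "running dogs are barking loudly."

# Tokenize (drop the final period, then split on whitespace)
tokens = text.rstrip('.').split()

# Lemmatize with a part of speech guessed from the word ending
lemmatizer = WordNetLemmatizer()
converted_tokens = []
for token in tokens:
    # Crude heuristic: words ending in "ing" are treated as verbs, everything else as nouns
    pos = wordnet.VERB if token.endswith('ing') else wordnet.NOUN
    converted_token = lemmatizer.lemmatize(token, pos=pos)
    converted_tokens.append(converted_token)

print(converted_tokens)
# Output:
# ['run', 'dog', 'are', 'bark', 'loudly']

Explanation:

WordNetLemmatizer reduces each word to the base form for the given part of speech.
The part of speech is guessed from the word ending, which is only a rough heuristic.
The converted result is printed.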
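
Guessing the part of speech from the word ending is fragile. A common alternative is to take the Penn Treebank tag produced by a POS tagger and map it to the single-letter codes that WordNet uses. The sketch below shows only that mapping; the function name penn_to_wordnet is hypothetical, and the tags in the sample list are written by hand for illustration rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r', or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective (same value as wordnet.ADJ)
    if tag.startswith("V"):
        return "v"  # verb (wordnet.VERB)
    if tag.startswith("R"):
        return "r"  # adverb (wordnet.ADV)
    return "n"      # default to noun (wordnet.NOUN)

# Hand-written (word, tag) pairs, for illustration only
tagged = [("running", "VBG"), ("dogs", "NNS"), ("are", "VBP"),
          ("barking", "VBG"), ("loudly", "RB")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('running', 'v'), ('dogs', 'n'), ('are', 'v'), ('barking', 'v'), ('loudly', 'r')]
```

The returned code can be passed as the pos argument of lemmatizer.lemmatize(word, pos=...).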

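Case study: sentiment analysis of e-commerce reviews

Suppose we are building a sentiment analysis system for an e-commerce platform that automatically determines the sentiment of user reviews. The steps are as follows:

1. Data collection:

Collect the user review data from the platform.

2. Data preprocessing:

Clean the text data and remove irrelevant information.
Tokenize and remove stop words.

3. Sentiment analysis:

Use SentimentIntensityAnalyzer to analyze the sentiment.
Compute a sentiment score for each review.

4. Result presentation:

Visualize the results, showing the proportions of positive, negative, and neutral reviews.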
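
The preprocessing step (step 2) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pipeline used by any particular platform: the stop word set is a tiny hand-picked sample rather than a full list, leftover HTML tags are assumed to be the main kind of irrelevant information, and the function name preprocess is hypothetical.

```python
import re

# Assumption: a tiny illustrative stop word set; a real system would use a full list
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "to", "of", "i", "was"}

def preprocess(comment):
    """Lowercase a review, strip HTML tags and non-letters, and drop stop words."""
    text = comment.lower()
    text = re.sub(r"<[^>]+>", " ", text)    # remove leftover HTML tags
    text = re.sub(r"[^a-z\s']", " ", text)  # keep letters, whitespace, and apostrophes
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]

print(preprocess("The delivery was FAST, and I love this phone!!! <br>"))
# ['delivery', 'fast', 'love', 'phone']
```

Cleaned tokens like these suit frequency counts and trained classifiers. VADER, however, treats capitalization and exclamation marks as intensity cues, so it is usually given the original review text rather than the cleaned tokens.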

Code example:

import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt

# Load the review data (the file is expected to have a 'comment' column)
data = pd.read_csv('reviews.csv')
comments = data['comment'].dropna().astype(str).tolist()

# Score each review with the compound sentiment score
sia = SentimentIntensityAnalyzer()
sentiments = []
for comment in comments:
    sentiment_scores = sia.polarity_scores(comment)
    sentiments.append(sentiment_scores['compound'])

# Count the reviews in each sentiment category
positive_count = sum(1 for score in sentiments if score > 0)
negative_count = sum(1 for score in sentiments if score < 0)
neutral_count = sum(1 for score in sentiments if score == 0)

# Visualize the result
labels = ['Positive', 'Negative', 'Neutral']
sizes = [positive_count, negative_count, neutral_count]
plt.figure(figsize=(8, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title('Sentiment Analysis of Product Reviews')
plt.show()

Explanation: