
Inconsistent Data

This post fixes some common data inconsistencies.
For example, "usa" and "USA" refer to the same thing, so we make them identical in the data.

# modules we'll use
import pandas as pd
import numpy as np

# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process
import charset_normalizer

# read in all our data
professors = pd.read_csv("../input/pakistan-intellectual-capital/pakistan_intellectual_capital.csv")

# set seed for reproducibility
np.random.seed(0)
# get all the unique values in the 'Country' column
countries = professors['Country'].unique()

# sort them alphabetically and then take a closer look
countries.sort()
countries

array([' Germany', ' New Zealand', ' Sweden', ' USA', 'Australia',
       'Austria', 'Canada', 'China', 'Finland', 'France', 'Greece',
       'HongKong', 'Ireland', 'Italy', 'Japan', 'Macau', 'Malaysia',
       'Mauritius', 'Netherland', 'New Zealand', 'Norway', 'Pakistan',
       'Portugal', 'Russian Federation', 'Saudi Arabia', 'Scotland',
       'Singapore', 'South Korea', 'SouthKorea', 'Spain', 'Sweden',
       'Thailand', 'Turkey', 'UK', 'USA', 'USofA', 'Urbana', 'germany'],
      dtype=object)
# convert to lower case
professors['Country'] = professors['Country'].str.lower()

# remove leading and trailing white space
professors['Country'] = professors['Country'].str.strip()
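To see what these two steps do, here is a minimal sketch on a toy Series. The values are hypothetical stand-ins and do not come from the professors dataset; the point is only that case and padding variants collapse into a single value.

```python
import pandas as pd

# Toy data (hypothetical), mimicking the kinds of variants seen in 'Country'
raw = pd.Series([" Germany", "germany", "USA", " USA", "usa"], name="Country")

# Same two steps as above: lower-case, then strip surrounding white space
cleaned = raw.str.lower().str.strip()

print(raw.nunique())              # distinct values before cleaning
print(cleaned.nunique())          # distinct values after cleaning
print(sorted(cleaned.unique()))   # the surviving spellings
```

Variants that differ by more than case or padding, such as "southkorea" versus "south korea", are not merged by this step.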

Using the fuzzywuzzy library

Fuzzy matching: finding strings that are close to a target string without being identical.

# the 'Country' column was cleaned above, so refresh the unique values first
countries = professors['Country'].unique()

# get the top 10 closest matches to "south korea"
matches = fuzzywuzzy.process.extract("south korea", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

[('south korea', 100),
 ('southkorea', 48),
 ('saudi arabia', 43),
 ('norway', 35),
 ('austria', 33),
 ('ireland', 33),
 ('pakistan', 32),
 ('portugal', 32),
 ('scotland', 32),
 ('australia', 30)]
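To give a feel for where such scores come from, the sketch below approximates a token-sort ratio with the standard library only. It is not fuzzywuzzy's code: `approx_token_sort_ratio` is a hypothetical helper, and it assumes the score is a plain `difflib.SequenceMatcher` ratio computed after lower-casing and sorting the words. Scores from the real library may differ.

```python
import re
from difflib import SequenceMatcher

def approx_token_sort_ratio(a: str, b: str) -> int:
    """Rough stand-in for a token-sort ratio (an approximation, not fuzzywuzzy itself)."""
    def normalize(s: str) -> str:
        # lower-case, turn non-alphanumerics into spaces, then sort the words
        words = re.sub(r"[^0-9a-z]+", " ", s.lower()).split()
        return " ".join(sorted(words))
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)  # scale to 0..100

for candidate in ["south korea", "korea south", "southkorea", "norway"]:
    print(candidate, approx_token_sort_ratio("south korea", candidate))
```

Because the words are sorted before comparison, "korea south" scores the same as "south korea". The missing space in "southkorea" leaves only one shared run of characters after sorting, which is why its score is so much lower.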
# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio=47):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

# use the function we just wrote to replace close matches to "south korea" with "south korea"
replace_matches_in_column(df=professors, column='Country', string_to_match="south korea")
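The default `min_ratio` of 47 matters more than it may look. As a small illustration, the sketch below reuses the first few scores from the match listing earlier in this post and applies the same `>=` filter at two thresholds. `close_matches_at` is a hypothetical helper written for this example, not part of fuzzywuzzy.

```python
# (value, score) pairs copied from the match listing for "south korea"
matches = [
    ("south korea", 100),
    ("southkorea", 48),
    ("saudi arabia", 43),
    ("norway", 35),
]

def close_matches_at(matches, min_ratio):
    # same filter as in replace_matches_in_column: keep scores >= min_ratio
    return [value for value, score in matches if score >= min_ratio]

print(close_matches_at(matches, 47))  # only the two spellings of South Korea
print(close_matches_at(matches, 40))  # a lower threshold also catches 'saudi arabia'
```

With a gap of only five points between "southkorea" and "saudi arabia", it seems prudent to inspect the candidate list before overwriting values, since the function modifies the dataframe in place.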

http://www.mrgr.cn/news/62522.html
