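Common Pandas Operations in Python
Pandas is an open-source data analysis library built on top of Python. It provides powerful data structures and operations on them; the two main data structures are Series and DataFrame.
- Series: a one-dimensional labeled array, similar to a one-dimensional NumPy array but with an index of labels. It can hold strings, booleans, numbers, and other types.
- DataFrame: a two-dimensional tabular data structure, similar to a SQL table or an Excel worksheet. Each column can have a different data type (numeric, string, date, and so on), and the table has column names and a row index. DataFrame is the core data structure of Pandas and offers a rich set of data manipulation methods.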
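Series
Creating a Series
There are two main ways to create a Series: from a dictionary or from a list.
When created from a dictionary, the dictionary keys become the index of the Series by default.
When created from a list, the index is generated automatically, counting up from 0.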
import pandas as pd

if __name__ == '__main__':
    # Create from a dictionary
    data = {'a': 0, 'b': 1, 'c': 2}
    print(pd.Series(data))
    # a    0
    # b    1
    # c    2
    # dtype: int64

    # Create from a list
    data = [18, 30, 25, 40]
    print(pd.Series(data))
    # 0    18
    # 1    30
    # 2    25
    # 3    40
    # dtype: int64
Setting the index
We can also assign a new index to an existing Series. The new index must have the same length as the data.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    print(user_age)
    # Tom      18
    # Bob      30
    # Mary     25
    # James    40
    # dtype: int64
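As a small sketch of the same idea, the labels can also be passed when the Series is constructed, through the index parameter of pd.Series. The variable mismatch_rejected is introduced here only for illustration; it records that a label list of the wrong length is refused.

```python
import pandas as pd

# Pass the labels at construction time instead of assigning them afterwards.
user_age = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

# A label list whose length differs from the data raises a ValueError.
try:
    pd.Series([18, 30, 25, 40], index=["Tom", "Bob"])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(user_age["Mary"])
print(mismatch_rejected)
```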
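Adding data
Older versions of Pandas offered Series.append, but it was deprecated in pandas 1.4 and removed in pandas 2.0. The _append method still exists, but it is private and should not be relied on; pd.concat is the supported way to append data. Concatenation does not modify the original Series in place: it returns a new Series, so the result has to be assigned back.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    # pd.concat returns a new Series; assign it back to keep the result
    user_age = pd.concat([user_age, pd.Series({"Looking": 100})])
    print(user_age)
    # Tom         18
    # Bob         30
    # Mary        25
    # James       40
    # Looking    100
    # dtype: int64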
A simpler alternative is to assign a value to a label that does not exist yet, which adds a new entry.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    user_age.at["Looking"] = 99
    print(user_age)
    # Tom        18
    # Bob        30
    # Mary       25
    # James      40
    # Looking    99
    # dtype: int64
Modifying data
Values can be modified directly through the index label, or by locating the label with at or loc. They can also be modified by position with iloc; positions are the default 0, 1, 2, ... offsets and do not depend on the labels.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    user_age["Tom"] = 100
    user_age.at["Bob"] = 99
    user_age.loc["Mary"] = 98
    user_age.iloc[3] = 97  # modify by position
    print(user_age)
    # Tom      100
    # Bob       99
    # Mary      98
    # James     97
    # dtype: int64
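The difference between label-based and position-based access is easiest to see when the labels are themselves integers. The following is a minimal sketch; the scores Series and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: integer labels that do not match the positions.
scores = pd.Series([90, 80, 70], index=[10, 20, 30])

by_label = scores.loc[10]           # label 10 is the first element
by_position = scores.iloc[0]        # position 0 is also the first element
last_by_position = scores.iloc[-1]  # negative positions count from the end

print(by_label, by_position, last_by_position)
```

Here loc[0] would raise a KeyError because no label 0 exists, while iloc[0] always refers to the first element.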
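Deleting data
Use the drop method to delete entries by label. Like concatenation, drop returns a new Series, so the result is assigned back.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    user_age = user_age.drop("Tom")
    print(user_age)
    # Bob      30
    # Mary     25
    # James    40
    # dtype: int64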
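A short sketch of two variations on drop: passing a list removes several labels at once, and errors="ignore" skips labels that are not present instead of raising. The label "Nobody" is a made-up example of a missing key.

```python
import pandas as pd

user_age = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

# Drop several labels at once; the original Series is left untouched.
remaining = user_age.drop(["Tom", "Bob"])

# A missing label normally raises a KeyError; errors="ignore" skips it.
unchanged = user_age.drop("Nobody", errors="ignore")

print(remaining)
print(len(user_age), len(unchanged))
```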
Sorting data
A Series can be sorted by index or by value, in ascending or descending order.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    # user_age = user_age.sort_index(ascending=True)
    user_age = user_age.sort_values(ascending=False)
    print(user_age)
    # James    40
    # Bob      30
    # Mary     25
    # Tom      18
    # dtype: int64
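For comparison, here is a small sketch of sorting by index next to sorting by value. With string labels, sort_index orders them alphabetically; both methods return new objects and leave the original order intact.

```python
import pandas as pd

user_age = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

by_label = user_age.sort_index()         # alphabetical order of the labels
youngest_first = user_age.sort_values()  # ascending order of the values

print(by_label)
print(youngest_first)
```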
Querying data
For the most part a Series can be handled like a list; only a few operations are shown below. The dtype in the output is the data type of the values. If the values have mixed types (for example numbers and strings), the dtype is object.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    print(user_age[:2])
    # Tom    18
    # Bob    30
    # dtype: int64
    print(user_age[2:])
    # Mary     25
    # James    40
    # dtype: int64
    print(list(user_age.keys()))  # the list of index labels
    # ['Tom', 'Bob', 'Mary', 'James']
    print(user_age.values)
    # [18 30 25 40]
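Two of the points above can be sketched briefly: a Series with mixed value types reports dtype object, and beyond list-style slicing a Series can also be filtered with a boolean condition. The mixed Series is made up for illustration.

```python
import pandas as pd

user_age = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

# Mixed value types (int, str, float) fall back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)

# A comparison yields a boolean Series that can be used as a filter.
over_twenty = user_age[user_age > 20]
print(over_twenty)
```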
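Operations on data
Arithmetic
Addition, subtraction, multiplication, and division between two Series are aligned on matching index labels. If a label is missing from one of the Series, the result for that label is NaN. To avoid this, pass fill_value: whenever a label is missing on one side, fill_value is used in its place for the calculation.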
import pandas as pd

if __name__ == '__main__':
    data = [18, 30, 25, 40]
    s1 = pd.Series(data)
    s1.index = ["Tom", "Bob", "Mary", "James"]
    s2 = pd.Series([1, 2, 3, 4], index=["Tom", "Bob", "Mary", "Looking"])
    print(s1.add(s2))
    # Bob        32.0
    # James       NaN
    # Looking     NaN
    # Mary       28.0
    # Tom        19.0
    # dtype: float64
    print(s1.sub(s2, fill_value=0))
    # Bob        28.0
    # James      40.0
    # Looking    -4.0
    # Mary       22.0
    # Tom        17.0
    # dtype: float64
    print(s1.mul(s2))
    # Bob        60.0
    # James       NaN
    # Looking     NaN
    # Mary       75.0
    # Tom        18.0
    # dtype: float64
    print(s1.div(s2))
    # Bob        15.000000
    # James            NaN
    # Looking          NaN
    # Mary        8.333333
    # Tom        18.000000
    # dtype: float64
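As a quick sketch, the ordinary operators follow the same alignment rules as the methods: s1 + s2 gives the same result as s1.add(s2). The method form is needed only when a fill_value has to be supplied.

```python
import pandas as pd

s1 = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])
s2 = pd.Series([1, 2, 3, 4], index=["Tom", "Bob", "Mary", "Looking"])

via_operator = s1 + s2            # NaN where a label exists on one side only
via_method = s1.add(s2)           # same result as the + operator
filled = s1.add(s2, fill_value=0) # a missing side is treated as 0

print(via_operator)
print(filled)
```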
Descriptive statistics
import pandas as pd

if __name__ == '__main__':
    data = [18, 30, 25, 40]
    s1 = pd.Series(data)
    s1.index = ["Tom", "Bob", "Mary", "James"]
    print(s1.describe())
    # count     4.000000
    # mean     28.250000
    # std       9.251126
    # min      18.000000
    # 25%      23.250000
    # 50%      27.500000
    # 75%      32.500000
    # max      40.000000
    # dtype: float64
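When only one figure is needed, the individual statistics are also available as separate methods. A minimal sketch using the same data:

```python
import pandas as pd

s1 = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

average = s1.mean()      # arithmetic mean of the values
middle = s1.median()     # same as the 50% row of describe()
total = s1.sum()
oldest = s1.idxmax()     # index label of the largest value

print(average, middle, total, oldest)
```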
apply
apply takes a function, calls it on every value of the Series, and returns a new Series with the results (similar in spirit to Python's built-in map function).
import pandas as pd


def apply_function(age):
    return age + 100


if __name__ == '__main__':
    data = [18, 30, 25, 40]
    s1 = pd.Series(data)
    s1.index = ["Tom", "Bob", "Mary", "James"]
    s2 = s1.apply(apply_function)
    print(s2)
    # Tom      118
    # Bob      130
    # Mary     125
    # James    140
    # dtype: int64
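For simple arithmetic like this, the same result can be obtained with a vectorized expression instead of apply. The sketch below compares the two; apply remains useful when the per-value logic cannot be written as a plain expression.

```python
import pandas as pd

s1 = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

with_apply = s1.apply(lambda age: age + 100)  # calls the function per value
vectorized = s1 + 100                         # one operation on the whole Series

print(vectorized)
```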
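DataFrame
Creating a DataFrame
A DataFrame also gets a default index of 0, 1, 2, ..., and a different index can be specified at creation time. When a DataFrame is built from Series, the labels of each Series should match the index of the DataFrame. The order of the labels does not have to match: values are aligned by label, not by position. In the example below the age Series lists Bob before Tom, so after alignment Tom gets 30 and Bob gets 18, whereas the plain city list is assigned by position.
import pandas as pd

if __name__ == '__main__':
    # Create from a dictionary
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        # The Series labels are in a different order than the DataFrame index
        "age": pd.Series([18, 30, 25, 40], index=["Bob", "Tom", "Mary", "James"]),
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info)
    #        age       city
    # Tom     30    BeiJing
    # Bob     18   ShangHai
    # Mary    25  GuangZhou
    # James   40   ShenZhen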
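The label alignment described above can be sketched with a smaller, made-up example: a Series that is out of order and lacks one of the labels. The missing label ends up as NaN, which also turns the column into float64.

```python
import pandas as pd

# Hypothetical data: labels out of order, and no entry for "Mary".
age = pd.Series([18, 30], index=["Bob", "Tom"])

frame = pd.DataFrame({"age": age}, index=["Tom", "Bob", "Mary"])

print(frame)
```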
Inspecting data
import pandas as pd

if __name__ == '__main__':
    # Create from a dictionary
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info.values)
    # [[18 'BeiJing']
    #  [30 'ShangHai']
    #  [25 'GuangZhou']
    #  [40 'ShenZhen']]
    print(user_info.index)
    # Index(['Tom', 'Bob', 'Mary', 'James'], dtype='object')
    print(user_info.columns)
    # Index(['age', 'city'], dtype='object')
Swapping rows and columns
This is the same as transposing a matrix in mathematics.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info.T)
    #           Tom       Bob       Mary     James
    # age        18        30         25        40
    # city  BeiJing  ShangHai  GuangZhou  ShenZhen
Extracting data
Extracting columns
Select one or more columns; rows are not restricted, so all rows are returned. A list of column names returns a DataFrame, while a single column name returns a Series.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info[["city", "age"]])
    #             city  age
    # Tom      BeiJing   18
    # Bob     ShangHai   30
    # Mary   GuangZhou   25
    # James   ShenZhen   40
    print(user_info["city"])
    # Tom        BeiJing
    # Bob       ShangHai
    # Mary     GuangZhou
    # James     ShenZhen
    # Name: city, dtype: object
Extracting rows
Select one or more rows; columns are not restricted, so all columns are returned. Use loc to select by label and iloc to select by position.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info.loc["Tom"])
    # age          18
    # city    BeiJing
    # Name: Tom, dtype: object
    print(user_info.iloc[3])
    # age           40
    # city    ShenZhen
    # Name: James, dtype: object
    print(user_info.iloc[1:3])
    #       age       city
    # Bob    30   ShangHai
    # Mary   25  GuangZhou
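Two further forms of loc may be useful here, shown as a small sketch: a list of labels selects several non-adjacent rows, and a row label combined with a column label selects a single cell.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]},
    index=["Tom", "Bob", "Mary", "James"],
)

pair = user_info.loc[["Tom", "James"]]   # a DataFrame with two rows
bob_city = user_info.loc["Bob", "city"]  # a single value

print(pair)
print(bob_city)
```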
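Slicing rows and columns
Use : to specify a range of rows and columns. This works like list slicing, except that for a two-dimensional structure both the rows and the columns have to be specified. Note that label slices with loc include both endpoints, unlike list slices.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info.loc["Tom":"Mary", "age":"city"])
    #       age       city
    # Tom    18    BeiJing
    # Bob    30   ShangHai
    # Mary   25  GuangZhou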
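The sketch below contrasts label slicing with position slicing on the same DataFrame. With loc, both the start and the end label are part of the result; with iloc, the end position is excluded, as in ordinary Python slicing.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]},
    index=["Tom", "Bob", "Mary", "James"],
)

by_label = user_info.loc["Tom":"Mary", "age":"city"]  # end labels included
by_position = user_info.iloc[0:2, 0:1]                # end positions excluded

print(by_label.shape, by_position.shape)
```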
Conditions and filtering
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info["age"] > 20)  # produces a boolean mask that acts as a filter
    # Tom      False
    # Bob       True
    # Mary      True
    # James     True
    # Name: age, dtype: bool
    print(user_info[user_info["age"] > 20])
    #        age       city
    # Bob     30   ShangHai
    # Mary    25  GuangZhou
    # James   40   ShenZhen
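Several conditions can be combined into one mask. The sketch below uses & (and) with each condition in parentheses, plus isin for membership tests; the particular conditions are chosen only for illustration.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]},
    index=["Tom", "Bob", "Mary", "James"],
)

# Each condition needs its own parentheses because & binds tighter than >.
mask = (user_info["age"] > 20) & (user_info["city"] != "ShenZhen")
selected = user_info[mask]

# isin keeps the rows whose value appears in the given list.
in_cities = user_info[user_info["city"].isin(["BeiJing", "ShenZhen"])]

print(selected)
print(in_cities)
```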
Deleting data
Deleting rows
Delete the specified rows by row label.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    user_info = user_info.drop(["Tom"], axis=0)
    print(user_info)
    #        age       city
    # Bob     30   ShangHai
    # Mary    25  GuangZhou
    # James   40   ShenZhen
    user_info = user_info.drop(["James", "Mary"], axis=0)
    print(user_info)
    #      age      city
    # Bob   30  ShangHai
Deleting columns
Delete the specified columns by column name.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    user_info = user_info.drop(["age"], axis=1)
    print(user_info)
    #             city
    # Tom      BeiJing
    # Bob     ShangHai
    # Mary   GuangZhou
    # James   ShenZhen
Operations on data
apply
This works much like apply on a Series and can be used to transform the data in a column, for example to format values or to derive a new column through a simple conversion.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    user_info["new_age"] = user_info["age"].apply(lambda item: item + 100)
    print(user_info)
    #        age       city  new_age
    # Tom     18    BeiJing      118
    # Bob     30   ShangHai      130
    # Mary    25  GuangZhou      125
    # James   40   ShenZhen      140
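apply can also be called on the DataFrame itself. With axis=1 the function receives one row at a time, so it can combine several columns. A minimal sketch; the label column is a made-up name for illustration.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]},
    index=["Tom", "Bob", "Mary", "James"],
)

# axis=1 passes each row as a Series keyed by column name.
user_info["label"] = user_info.apply(
    lambda row: f"{row['city']}-{row['age']}", axis=1
)

print(user_info)
```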
concat
Similar to UNION ALL in a database: the rows of the inputs are stacked on top of each other, and duplicate rows are not removed.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    s1 = pd.DataFrame(data=data, index=index)
    index = pd.Index(data=["Looking", "Sandra"])
    data = {
        "age": [10, 20],
        "city": ["ChongQing", "XiAn"]
    }
    s2 = pd.DataFrame(data=data, index=index)
    s = pd.concat([s1, s2])
    print(s)
    #          age       city
    # Tom       18    BeiJing
    # Bob       30   ShangHai
    # Mary      25  GuangZhou
    # James     40   ShenZhen
    # Looking   10  ChongQing
    # Sandra    20       XiAn
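The sketch below illustrates two details of concat with made-up frames that share one identical row: ignore_index=True renumbers the result from 0, and duplicates are kept unless drop_duplicates is called afterwards.

```python
import pandas as pd

# Hypothetical frames; the ("ShangHai", 30) row appears in both.
first = pd.DataFrame({"age": [18, 30], "city": ["BeiJing", "ShangHai"]})
second = pd.DataFrame({"age": [30, 10], "city": ["ShangHai", "ChongQing"]})

stacked = pd.concat([first, second], ignore_index=True)  # index becomes 0..3
deduplicated = stacked.drop_duplicates()                 # removes the repeat

print(stacked)
print(deduplicated)
```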
merge
Similar to a JOIN in a database. By default merge performs an inner join on the given key. Columns with the same name on both sides get the suffixes _x and _y, and the original row labels are replaced by a new 0, 1, 2, ... index.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    s1 = pd.DataFrame(data=data, index=index)
    index = pd.Index(data=["Looking", "Sandra"])
    data = {
        "age": [10, 20],
        "city": ["ShangHai", "GuangZhou"]
    }
    s2 = pd.DataFrame(data=data, index=index)
    s = pd.merge(s1, s2, on="city")
    print(s)
    #    age_x       city  age_y
    # 0     30   ShangHai     10
    # 1     25  GuangZhou     20
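Other join types are selected with the how parameter, and the suffixes for clashing column names can be customized. A minimal sketch of a left join, using made-up frames and suffix names:

```python
import pandas as pd

left = pd.DataFrame({"city": ["BeiJing", "ShangHai"], "age": [18, 30]})
right = pd.DataFrame({"city": ["ShangHai", "XiAn"], "age": [10, 20]})

# how="left" keeps every row of the left frame; unmatched rows get NaN.
left_join = pd.merge(
    left, right, on="city", how="left", suffixes=("_left", "_right")
)

print(left_join)
```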
Generating JSON
The orient parameter controls the layout of the generated JSON string, so the output can be shaped to fit different needs.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    s1 = pd.DataFrame(data=data, index=index)
    print(s1.to_json(orient="records", indent=2))  # each row becomes one object; the objects form an array
    # [
    #   {
    #     "age":18,
    #     "city":"BeiJing"
    #   },
    #   {
    #     "age":30,
    #     "city":"ShangHai"
    #   },
    #   {
    #     "age":25,
    #     "city":"GuangZhou"
    #   },
    #   {
    #     "age":40,
    #     "city":"ShenZhen"
    #   }
    # ]
    print(s1.to_json(orient="split", indent=2))  # index, columns, and data are returned separately
    # {
    #   "columns":[
    #     "age",
    #     "city"
    #   ],
    #   "index":[
    #     "Tom",
    #     "Bob",
    #     "Mary",
    #     "James"
    #   ],
    #   "data":[
    #     [
    #       18,
    #       "BeiJing"
    #     ],
    #     [
    #       30,
    #       "ShangHai"
    #     ],
    #     [
    #       25,
    #       "GuangZhou"
    #     ],
    #     [
    #       40,
    #       "ShenZhen"
    #     ]
    #   ]
    # }
    print(s1.to_json(orient="values", indent=2))  # values only, as a two-dimensional array
    # [
    #   [
    #     18,
    #     "BeiJing"
    #   ],
    #   [
    #     30,
    #     "ShangHai"
    #   ],
    #   [
    #     25,
    #     "GuangZhou"
    #   ],
    #   [
    #     40,
    #     "ShenZhen"
    #   ]
    # ]
    print(s1.to_json(orient="index", indent=2))  # index labels are the first-level keys, columns the second-level keys
    # {
    #   "Tom":{
    #     "age":18,
    #     "city":"BeiJing"
    #   },
    #   "Bob":{
    #     "age":30,
    #     "city":"ShangHai"
    #   },
    #   "Mary":{
    #     "age":25,
    #     "city":"GuangZhou"
    #   },
    #   "James":{
    #     "age":40,
    #     "city":"ShenZhen"
    #   }
    # }
    print(s1.to_json(orient="columns", indent=2))  # columns are the first-level keys, index labels the second-level keys
    # {
    #   "age":{
    #     "Tom":18,
    #     "Bob":30,
    #     "Mary":25,
    #     "James":40
    #   },
    #   "city":{
    #     "Tom":"BeiJing",
    #     "Bob":"ShangHai",
    #     "Mary":"GuangZhou",
    #     "James":"ShenZhen"
    #   }
    # }
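Since to_json returns a string, it can be parsed back with the standard json module when Python objects are needed. A short sketch for two of the orientations:

```python
import json

import pandas as pd

s1 = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]},
    index=["Tom", "Bob", "Mary", "James"],
)

records = json.loads(s1.to_json(orient="records"))  # list of row dicts
by_index = json.loads(s1.to_json(orient="index"))   # dict keyed by row label

print(records[0])
print(by_index["James"])
```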