pandas中数据的合并
目录
- pd.concat函数
- concat中join参数的区别
- merge函数
- merge中indicator参数
pd.concat函数
pd.concat 是 Pandas 库中用于连接(concatenate)多个 Pandas 对象(例如 Series 和 DataFrame)的函数。它可以沿着指定的轴(行或列)将多个对象连接在一起。
这个函数的基本语法是:
pd.concat(objs, axis=0, join='outer', ignore_index=False)
- objs:要连接的 Pandas 对象列表,例如 DataFrame 或 Series。
- axis:指定连接的轴,0表示沿着行的方向连接,1 表示沿着列的方向连接。
- join:指定连接的方式,‘outer’ 表示取并集,‘inner’ 表示取交集。
- ignore_index:如果设置为 True,则忽略原始索引并生成新的整数索引。
示例
import numpy as np
import pandas as pddf1=pd.DataFrame(np.ones((3,4))*0,columns =['A','B','C','D'])
df2=pd.DataFrame(np.ones((3,4))*1,columns =['A','B','C','D'])
df3=pd.DataFrame(np.ones((3,4))*2,columns =['A','B','C','D'])print(df1)
print(df2)
print(df3)#concatenating
print(pd.concat([df1,df2,df3],axis=0,ignore_index =True))A B C D
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0A B C D
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0A B C D
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0A B C D
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
concat中join参数的区别
类似hivesql,inner 就是求交集,outer就是求并集
import numpy as np
import pandas as pddf1=pd.DataFrame(np.ones((3,4))*0,columns =['A','B','C','D'],index=[1,2,3])
df2=pd.DataFrame(np.ones((3,4))*1,columns =['A','B','C','D'],index=[2,3,4])
s1=pd.Series([1,2,3,4],index=['A','B','C','D'])##print(df1)
##print(df2)
##print(s1)## join [outer/inner]
res1 = pd.concat((df1,df2),ignore_index =True)#默认outer
print(res1)res2 = pd.concat([df1,df2],join ='outer',ignore_index =True) #默认outer
print(res2)
res3 = pd.concat([df1,df2],join ='inner',ignore_index =True,axis=1) #默认outer
print(res3)
res4 = pd.concat([df1,df2],axis=1)
print(res4)
res5 = pd.concat([df1, df2], axis=1).reindex(df1.index)
print(res5)A B C D
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0A B C D
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.00 1 2 3 4 5 6 7
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0A B C D A B C D
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0A B C D A B C D
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
merge函数
pd.merge() 是 Pandas 中用于合并数据的函数,类似于 SQL 中的 JOIN 操作。它允许您根据一个或多个键(key)将两个 DataFrame 连接在一起,从而进行数据整合和分析。
基本语法如下:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, validate=None)
参数说明:
- left 和 right:要合并的两个 DataFrame。
- how:指定连接方式,包括 ‘left’、‘right’、'outer’和 ‘inner’,默认为 ‘inner’。
- on:指定用来进行合并的列名,必须同时存在于两个 DataFrame 中。 left_on 和 right_on:分别指定左右两个 DataFrame 中用于合并的列名,当要合并的列名不同时使用。
- left_index 和right_index:如果为 True,则使用左/右 DataFrame 的索引作为连接键。
- suffixes:如果列名冲突,用于追加到重叠列名末尾的字符串元组。
- indicator:添加一个指示器列,显示每行的合并情况。
- validate:检查合并类型的有效性,可以是"one_to_one"、“one_to_many”、“many_to_one”、“many_to_many”。
import pandas as pd
left =pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})right =pd.DataFrame({'key':['K0','K1','K2','K3'],'C':['C0','C1','C2','C3'],'D':['D0','D1','D2','D3']})res =pd.merge(left,right,on='key')print(res)left1 =pd.DataFrame({'key1':['K0','K1','K2','K1'],'key0':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})right1 =pd.DataFrame({'key1':['K0','K2','K2','K3'],'key0':['K0','K1','K2','K3'],'C':['C0','C1','C2','C3'],'D':['D0','D1','D2','D3']})#how =[‘left’,‘right’,‘inner’,‘outer’]
res1 =pd.merge(left1,right1,on=['key1','key0'],how ='inner')
print(res1)key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3key1 key0 A B C D
0 K0 K0 A0 B0 C0 D0
1 K2 K2 A2 B2 C2 D2
merge中indicator参数
import pandas as pdleft1 =pd.DataFrame({'key1':['K0','K1','K2','K1'],'key0':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})right1 =pd.DataFrame({'key1':['K0','K2','K2','K3'],'key0':['K0','K1','K2','K3'],'C':['C0','C1','C2','C3'],'D':['D0','D1','D2','D3']})#indicatorres1 =pd.merge(left1,right1,on='key1',how ='outer',indicator =True)
print(res1)
res2 =pd.merge(left1,right1,on='key1',how ='outer',indicator ='indicator_column')
print(res2)key1 key0_x A B key0_y C D _merge
0 K0 K0 A0 B0 K0 C0 D0 both
1 K1 K1 A1 B1 NaN NaN NaN left_only
2 K1 K3 A3 B3 NaN NaN NaN left_only
3 K2 K2 A2 B2 K1 C1 D1 both
4 K2 K2 A2 B2 K2 C2 D2 both
5 K3 NaN NaN NaN K3 C3 D3 right_onlykey1 key0_x A B key0_y C D indicator_column
0 K0 K0 A0 B0 K0 C0 D0 both
1 K1 K1 A1 B1 NaN NaN NaN left_only
2 K1 K3 A3 B3 NaN NaN NaN left_only
3 K2 K2 A2 B2 K1 C1 D1 both
4 K2 K2 A2 B2 K2 C2 D2 both
5 K3 NaN NaN NaN K3 C3 D3 right_only
import pandas as pdleft =pd.DataFrame({'A':['A0','A1','A2'],'B':['B0','B1','B2']},index =['K0','K1','K2'])
right =pd.DataFrame({'C':['C0','C1','C2'],'D':['D0','D1','D2']},index =['K0','K2','K3'])res3 = pd.merge(left,right,left_index =True,right_index =True,how ='outer')
print(res3)
#
boys =pd.DataFrame({'K':['K0','K1','K2'],'age':[24,27,30]})
girls =pd.DataFrame({'K':['K0','K0','K3'],'age':['22','25','28']})
res4 = pd.merge(boys,girls,on ='K',suffixes =['_boys','_girls'],how ='outer')
print(res4)A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C1 D1
K3 NaN NaN C2 D2K age_boys age_girls
0 K0 24.0 22
1 K0 24.0 25
2 K1 27.0 NaN
3 K2 30.0 NaN
4 K3 NaN 28