基础 Dataframe
1
2
3
4
5
6
7
8
df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
'height':[165, 70, 120, 80, 180, 172, 150],
'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
},
index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
age | color | food | height | score | state | |
---|---|---|---|---|---|---|
Jane | 30 | blue | Steak | 165 | 4.6 | NY |
Nick | 2 | green | Lamb | 70 | 8.3 | TX |
Aaron | 12 | red | Mango | 120 | 9.0 | FL |
Penelope | 4 | white | Apple | 80 | 3.3 | AL |
Dean | 32 | gray | Cheese | 180 | 1.8 | AK |
Christina | 33 | black | Melon | 172 | 9.5 | TX |
Cornelia | 69 | red | Beans | 150 | 2.2 | TX |
以下例子均基于以上 dataframe 展开。为避免修改源 Dataframe,先进行以上 dataframe 的复制。
1
df1 = df.copy()
基础用法
查看行列标签
查看行索引
1
df1.index
查看列名
1
df1.columns
选择行或列
注意,列表选择的是列,序列选择的是行!
选择多行
按行号取(不含结束点!)
1
df1[0: 2]
按行索引取(含结束点!)
1
df1['Jane': 'Nick']
age | color | food | height | score | state | |
---|---|---|---|---|---|---|
Jane | 30 | blue | Steak | 5’5” | 4.6 | NY |
Nick | 2 | green | Lamb | 2’4” | 8.3 | TX |
当然也可(建议使用)
1
df1.iloc[0: 2, :]
1
df1.loc['Jane': 'Nick', :]
选择多列
注意有两层中括号
1
df1[['color', 'food']]
color | food | |
---|---|---|
Jane | blue | Steak |
Nick | green | Lamb |
Aaron | red | Mango |
Penelope | white | Apple |
Dean | gray | Cheese |
Christina | black | Melon |
Cornelia | red | Beans |
当然也可(建议使用)
1
df1.loc[:, ['color', 'food']]
高级用法
布尔选择行
单条件
1
df1.loc[:, df1.loc['age'] > 10]
age | color | food | height | score | state | |
---|---|---|---|---|---|---|
Jane | 30 | blue | Steak | 165 | 4.6 | NY |
Aaron | 12 | red | Mango | 120 | 9.0 | FL |
Dean | 32 | gray | Cheese | 180 | 1.8 | AK |
Christina | 33 | black | Melon | 172 | 9.5 | TX |
Cornelia | 69 | red | Beans | 150 | 2.2 | TX |
条件:是否在某集合中
DataFrame.isin()
和 Series.within()
。
集合可以为列表(list),集合(set),数据系列(Series),以及数组(Numpy array)。
见 python - How to filter Pandas dataframe using ‘in’ and ‘not in’ like in SQL - Stack Overflow
多条件
注意每个条件加括号
1
df1.loc[(df1['age'] > 10) & (df1['food'].isin({'Steak', 'Lamb', 'Mango'})), :]
或者
1
df1.query("age > 10 and food in ['Steak', 'Lamb', 'Mango']")
age | color | food | height | score | state | |
---|---|---|---|---|---|---|
Jane | 30 | blue | Steak | 165 | 4.6 | NY |
Aaron | 12 | red | Mango | 120 | 9.0 | FL |
布尔选择元素
按条件过滤元素
1
2
df2 = pd.DataFrame({
'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
A | B | C | |
---|---|---|---|
0 | x | w | 0 |
1 | y | a | 1 |
2 | z | NaN | 2 |
3 | q | x | 3 |
1
2
c1 = ['x', 'w', 'p']
df2[df2[['A', 'B']].isin(c1)]
A | B | C | |
---|---|---|---|
0 | x | w | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | x | NaN |
按任一列满足即可的条件选择行
1
df2[df2[['A', 'B']].isin(c1).any(axis=1)]
A | B | C | |
---|---|---|---|
0 | x | w | 0 |
3 | q | x | 3 |
按所有列满足才可的条件选择行
1
df2[df2[['A', 'B']].isin(c1).all(axis=1)]
A | B | C | |
---|---|---|---|
0 | x | w | 0 |
还有基于 Numpy 和基于 list comprehension 的更多筛选方法,见 python - How to filter Pandas dataframe using ‘in’ and ‘not in’ like in SQL - Stack Overflow
批量更改列
字典映射法
不在字典中的映射为 NaN
1
2
3
dic = {'NY':'New York', 'TX':'Texas', 'FL':'Florida', 'AL':'Alabama'}
df1['state'].map(dic)
df1['state'] = df1['state'].map(dic)
age | color | food | height | score | state | |
---|---|---|---|---|---|---|
Jane | 30 | blue | Steak | 165 | 4.6 | New York |
Nick | 2 | green | Lamb | 70 | 8.3 | Texas |
Aaron | 12 | red | Mango | 120 | 9.0 | Florida |
Penelope | 4 | white | Apple | 80 | 3.3 | Alabama |
Dean | 32 | gray | Cheese | 180 | 1.8 | NaN |
Christina | 33 | black | Melon | 172 | 9.5 | Texas |
Cornelia | 69 | red | Beans | 150 | 2.2 | Texas |
若要让不在字典中的保留原值
1
df1['state']=df1['state'].map(dic).fillna(df['state'])
或者
1
df1['state'].replace(dic, inplace=True)
(来源:Stackoverflow)
函数法
其他参数放在元组里用 args 传入
1
2
3
4
5
6
7
8
9
10
def cm_to_feet_inches(size, int_inches=False):
feet = int(size // 30.48)
inches = size / 2.54 - feet * 12
if int_inches:
return "%d\'%d\"" % (feet, round(inches))
else:
return "%d\'%.2f\"" % (feet, inches)
df1['height'] = df1['height'].apply(cm_to_feet_inches, args=(True,))
age | color | food | height | score | state | |
---|---|---|---|---|---|---|
Jane | 30 | blue | Steak | 5’5” | 4.6 | NY |
Nick | 2 | green | Lamb | 2’4” | 8.3 | TX |
Aaron | 12 | red | Mango | 3’11” | 9.0 | FL |
Penelope | 4 | white | Apple | 2’7” | 3.3 | AL |
Dean | 32 | gray | Cheese | 5’11” | 1.8 | AK |
Christina | 33 | black | Melon | 5’8” | 9.5 | TX |
Cornelia | 69 | red | Beans | 4’11” | 2.2 | TX |
也可使用 lambda 方法
1
df1['height'] = df1['height'].apply(lambda x : cm_to_feet_inches(x, True))
遍历行
数据量大时性能较低,不建议使用
1
2
for index, row in df1.iterrows():
print(index, row['age'], row['color'], sep=', ')
1
2
3
4
5
6
7
Jane, 30, blue
Nick, 2, green
Aaron, 12, red
Penelope, 4, white
Dean, 32, gray
Christina, 33, black
Cornelia, 69, red
倒序遍历
1
2
for index, row in df1.iloc[::-1].iterrows():
print(index, row['age'], row['color'], sep=', ')
1
2
3
4
5
6
7
Cornelia, 69, red
Christina, 33, black
Dean, 32, gray
Penelope, 4, white
Aaron, 12, red
Nick, 2, green
Jane, 30, blue
注意事项
df.index() 返回的不是行数(从0开始那个),而是 index 内容的列表