Question

How do I use groupby.first() to replace the values of one DataFrame column with those of another column?

2020-02-19

57

python pandas dataframe replace pandas-groupby

I have a df like this:

               Value1     Value2
2008-01-01       -1          4
2008-01-01       -1          5
2008-01-03       -1          6
2008-02-25        0          7
2008-02-26       -1          8
2008-02-27        0          9
2008-03-02        5         10 
2008-03-16       -1         11
2008-03-17       -1         12 
2009-04-04       -1         13
2009-04-07        0         14

I want to check the first Value1 of each year-month. If it is less than 0, I want to replace Value2 with Value1 in that row. How can I do that?

In this example, the result should be:

               Value1     Value2
2008-01-01       -1         -1
2008-01-01       -1          5
2008-01-03       -1          6
2008-02-25        0          7
2008-02-26       -1          8
2008-02-27        0          9
2008-03-02        5         10 
2008-03-16       -1         11
2008-03-17       -1         12 
2009-04-04       -1         -1
2009-04-07        0         14

Only 2008-01 and 2009-04 have a negative first Value1. For 2008-02 and 2008-03 the first Value1 is non-negative (0 and 5), so those rows are left unchanged.

I tried:

g = df.groupby([df.index.year,df.index.month])

if g['value1'].first() < 0:
   g['value1'].first() = g['value2'].first()

It doesn't seem to work. Thanks.

Answer 1

Here is one way:

import numpy as np
import pandas as pd

df.index = pd.to_datetime(df.index)
# Format the dates as 'YYYY-MM' so that month changes are easy to spot
year_month = df.index.strftime('%Y-%m')
# Positions where a new year-month starts; position 0 always starts one
first_rows = np.r_[0, np.flatnonzero(year_month[:-1] != year_month[1:]) + 1]
# Keep only the first rows whose Value1 is negative
neg_rows = first_rows[df['Value1'].to_numpy()[first_rows] < 0]
# Overwrite Value1 with Value2 on those rows
df.iloc[neg_rows, df.columns.get_loc('Value1')] = df['Value2'].to_numpy()[neg_rows]

print(df)

            Value1  Value2
2008-01-01       5       5
2008-01-03      -1       6
2008-02-25       0       7
2008-02-26      -1       8
2008-02-27       0       9
2008-03-02       5      10
2008-03-16      -1      11
2008-03-17      -1      12
2009-04-04      13      13
2009-04-07       0      14

Answer 2

Here is another approach. Hope it helps:

import pandas as pd
import numpy as np
df = pd.DataFrame({'date':['2008-01-01','2008-01-03','2008-02-25','2008-02-26','2008-02-27','2008-03-02','2008-03-16','2008-03-17','2009-04-04','2009-04-07'],
                   'value1':[-1,-1,0,-1,0,5,-1,-1,-1,0],
                   'value2':[5,6,7,8,9,10,11,12,13,14]})
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
# First day present in each (year, month) group
df_a = df.groupby(['year', 'month'])['day'].first().reset_index()
df_a['aux'] = True
# The left merge flags the matching rows; every other row gets aux = NaN
df_output = df.merge(df_a, how='left', on=['year', 'month', 'day'])
# Overwrite value1 with value2 only on flagged rows whose value1 is negative
df_output['value1'] = np.where(df_output['aux'].notna() & (df_output['value1'] < 0),
                               df_output['value2'],
                               df_output['value1'])
df_output = df_output.drop(columns=['year', 'month', 'day', 'aux'])
print(df_output)

Output:

        date  value1  value2
0 2008-01-01       5       5
1 2008-01-03      -1       6
2 2008-02-25       0       7
3 2008-02-26      -1       8
4 2008-02-27       0       9
5 2008-03-02       5      10
6 2008-03-16      -1      11
7 2008-03-17      -1      12
8 2009-04-04      13      13
9 2009-04-07       0      14
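
The merge above matches on the day number, so two rows sharing the first date of a month would both be flagged. A shorter sketch that avoids the helper columns and the merge is shown below. It swaps in a different technique, `Series.duplicated` on the year-month period, and assumes the rows are already sorted by date; the names `ym` and `is_first` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2008-01-01", "2008-01-03", "2008-02-25", "2008-02-26", "2008-02-27",
        "2008-03-02", "2008-03-16", "2008-03-17", "2009-04-04", "2009-04-07",
    ]),
    "value1": [-1, -1, 0, -1, 0, 5, -1, -1, -1, 0],
    "value2": [5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
})

# Year-month of every row as a monthly period
ym = df["date"].dt.to_period("M")
# True for the first occurrence of each year-month (assumes sorted rows)
is_first = ~ym.duplicated()
# Same direction as the answer above: value1 takes value2 where the
# first row of the month has a negative value1
df["value1"] = df["value1"].mask(is_first & df["value1"].lt(0), df["value2"])
print(df)
```

Because `duplicated` marks only the first occurrence, exactly one row per month can change, even when the first date is repeated.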

Answer 3

My approach uses groupby().head() to extract the index and loc to update:

s = df.groupby(df.index.to_period('M'), as_index=False).head(1)
df.loc[s[s['Value1'].lt(0)].index, 'Value1'] = df['Value2']

Output:

            Value1  Value2
2008-01-01       5       5
2008-01-03      -1       6
2008-02-25       0       7
2008-02-26      -1       8
2008-02-27       0       9
2008-03-02       5      10
2008-03-16      -1      11
2008-03-17      -1      12
2009-04-04      13      13
2009-04-07       0      14
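
The frame in the question repeats the date 2008-01-01, and a label-based `loc` lookup would select both rows carrying that label. The sketch below is one way to sidestep that, assuming a positional boolean mask built with `groupby().cumcount()` is acceptable. It also follows the direction shown in the question's expected table (Value2 takes Value1) rather than the direction used in the answers; `is_first` and `mask` are illustrative names.

```python
import pandas as pd

# The 11-row frame from the question, including the repeated 2008-01-01 date
idx = pd.to_datetime([
    "2008-01-01", "2008-01-01", "2008-01-03", "2008-02-25", "2008-02-26",
    "2008-02-27", "2008-03-02", "2008-03-16", "2008-03-17", "2009-04-04",
    "2009-04-07",
])
df = pd.DataFrame(
    {"Value1": [-1, -1, -1, 0, -1, 0, 5, -1, -1, -1, 0],
     "Value2": [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]},
    index=idx,
)

# Positional flag: True only for the first row of each year-month group
is_first = df.groupby(df.index.to_period("M")).cumcount().eq(0).to_numpy()
# Rows to change: first of the month and Value1 negative
mask = is_first & df["Value1"].lt(0).to_numpy()

# Direction taken from the question's expected table: Value2 <- Value1.
# to_numpy() on the right-hand side skips index alignment entirely.
df.loc[mask, "Value2"] = df.loc[mask, "Value1"].to_numpy()
print(df)
```

To get the direction used in the answers instead, swap the two column names in the final assignment.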