
How to split a DataFrame column into more columns, conditional on another column's value?

2023-04-04

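I'm stuck because I can't split a DataFrame column into more columns conditional on another column's value. I have a pandas DataFrame generated directly from a ".csv" file with more than 100K rows.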

Excerpt 1:

[image]

I want to split the dca column on "," (comma) into more columns. The number of resulting columns is limited by the value in n_mppts.

Edit 2023-04-12:

I can successfully split the column in the DataFrame generated from this .csv file with the following code (thanks to @Abdulmajeed's solution):

def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)

df_dca_dcv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418643 entries, 0 to 418642
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   pipe_id      418643 non-null  int64 
 1   date         418643 non-null  object
 2   inverter_id  418643 non-null  object
 3   n_mppts      418643 non-null  int64 
 4   dca          418538 non-null  object
 5   dcv          418538 non-null  object
dtypes: int64(2), object(4)
memory usage: 19.2+ MB

df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('{', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('}', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str)

Excerpt 2:

[image]

mppts_dca = df_dca_dcv.apply(split_dca, axis=1)
mppts_dca['dca_mppt_0'] = pd.to_numeric(mppts_dca[0], errors='coerce')
mppts_dca['dca_mppt_1'] = pd.to_numeric(mppts_dca[1], errors='coerce')
mppts_dca['dca_mppt_2'] = pd.to_numeric(mppts_dca[2], errors='coerce')
mppts_dca['dca_mppt_3'] = pd.to_numeric(mppts_dca[3], errors='coerce')
mppts_dca['dca_mppt_4'] = pd.to_numeric(mppts_dca[4], errors='coerce')
mppts_dca['dca_mppt_5'] = pd.to_numeric(mppts_dca[5], errors='coerce')
mppts_dca['dca_mppt_6'] = pd.to_numeric(mppts_dca[6], errors='coerce')
mppts_dca['dca_mppt_7'] = pd.to_numeric(mppts_dca[7], errors='coerce')
mppts_dca['dca_mppt_8'] = pd.to_numeric(mppts_dca[8], errors='coerce')
df_dca_dcv = pd.concat([df_dca_dcv, mppts_dca], axis=1)

Excerpt 3:

[image]

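However, when I generate the DataFrame from a pandas SQL query that specifies inverter_id = 'a2', I run into a problem and the current solution no longer works (the same problem occurs for other inverter_id values):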

df_dca_dcv = pd.read_sql_query("select pipe_id,created_at as date,inverter_id,n_mppts,dca,dcv from inverters where inverter_id = 'a2' order by pipe_id, inverter_id, date;", con=con) # connected to a postgreSQL db
df_dca_dcv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16507 entries, 0 to 16506
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   pipe_id      16507 non-null  object        
 1   date         16507 non-null  datetime64[ns]
 2   inverter_id  16507 non-null  object        
 3   n_mppts      16507 non-null  int64         
 4   dca          16428 non-null  object        
 5   dcv          16428 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 773.9+ KB

The dca dtype is still object, but its values are now enclosed in "[ ]" instead of "{ }" (unlike Excerpt 1), and when I run this:

df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('[', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace(']', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str)

I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 2
      1 df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace("[", "")
----> 2 df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace("]", "")

File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

File ~\Anaconda3\lib\site-packages\pandas\core\accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
    179 if obj is None:
    180     # we're accessing the attribute of the class, i.e., Dataset.geo
    181     return self._accessor
--> 182 accessor_obj = self._accessor(obj)
    183 # Replace the property with the accessor object. Inspired by:
    184 # https://www.pydanny.com/cached-property.html
    185 # We need to use object.__setattr__ because we overwrite __setattr__ on
    186 # NDFrame
    187 object.__setattr__(obj, self._name, accessor_obj)

File ~\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py:177, in StringMethods.__init__(self, data)
    174 def __init__(self, data):
    175     from pandas.core.arrays.string_ import StringDtype
--> 177     self._inferred_dtype = self._validate(data)
    178     self._is_categorical = is_categorical_dtype(data.dtype)
    179     self._is_string = isinstance(data.dtype, StringDtype)

File ~\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py:231, in StringMethods._validate(data)
    228 inferred_dtype = lib.infer_dtype(values, skipna=True)
    230 if inferred_dtype not in allowed_types:
--> 231     raise AttributeError("Can only use .str accessor with string values!")
    232 return inferred_dtype

AttributeError: Can only use .str accessor with string values!

I then moved the ".astype(str)" call ahead of the ".str.replace(...)" calls. However, when I look at the DataFrame now

Excerpt 4:

[image]

the dca values have a different format from the one in Excerpt 2 (e.g. "Decimal('2.2'),Decimal('2.2')..."). When I go on to run

mppts_dca = df_dca_dcv.apply(split_dca, axis=1)
df_dca_dcv = pd.concat([df_dca_dcv, mppts_dca], axis=1)
df_dca_dcv['date'] = df_dca_dcv['date'].astype('datetime64[ns]')
df_dca_dcv['dca_mppt_0'] = pd.to_numeric(df_dca_dcv[0], errors='coerce')
df_dca_dcv['dca_mppt_1'] = pd.to_numeric(df_dca_dcv[1], errors='coerce')

the dca values are not passed to the newly split columns, (I guess) because "pd.to_numeric(" cannot read "Decimal(...)":

Excerpt 5:

[image]
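
The guess above can be checked on a toy value. The sketch below does not use the asker's data; the sample list and the variable names are made up for illustration. It suggests that turning a list of Decimal objects into text keeps the "Decimal('...')" wrapper, and that pd.to_numeric with errors='coerce' then converts every piece to NaN:

```python
from decimal import Decimal

import pandas as pd

# Toy stand-in for one row of the dca column as returned by the SQL query
raw = pd.Series([[Decimal("2.2"), Decimal("2.1")]])

# str() of a list uses repr() for its elements, so the wrapper survives
as_text = raw.map(str)

# Strip the brackets and split on commas, as in the CSV workflow
pieces = as_text.str.strip("[]").str.split(",", expand=True)

# "Decimal('2.2')" is not a parseable number, so coercion yields NaN
numeric = pieces.apply(pd.to_numeric, errors="coerce")

print(as_text[0])
print(numeric)
```

Converting the Decimal objects to float before any string handling avoids this.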

I tried all of the following methods to convert the dca column to string:

METHOD1: df_dca_dcv['dca'] = df_dca_dcv['dca'].map(str) #produced same output format as before
METHOD2: df_dca_dcv['dca'] = df_dca_dcv['dca'].apply(str) #produced same output format as before
METHOD3: df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str) #generated the following error:

ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv['dca'].values.astype(str)
ValueError: setting an array element with a sequence

METHOD4: df_dca_dcv['dca'] = df_dca_dcv['dca'].values.astype(str) #generated same error as METHOD3

METHOD5: df_dca_dcv['dca'] = df_dca_dcv['dca'].applymap(str)  #generated the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv['dca'].applymap(str)
File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'applymap'

METHOD6:
def convert_float_string(row):
    float_list = row['dca']
    if len(float_list) > 0:
        string_list = ["%.2f" % i for i in float_list]
    else:
        string_list = float('NaN')
    return string_list
df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1) #generated the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1)
File ~\Anaconda3\lib\site-packages\pandas\core\frame.py:8839, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   8828 from pandas.core.apply import frame_apply
   8830 op = frame_apply(
   8831     self,
   8832     func=func,
   (...)
   8837     kwargs=kwargs,
   8838 )
-> 8839 return op.apply().__finalize__(self, method="apply")
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:727, in FrameApply.apply(self)
    724 elif self.raw:
    725     return self.apply_raw()
--> 727 return self.apply_standard()
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:851, in FrameApply.apply_standard(self)
    850 def apply_standard(self):
--> 851     results, res_index = self.apply_series_generator()
    853     # wrap results
    854     return self.wrap_results(results, res_index)
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:867, in FrameApply.apply_series_generator(self)
    864 with option_context("mode.chained_assignment", None):
    865     for i, v in enumerate(series_gen):
    866         # ignore SettingWithCopy here in case the user mutates
--> 867         results[i] = self.f(v)
    868         if isinstance(results[i], ABCSeries):
    869             # If we have a view on v, we need to make a copy because
    870             #  series_generator will swap out the underlying data
    871             results[i] = results[i].copy(deep=False)
Cell In[8], line 1, in <lambda>(row)
----> 1 df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1)
Cell In[6], line 3, in convert_float_string(row)
      1 def convert_float_string(row):
      2     float_list = row['dca']
----> 3     if len(float_list) > 0:
      4         string_list = ["%.2f" % i for i in float_list]
      5     else:
TypeError: object of type 'NoneType' has no len()
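
The TypeError above points at rows where dca is None (the info() output shows fewer non-null dca values than rows). A minimal None-safe sketch of the same idea follows; the helper name format_values and the toy data are hypothetical, and the function is mapped over the column instead of being applied to whole rows:

```python
from decimal import Decimal

import pandas as pd


def format_values(values):
    """Format each number with two decimals; a missing row becomes an empty list."""
    if values is None:
        return []
    return ["%.2f" % v for v in values]


# Toy column: one list of Decimal objects and one missing entry
toy = pd.Series([[Decimal("2.2"), Decimal("3.5")], None])
formatted = toy.map(format_values)
print(formatted.tolist())
```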

...And if I simply skip converting dca to string and use

df_dca_dcv['dca'] = df_dca_dcv['dca'].replace("[", "")
df_dca_dcv['dca'] = df_dca_dcv['dca'].replace("]", "")

the replacement does not take place.
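
One likely reason is that Series.replace matches whole cell values by default, while Series.str.replace works on substrings. The toy Series below is made up for illustration and holds plain strings; in the question the cells are lists, which would not equal "[" either:

```python
import pandas as pd

s = pd.Series(["[2.3,2.3]", "["])

# Series.replace only swaps cells whose entire value equals "["
whole = s.replace("[", "")

# Series.str.replace removes the character inside each string
sub = s.str.replace("[", "", regex=False)

print(whole.tolist())
print(sub.tolist())
```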

I would greatly appreciate any suggestions on how to solve this problem.

2 Answers

I'm not sure I understood your question correctly, but you can use a custom function together with apply. Here is an example:

Hope this helps.

def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    
    values = values[:row['n_mppts']]
    
    return pd.Series(values)

split_columns = df.apply(split_dca, axis=1)
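
A row-wise apply may be slow on several hundred thousand rows. As an alternative to the function above (not this answer's method), here is a vectorized sketch for the CSV case. It assumes dca holds strings such as "{2.3,2.3,0.0}"; the toy data and variable names are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n_mppts": [2, 3, 1],
    "dca": ["{2.3,2.3,0.0}", "{1.0,1.1,1.2}", None],
})

# Strip the braces and split every string into one column per element
parts = df["dca"].str.strip("{}").str.split(",", expand=True)
parts = parts.apply(pd.to_numeric, errors="coerce")

# keep[i, j] is True when column j is among the first n_mppts of row i
keep = np.arange(parts.shape[1]) < df["n_mppts"].to_numpy()[:, None]

# Positions outside the allowed range become NaN
result = parts.where(keep).add_prefix("dca_mppt_")
print(result)
```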
Abdulmajeed
2023-04-04

Update

They can have 0 to 11 elements and the split operation should filter only the 'n' first elements from left to right, where 'n' = row['n_mppts']

Since dca has a variable length, you can use the following code:

import numpy as np
import pandas as pd

# Part 0: fix special cases
mask = df['dca'].isna()
df.loc[mask, 'dca'] = df.loc[mask, 'dca'].apply(lambda x: [])
lens = df['dca'].str.len().values  # get the length of each array
n_mppts = df['n_mppts'].mask(df['n_mppts'].gt(lens), lens)

# Part 1: pad each array to be stacked
nrows, ncols = len(df), int(lens.max())
dca = np.zeros((nrows, ncols))  # create a 0s target array
mask = lens[:, None] > np.arange(ncols)
dca[mask] = np.concatenate(df['dca']).astype(float)  # copy data

# Part 2: keep only the first n_mppts values of each row
mask = n_mppts.values[:, None] <= np.arange(ncols)
dca[mask] = np.nan
dca_df = pd.DataFrame(dca).add_prefix('dca_mppt_')
dca_df

Output:

   dca_mppt_0  dca_mppt_1  dca_mppt_2  dca_mppt_3  dca_mppt_4  dca_mppt_5  dca_mppt_6  dca_mppt_7  dca_mppt_8  dca_mppt_9  dca_mppt_10
0         2.3         2.3         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
1         2.6         2.6         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
2         2.9         2.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
3         6.0         5.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
4         3.9         3.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
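
The padding step in Part 1 can be easier to follow on a tiny input. The sketch below uses made-up lists in place of the dca column, and fills the target with NaN instead of zeros, which is a variation on the code above and not the same thing:

```python
import numpy as np

# Toy ragged rows standing in for the dca column
rows = [[2.3, 2.3], [], [1.0, 1.1, 1.2]]

lens = np.array([len(r) for r in rows])
ncols = int(lens.max())

# Target array, pre-filled with NaN so short rows stay padded
padded = np.full((len(rows), ncols), np.nan)

# mask[i, j] is True for the first lens[i] positions of row i
mask = lens[:, None] > np.arange(ncols)

# Boolean assignment fills the True cells in row-major order,
# which matches the order of the flattened values
padded[mask] = np.concatenate(rows)
print(padded)
```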

What you get back from read_sql_query is a list of Decimal instances.
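
One way to confirm what an object column really contains is to look at the Python type of each cell. The following is a minimal sketch with toy data; the variable names are illustrative, and None stands in for the rows where dca is missing:

```python
from decimal import Decimal

import pandas as pd

df = pd.DataFrame({"dca": [[Decimal("2.3"), Decimal("2.3")], None]})

# Python type of each cell; dtype "object" alone does not reveal this
cell_types = df["dca"].map(lambda v: type(v).__name__)

# Convert each list of Decimal objects to plain floats, empty list if missing
as_float = df["dca"].map(
    lambda v: [float(x) for x in v] if isinstance(v, list) else []
)

print(cell_types.tolist())
print(as_float.tolist())
```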

It seems that dca and dcv both have a length of 11 items. You can use numpy to get the expected output in a vectorized way:

dca = np.vstack(df['dca']).astype(float)
mask = df['n_mppts'].values[:, None] <= np.arange(dca.shape[1])  # mask must match dca's 11 columns
dca[mask] = np.nan
dca_df = pd.DataFrame(dca).add_prefix('dca_mppt_')

Output:

>>> dca_df
   dca_mppt_0  dca_mppt_1  dca_mppt_2  dca_mppt_3  dca_mppt_4  dca_mppt_5  dca_mppt_6  dca_mppt_7  dca_mppt_8  dca_mppt_9  dca_mppt_10
0         2.3         2.3         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
1         2.6         2.6         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
2         2.9         2.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
3         6.0         5.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
4         3.9         3.9         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN          NaN
Corralien
2023-04-19