How to split a dataframe column into more columns, conditioned on another column's value?
I am stuck because I cannot split a dataframe column into more columns conditioned on another column's value. I have a pandas dataframe generated directly from a ".csv" file with more than 100K rows.
Excerpt 1:
I want to split the column dca on "," (comma) into more columns. The number of splits is limited by the value in n_mppts.
Edit 2023-04-12:
I can successfully perform the column-split operation on the dataframe generated from this .csv file using the following code (thanks to @Abdulmajeed's solution):
def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)
df_dca_dcv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418643 entries, 0 to 418642
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pipe_id 418643 non-null int64
1 date 418643 non-null object
2 inverter_id 418643 non-null object
3 n_mppts 418643 non-null int64
4 dca 418538 non-null object
5 dcv 418538 non-null object
dtypes: int64(2), object(4)
memory usage: 19.2+ MB
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('{', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('}', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str)
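As a side note, "{" and "[" are regex metacharacters in pandas versions where str.replace defaults to regex matching; one way to avoid that entirely is stripping the enclosing braces from both ends. A minimal sketch, with toy data assumed to mimic the "{...}"-wrapped strings:

```python
import pandas as pd

# toy data assumed to look like the question's '{...}'-wrapped strings
s = pd.Series(['{2.3,2.3}', '{2.6,2.6}', None])
# str.strip removes the listed characters from both ends; NaN passes through
cleaned = s.str.strip('{}')
```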
Excerpt 2:
mppts_dca = df_dca_dcv.apply(split_dca, axis=1)
mppts_dca['dca_mppt_0'] = pd.to_numeric(mppts_dca[0], errors='coerce')
mppts_dca['dca_mppt_1'] = pd.to_numeric(mppts_dca[1], errors='coerce')
mppts_dca['dca_mppt_2'] = pd.to_numeric(mppts_dca[2], errors='coerce')
mppts_dca['dca_mppt_3'] = pd.to_numeric(mppts_dca[3], errors='coerce')
mppts_dca['dca_mppt_4'] = pd.to_numeric(mppts_dca[4], errors='coerce')
mppts_dca['dca_mppt_5'] = pd.to_numeric(mppts_dca[5], errors='coerce')
mppts_dca['dca_mppt_6'] = pd.to_numeric(mppts_dca[6], errors='coerce')
mppts_dca['dca_mppt_7'] = pd.to_numeric(mppts_dca[7], errors='coerce')
mppts_dca['dca_mppt_8'] = pd.to_numeric(mppts_dca[8], errors='coerce')
df_dca_dcv = pd.concat([df_dca_dcv, mppts_dca], axis=1)
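The nine per-column pd.to_numeric calls above can be collapsed into one pass. A sketch on assumed toy data shaped like the split result:

```python
import pandas as pd

# toy split result: string values plus a missing cell, as split_dca would produce
mppts = pd.DataFrame({0: ['2.3', '2.6'], 1: ['2.3', None]})
# convert every column at once and name them in one go
mppts_num = mppts.apply(pd.to_numeric, errors='coerce').add_prefix('dca_mppt_')
```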
Excerpt 3:
However, when I generate the dataframe from a pandas SQL query that specifies inverter_id = 'a2', I run into a problem, so the current solution does not succeed (the problem also occurs for other inverter_id values):
df_dca_dcv = pd.read_sql_query("select pipe_id,created_at as date,inverter_id,n_mppts,dca,dcv from inverters where inverter_id = 'a2' order by pipe_id, inverter_id, date;", con=con) # connected to a postgreSQL db
df_dca_dcv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16507 entries, 0 to 16506
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pipe_id 16507 non-null object
1 date 16507 non-null datetime64[ns]
2 inverter_id 16507 non-null object
3 n_mppts 16507 non-null int64
4 dca 16428 non-null object
5 dcv 16428 non-null object
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 773.9+ KB
The dca column's Dtype is still object, but its values now sit between "[ ]" instead of "{ }" (unlike Excerpt 1), and when I run:
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('[', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace(']', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str)
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[6], line 2
1 df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace("[", "")
----> 2 df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace("]", "")
File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
5568 if (
5569 name not in self._internal_names_set
5570 and name not in self._metadata
5571 and name not in self._accessors
5572 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5573 ):
5574 return self[name]
-> 5575 return object.__getattribute__(self, name)
File ~\Anaconda3\lib\site-packages\pandas\core\accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
179 if obj is None:
180 # we're accessing the attribute of the class, i.e., Dataset.geo
181 return self._accessor
--> 182 accessor_obj = self._accessor(obj)
183 # Replace the property with the accessor object. Inspired by:
184 # https://www.pydanny.com/cached-property.html
185 # We need to use object.__setattr__ because we overwrite __setattr__ on
186 # NDFrame
187 object.__setattr__(obj, self._name, accessor_obj)
File ~\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py:177, in StringMethods.__init__(self, data)
174 def __init__(self, data):
175 from pandas.core.arrays.string_ import StringDtype
--> 177 self._inferred_dtype = self._validate(data)
178 self._is_categorical = is_categorical_dtype(data.dtype)
179 self._is_string = isinstance(data.dtype, StringDtype)
File ~\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py:231, in StringMethods._validate(data)
228 inferred_dtype = lib.infer_dtype(values, skipna=True)
230 if inferred_dtype not in allowed_types:
--> 231 raise AttributeError("Can only use .str accessor with string values!")
232 return inferred_dtype
AttributeError: Can only use .str accessor with string values!
I anticipated this with the ".astype(str)" operation and then performed the ".str.replace(...)" operations. However, when I now look at the dataframe
Excerpt 4:
the dca column values are formatted differently from those in Excerpt 2 (e.g. "Decimal('2.2'),Decimal('2.2')..."). When I go on to execute
mppts_dca = df_dca_dcv.apply(split_dca, axis=1)
df_dca_dcv = pd.concat([df_dca_dcv, mppts_dca], axis=1)
df_dca_dcv['date'] = df_dca_dcv['date'].astype('datetime64[ns]')
df_dca_dcv['dca_mppt_0'] = pd.to_numeric(df_dca_dcv[0], errors='coerce')
df_dca_dcv['dca_mppt_1'] = pd.to_numeric(df_dca_dcv[1], errors='coerce')
the dca values are not carried over to the newly split columns, which (I guess) is because "pd.to_numeric(" cannot read "Decimal(...)":
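For what it's worth, if the cells really hold Python lists of Decimal (rather than strings), one possible workaround is converting them to floats directly instead of going through strings. A sketch with assumed toy data:

```python
from decimal import Decimal
import pandas as pd

# toy rows assumed to mimic what the SQL driver returns: lists of Decimal, or None
df = pd.DataFrame({'dca': [[Decimal('2.2'), Decimal('2.2')], None]})
# Decimal converts cleanly to float; treat missing rows as empty lists
df['dca'] = df['dca'].apply(lambda v: [float(x) for x in v] if v else [])
```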
Excerpt 5:
I tried all of the following methods to convert the dca column to a string:
METHOD1: df_dca_dcv['dca'] = df_dca_dcv['dca'].map(str) #produced same output format as before
METHOD2: df_dca_dcv['dca'] = df_dca_dcv['dca'].apply(str) #produced same output format as before
METHOD3: df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str) #generated the following error:
ValueError Traceback (most recent call last)
Cell In[6], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv['dca'].values.astype(str)
ValueError: setting an array element with a sequence
METHOD4: df_dca_dcv['dca'] = df_dca_dcv['dca'].values.astype(str) #generated same error as METHOD3
METHOD5: df_dca_dcv['dca'] = df_dca_dcv['dca'].applymap(str) #generated the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv['dca'].applymap(str)
File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
5568 if (
5569 name not in self._internal_names_set
5570 and name not in self._metadata
5571 and name not in self._accessors
5572 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5573 ):
5574 return self[name]
-> 5575 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'applymap'
METHOD6:
def convert_float_string(row):
    float_list = row['dca']
    if len(float_list) > 0:
        string_list = ["%.2f" % i for i in float_list]
    else:
        string_list = float('NaN')
    return string_list
df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1) #generated the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[8], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1)
File ~\Anaconda3\lib\site-packages\pandas\core\frame.py:8839, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
8828 from pandas.core.apply import frame_apply
8830 op = frame_apply(
8831 self,
8832 func=func,
(...)
8837 kwargs=kwargs,
8838 )
-> 8839 return op.apply().__finalize__(self, method="apply")
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:727, in FrameApply.apply(self)
724 elif self.raw:
725 return self.apply_raw()
--> 727 return self.apply_standard()
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:851, in FrameApply.apply_standard(self)
850 def apply_standard(self):
--> 851 results, res_index = self.apply_series_generator()
853 # wrap results
854 return self.wrap_results(results, res_index)
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:867, in FrameApply.apply_series_generator(self)
864 with option_context("mode.chained_assignment", None):
865 for i, v in enumerate(series_gen):
866 # ignore SettingWithCopy here in case the user mutates
--> 867 results[i] = self.f(v)
868 if isinstance(results[i], ABCSeries):
869 # If we have a view on v, we need to make a copy because
870 # series_generator will swap out the underlying data
871 results[i] = results[i].copy(deep=False)
Cell In[8], line 1, in <lambda>(row)
----> 1 df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1)
Cell In[6], line 3, in convert_float_string(row)
1 def convert_float_string(row):
2 float_list = row['dca']
----> 3 if len(float_list) > 0:
4 string_list = ["%.2f" % i for i in float_list]
5 else:
TypeError: object of type 'NoneType' has no len()
...and if I simply skip converting dca to a string and use
df_dca_dcv['dca'] = df_dca_dcv['dca'].replace("[", "")
df_dca_dcv['dca'] = df_dca_dcv['dca'].replace("]", "")
the replacements are not applied where needed.
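(One likely reason: Series.replace without the .str accessor matches entire cell values by default, not substrings, so those two lines only change cells that are exactly "[" or "]". A sketch:)

```python
import pandas as pd

s = pd.Series(['[2.3,2.3]', '['])
# Series.replace (no .str) matches whole values by default,
# so only the cell that is exactly '[' is replaced
out = s.replace('[', '')
```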
I would greatly appreciate any suggestions on how to solve this problem.
I'm not sure I understood your question correctly, but you can combine a custom function with apply. An example follows; hope it helps:
def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)
split_columns = df.apply(split_dca, axis=1)
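A self-contained run of the above on a tiny frame (column names taken from the question, values assumed):

```python
import pandas as pd

def split_dca(row):
    # split on ',', then pad/trim to exactly n_mppts values
    values = row['dca'].split(',') if row['dca'] else []
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)

df = pd.DataFrame({'dca': ['2.3,2.3', None], 'n_mppts': [2, 2]})
split_columns = df.apply(split_dca, axis=1)
```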
Update
They can have 0 to 11 elements and the split operation should filter only the 'n' first elements from left to right, where 'n' = row['n_mppts']
Since dca has a variable length, you can use the following code:
import numpy as np
import pandas as pd

# Part 0: fix special cases
mask = df['dca'].isna()
df.loc[mask, 'dca'] = df.loc[mask, 'dca'].apply(lambda x: [])
lens = df['dca'].str.len().values # get the length of each array
n_mppts = df['n_mppts'].mask(df['n_mppts'].gt(lens), lens)
# Part 1: pad each array to be stacked
nrows, ncols = len(df), int(lens.max())
dca = np.zeros((nrows, ncols)) # create a 0s target array
mask = lens[:, None] > np.arange(ncols)
dca[mask] = np.concatenate(df['dca']).astype(float) # copy data
# Part 2: keep values according n_mppts
mask = n_mppts.values[:, None] <= np.arange(ncols)
dca[mask] = np.nan
dca_df = pd.DataFrame(dca).add_prefix('dca_mppt_')
dca_df
Output:
dca_mppt_0 dca_mppt_1 dca_mppt_2 dca_mppt_3 dca_mppt_4 dca_mppt_5 dca_mppt_6 dca_mppt_7 dca_mppt_8 dca_mppt_9 dca_mppt_10
0 2.3 2.3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2.6 2.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2.9 2.9 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 6.0 5.9 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 3.9 3.9 NaN NaN NaN NaN NaN NaN NaN NaN NaN
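The padding mask used in Part 1 above can be seen in isolation; a sketch with assumed per-row lengths:

```python
import numpy as np

lens = np.array([2, 1, 3])          # per-row array lengths (assumed)
ncols = int(lens.max())
# row i gets lens[i] leading True cells: the positions where real data lands
mask = lens[:, None] > np.arange(ncols)
```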
What you receive from read_sql_query is a list of Decimal instances.
It seems that dca and dcv are each 11 items long. You can use numpy to get the expected output in a vectorized way:
import numpy as np
import pandas as pd

dca = np.vstack(df['dca']).astype(float)
mask = df['n_mppts'].values[:, None] <= np.arange(dca.shape[1])
dca[mask] = np.nan
dca_df = pd.DataFrame(dca).add_prefix('dca_mppt_')
Output:
>>> dca_df
dca_mppt_0 dca_mppt_1 dca_mppt_2 dca_mppt_3 dca_mppt_4 dca_mppt_5 dca_mppt_6 dca_mppt_7 dca_mppt_8 dca_mppt_9 dca_mppt_10
0 2.3 2.3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2.6 2.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2.9 2.9 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 6.0 5.9 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 3.9 3.9 NaN NaN NaN NaN NaN NaN NaN NaN NaN
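A runnable check of this approach on assumed toy data (11-element Decimal lists, as described above):

```python
import numpy as np
import pandas as pd
from decimal import Decimal

# two toy rows of 11 Decimal values each (assumed shape)
df = pd.DataFrame({
    'dca': [[Decimal('2.3')] * 11, [Decimal('2.6')] * 11],
    'n_mppts': [2, 2],
})
dca = np.vstack(df['dca']).astype(float)                 # (rows, 11) float array
mask = df['n_mppts'].values[:, None] <= np.arange(dca.shape[1])
dca[mask] = np.nan                                       # blank out beyond n_mppts
dca_df = pd.DataFrame(dca).add_prefix('dca_mppt_')
```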