Import multiple CSV files into pandas and concatenate them into one DataFrame
2014-01-03
985982
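I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out, though. Here is what I have so far: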
import glob
import pandas as pd
# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
I guess I need some help within the for loop?
3 Answers
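See pandas: IO tools for all of the available .read_ methods.
Try the following code if all of the CSV files have the same columns.
I have added header=0 so that the first row of each CSV file is read and used as the column names.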
import pandas as pd
import glob
import os
path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))  # no leading slash, or os.path.join discards path
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Or, with attribution to a comment from Sid:
all_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
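One detail worth checking in the glob pattern: os.path.join treats a component that begins with a slash as rooted, so the directory that precedes it is dropped and glob may silently match nothing. The sketch below illustrates this with posixpath and ntpath, the two implementations behind os.path, so that it behaves the same on any OS. The directory "/data/raw" is a hypothetical placeholder; the Windows path is the one from the question.

```python
import ntpath      # Windows path rules, importable on any OS
import posixpath   # POSIX path rules, importable on any OS

directory_posix = "/data/raw"                    # hypothetical directory
directory_windows = r"C:\DRO\DCL_rawdata_files"  # path from the question

# A pattern that starts with a slash is treated as rooted,
# so the directory given before it is discarded.
bad_posix = posixpath.join(directory_posix, "/*.csv")
bad_windows = ntpath.join(directory_windows, "/*.csv")

# A bare pattern is appended to the directory as intended.
good_posix = posixpath.join(directory_posix, "*.csv")
good_windows = ntpath.join(directory_windows, "*.csv")

print(bad_posix)     # /*.csv
print(good_posix)    # /data/raw/*.csv
print(bad_windows)   # no longer contains the DCL_rawdata_files directory
print(good_windows)
```

If glob returns an empty list, pd.concat raises "ValueError: No objects to concatenate", so printing the joined pattern is a quick first check.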
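- It's often necessary to identify each sample of data, which can be done by adding a new column to the DataFrame.
- This example uses pathlib from the standard library. It treats paths as objects with methods, instead of strings to be sliced.
Imports and setup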
from pathlib import Path
import pandas as pd
import numpy as np
path = r'C:\DRO\DCL_rawdata_files' # or unix / linux / mac path
# Get the files from the path provided in the OP
files = Path(path).glob('*.csv') # .rglob to get subdirectories
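Note that Path.glob returns a generator, so files can be iterated only once. If more than one of the options below is run in the same session, the later ones would see no files. The following is a minimal sketch of that behaviour, using hypothetical empty files in a temporary directory rather than the path from the question.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical empty files, created only for this demonstration
    Path(tmp, "a.csv").touch()
    Path(tmp, "b.csv").touch()

    files = Path(tmp).glob("*.csv")          # a generator, not a list
    first_pass = [f.name for f in files]
    second_pass = [f.name for f in files]    # the generator is already exhausted

    # Materializing the matches allows repeated iteration
    reusable = sorted(Path(tmp).glob("*.csv"))
    names_once = [f.name for f in reusable]
    names_twice = [f.name for f in reusable]

print(sorted(first_pass))  # ['a.csv', 'b.csv']
print(second_pass)         # []
print(names_twice)         # ['a.csv', 'b.csv']
```

Wrapping the call in list() or sorted() is one way to reuse the matches; sorted() also makes the file order predictable.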
Option 1:
- Add a new column with the file name
dfs = list()
for f in files:
    data = pd.read_csv(f)
    # .stem is a pathlib attribute that returns the filename without the extension
    data['file'] = f.stem
    dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
Option 2:
- Add a new column with a generic name using enumerate
dfs = list()
for i, f in enumerate(files):
    data = pd.read_csv(f)
    data['file'] = f'File {i}'
    dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
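Option 3:
- Create the DataFrames with a list comprehension, then use np.repeat to add a new column.
  - [f'S{i}' for i in range(len(dfs))] creates a list of strings to name each DataFrame.
  - [len(df) for df in dfs] creates a list of lengths.
- Attribution for this option goes to this plotting answer.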
# Read the files into dataframes
dfs = [pd.read_csv(f) for f in files]
# Combine the list of dataframes
df = pd.concat(dfs, ignore_index=True)
# Add a new column
df['Source'] = np.repeat([f'S{i}' for i in range(len(dfs))], [len(df) for df in dfs])
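To see what np.repeat does here, the following sketch applies the same expression to two small in-memory DataFrames. The frames and their values are hypothetical stand-ins for the CSV files; each label is repeated once per row of the corresponding frame, so the new column lines up with the concatenated rows.

```python
import numpy as np
import pandas as pd

# Two small in-memory frames standing in for CSV files (hypothetical data)
dfs = [pd.DataFrame({"x": [1, 2, 3]}), pd.DataFrame({"x": [4, 5]})]

labels = [f"S{i}" for i in range(len(dfs))]   # one label per frame
lengths = [len(d) for d in dfs]               # one row count per frame

df = pd.concat(dfs, ignore_index=True)
# Repeat each label as many times as its frame has rows
df["Source"] = np.repeat(labels, lengths)

print(labels)                   # ['S0', 'S1']
print(lengths)                  # [3, 2]
print(df["Source"].tolist())    # ['S0', 'S0', 'S0', 'S1', 'S1']
```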
Option 4:
df = pd.concat((pd.read_csv(f).assign(filename=f.stem) for f in files), ignore_index=True)
or
df = pd.concat((pd.read_csv(f).assign(Source=f'S{i}') for i, f in enumerate(files)), ignore_index=True)
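The one-liners above can be tried end to end without touching real data. This is a minimal sketch under stated assumptions: the file names jan.csv and feb.csv and their contents are made up for the demonstration, and sorted() is added only to make the row order predictable.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample files, created only for this demonstration
    Path(tmp, "jan.csv").write_text("a,b\n1,2\n3,4\n")
    Path(tmp, "feb.csv").write_text("a,b\n5,6\n")

    files = sorted(Path(tmp).glob("*.csv"))  # feb.csv sorts before jan.csv
    # Read each file, tag its rows with the file stem, and concatenate
    df = pd.concat(
        (pd.read_csv(f).assign(filename=f.stem) for f in files),
        ignore_index=True,
    )

print(df["filename"].tolist())  # ['feb', 'jan', 'jan']
print(df.shape)                 # (3, 3)
```

The files are read inside the with block because the temporary directory is deleted when the block exits.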
Gaurav Singh
2014-01-20
An alternative to darindaCoder's answer:
import glob
import os
import pandas as pd
path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path, "*.csv")) # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
Sid
2016-04-05
import glob
import os
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))
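Unlike the other answers, this one-liner does not pass ignore_index=True, so each file's original row labels are kept and the combined index contains duplicates. Whether that matters depends on the use case. The sketch below shows the difference, using io.StringIO objects with made-up contents as stand-ins for files on disk.

```python
import io

import pandas as pd

# In-memory stand-ins for two CSV files (hypothetical data)
sources = ["a,b\n1,2\n3,4\n", "a,b\n5,6\n"]

# Without ignore_index, each frame keeps its own 0-based row labels
kept = pd.concat(map(pd.read_csv, map(io.StringIO, sources)))

# With ignore_index=True, the result gets a fresh 0..n-1 index
reset = pd.concat(map(pd.read_csv, map(io.StringIO, sources)), ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0]
print(reset.index.tolist())  # [0, 1, 2]
```

With duplicate labels, a lookup such as kept.loc[0] returns one row from each file, which is easy to overlook.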
Jose Antonio Martin H
2017-02-21