Question

Opening a file (csv.gz) that contains an undefined character and passing it to a function

2023-08-13

python csv error-handling unicode decode

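I have a function that takes five file paths as arguments. The first path is a csv.gz file, and there seems to be an undefined character inside it. How can I fix this?

I am using Python 3.11.1. The code and error message are shown below.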

function(r"filepath1", r"filepath2", r"filepath3", r"filepath4", r"filepath5")

Error message:

Cell In[3], line 8, in function(filepath1, filepath2, filepath3, filepath4, filepath5)
 6 file1DateMap = {}
 7 infd = open(file1path1, 'r')
 8 infd.readline()
 9 for line in infd:
10     tokens = line.strip().split(',')
 
File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
 
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 94: character maps to undefined

I tried

file = open(filename, encoding="utf8")

but encoding is not defined in my version of Python.

I also tried the "with open" approach

file2 = r"file2path"
file3 = r"file3path"
file4 = r"file4path"
file5 = r"file5path"
file1name = r"file1path"
with open(file1name, 'r') as file1:
    function(file1, file2, file3, file4, file5)

but the function expects a string:

TypeError: expected str, bytes or os.PathLike object, not TextIOWrapper

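I want the function to run and write the processed output to a folder on my desktop.

Update

I checked the file's encoding in Visual Studio Code, and it shows UTF-8. I wrote the following code: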

with open(r"path1", encoding="utf8") as openfile1:
    file1 = openfile1.read()

and got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Update 2

I checked the encoding with this code

with open(r"filepath1") as f:
    print(f)

encoding='cp1252'

But now when I pass the new encoding argument:

with open(r"path1", encoding="cp1252") as openfile1:
    file1 = openfile1.read()

I am back where I started, with the following error message:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 94: character maps to undefined

Update 3

Gzip works. I used the following code:

import gzip
with gzip.open(r"path1", mode="rb") as openfile1:
    file1 = openfile1.read()

Answer 1

There are a few confusing things in this source code.

with open(file1name, 'r') as file1:
    function(file1, file2, file3, file4, file5)

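Please understand that file1 is an open file handle whose type(...) is TextIOWrapper. It is iterable, and you can request lines of text from it. In contrast, file2 and the rest are str pathnames; you cannot request lines of text from those objects.

The parallel naming you chose for them is likely to confuse you, as well as any hapless maintenance engineer who runs into this code in the coming months. I recommend names like path2 .. path5.

Your default encoding appears to be code page 1252 (cp1252). You requested that encoding with open(file1name, 'r') by omitting the optional encoding= parameter. Note that mode='r' is the default, so you could omit that as well.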
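
As a rough illustration of that point, the sketch below shows which encoding open ends up with when encoding= is omitted and when it is given. It uses a throwaway temporary file, and the default it prints depends on the machine and on how Python is configured, so no particular value is guaranteed.

```python
import locale
import os
import tempfile

# The locale's preferred encoding, which open() generally falls back to
# when encoding= is omitted (for example 'cp1252' on many Windows setups).
default_encoding = locale.getpreferredencoding(False)

# Throwaway empty file, used only to inspect the resulting file objects.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

with open(path) as f:                    # encoding left to the default
    implicit = f.encoding
with open(path, encoding="utf-8") as f:  # encoding stated explicitly
    explicit = f.encoding

os.remove(path)

print("locale default:", default_encoding)
print("open(path):", implicit)
print("open(path, encoding='utf-8'):", explicit)
```

Stating encoding= explicitly makes the result independent of whichever machine runs the code.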

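In contrast, open(filename, encoding="utf8") opens the file for read access using an entirely different encoding.

Encoding is a property of the underlying .CSV file, not of your program. That is, you must know what the correct underlying encoding is, and you must tell open the correct encoding. You can do that by default or explicitly, as long as you get it right. I recommend doing it explicitly.

If you do not know the encoding, use /usr/bin/file, /usr/local/bin/iconv, or a text editor to learn what it is, and, if you are unhappy with the current encoding, perhaps change it to UTF-8.

Most files on most modern machines ought to be UTF-8 encoded; anything else is asking for trouble. But I digress.

Once you have settled on a known encoding, pass it to open via the encoding= parameter, and you are in business!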
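
One more check can help before settling on an encoding. The byte 0x8b at position 1 in the question's UTF-8 error matches the gzip signature (gzip streams begin with the bytes 0x1f 0x8b), which suggests the file is still compressed and that no text encoding will decode it directly. The sketch below tests for that signature; the helper name looks_like_gzip and the demo files are assumptions made for illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def looks_like_gzip(path):
    """Hypothetical helper: True if the file starts with the gzip signature."""
    with open(path, "rb") as f:  # binary mode, so no decoding is attempted
        return f.read(2) == GZIP_MAGIC


# Build two small demo files: one gzip-compressed, one plain text.
tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "demo.csv.gz")
txt_path = os.path.join(tmpdir, "demo.csv")

with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write("date,value\n2023-08-13,1\n")
with open(txt_path, "w", encoding="utf-8") as f:
    f.write("date,value\n2023-08-13,1\n")

print(looks_like_gzip(gz_path))   # True
print(looks_like_gzip(txt_path))  # False
```

If the check returns True, the encoding question applies to the decompressed contents rather than to the raw file.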

Answer 2

If you have a CSV file compressed as a gzip file, you should be able to read the gzip file simply like this:

with gzip.open("input.csv.gz", "rt", newline="", encoding="utf-8") as f:

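I believe you will want rt to read it as text (rather than rb, which returns undecoded bytes), and of course you should pick the file's actual encoding (in my examples I always use utf-8).

To further decode the CSV in the text file f, I recommend the standard library's csv module: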

...
    reader = csv.reader(f)
    for row in reader:
        print(row)
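
Putting the pieces together, here is a minimal end-to-end sketch under a few assumptions: the demo file is created on the fly, the encoding is taken to be UTF-8, and read_rows is a hypothetical helper, not the asker's function. It accepts a path string, as the original function expects, and opens the gzip file itself.

```python
import csv
import gzip
import os
import tempfile

# Create a small gzip-compressed CSV to read back (demo data only).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "input.csv.gz")
with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["date", "name"], ["2023-08-13", "café"]])


def read_rows(gz_path, encoding="utf-8"):
    """Hypothetical helper: take a path (str) and return the rows after the header."""
    with gzip.open(gz_path, "rt", newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, like infd.readline() in the question
        return list(reader)


rows = read_rows(path)
print(rows)  # [['2023-08-13', 'café']]

# For comparison, "rb" yields undecoded bytes instead of str.
with gzip.open(path, "rb") as f:
    raw = f.read()
print(type(raw))  # <class 'bytes'>
```

Keeping the open call inside the helper means callers keep passing plain path strings, which avoids the TypeError shown in the question.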