Question

TypeError: Object of type NAType is not JSON serializable

2020-12-22

python pandas dataframe numpy python-3.8

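Thanks in advance for your help.

My Python code reads a JSON input file, loads the data into a data frame, masks or changes the data frame columns named in the configuration, and finally writes a JSON output file.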

read JSON into data frame --> mask/change the df column --> generate JSON

Input JSON:

[
    {
        "BinLogFilename": "mysql.log",
        "Type": "UPDATE",
        "Table": "users",
        "ServerId": 1,
        "BinLogPosition": 2111
    },
    {
        "BinLogFilename": "mysql.log",
        "Type": "UPDATE",
        "Table": "users",
        "ServerId": null,
        "BinLogPosition": 2111
    },
  ...
]

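When I load the JSON above into a data frame, the "ServerId" column ends up with float values, because it is null in some of the input JSON objects and pandas stores those nulls as NaN, which is a float.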
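The effect can be reproduced with a minimal sketch. The two-record input below is a cut-down, made-up version of the file in the question, and it is passed through StringIO only so the snippet is self-contained:

```python
from io import StringIO

import pandas as pd

# Cut-down stand-in for the input file: the second record has a null ServerId.
raw = (
    '[{"ServerId": 1, "BinLogPosition": 2111},'
    ' {"ServerId": null, "BinLogPosition": 2111}]'
)

df = pd.read_json(StringIO(raw))

# The column with a null is promoted to float so it can hold NaN,
# while the column without nulls stays an integer column.
print(df.dtypes)
```

Only the column that actually contains a null is affected, which is why "BinLogPosition" still comes out as an integer.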

The main logic replaces "ServerId" with another, fake number, but the output contains floats.

Output JSON:

[
    {
        "BinLogFilename": "mysql.log",
        "Type": "UPDATE",
        "Table": "users",
        "ServerId": 5627.0,
        "BinLogPosition": 2111
    },
    {
        "BinLogFilename": "mysql.log",
        "Type": "UPDATE",
        "Table": "users",
        "ServerId": null,
        "BinLogPosition": 2111
    },
    ....
]

Masking logic:

df['ServerId'] = [fake.pyint() if not(pd.isna(df['ServerId'][index])) else np.nan for index in range(len(df['ServerId']))]

The challenge is that the output "ServerId" should contain only integers, but unfortunately it contains floats.

df['ServerId']
0     9590.0
1        NaN
2     1779.0
3     1303.0
4        NaN

I found an answer to this problem: use the "Int64" dtype.

df['ServerId'] = df['ServerId'].astype('Int64')
0     8920
1     <NA>
2     9148
3     2434
4     <NA>
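As a rough sketch, the masking step and the "Int64" conversion can also be done in one pass, so the column never goes through a float stage. Here random.randint is only a stand-in for fake.pyint() (an assumption, to avoid depending on the faker package), and the three-row frame is made up:

```python
import random

import pandas as pd

# Made-up stand-in for the loaded data: nulls have already become NaN.
df = pd.DataFrame({"ServerId": [1.0, None, 3.0]})

# Remember which rows had a real value before masking.
present = df["ServerId"].notna()

# Build a nullable integer array directly: a fake integer where a value
# existed, pd.NA where it was missing.
df["ServerId"] = pd.array(
    [random.randint(0, 9999) if keep else pd.NA for keep in present],
    dtype="Int64",
)

print(df["ServerId"])
```

The missing rows are now pd.NA rather than NaN, so the serialization error described next still has to be dealt with.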

But with "Int64", NaN is converted to NA, and when I write the data back to JSON I get this error:

TypeError: Object of type NAType is not JSON serializable

with gzip.open(outputFile, 'w') as outfile:
    outfile.write(json.dumps(json_objects_list).encode('utf-8'))

Is it possible to keep NaN after converting to the "Int64" dtype? If not, how can I fix the error?

Answer 1

Indeed, pandas NA and NaT cannot be serialized to JSON by Python's built-in json library.

However, the pandas DataFrame to_json() method handles these values for you and converts them to JSON null.

from pandas import DataFrame, Series, NA, NaT

df = DataFrame({"ServerId" : Series([8920, NA, 9148, 2434, NA], dtype="Int64") })
s = df.to_json()

# -> {"ServerId":{"0":8920,"1":null,"2":9148,"3":2434,"4":null}}
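To get the list-of-objects layout from the question rather than the column-keyed layout above, to_json() accepts orient="records". The sketch below also writes the result through gzip in text mode; the two-row frame and the temporary output path are made up for illustration:

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Made-up two-row frame with a nullable integer column.
df = pd.DataFrame({
    "Table": ["users", "users"],
    "ServerId": pd.array([8920, pd.NA], dtype="Int64"),
})

# orient="records" produces a JSON array with one object per row;
# pd.NA is written as null.
payload = df.to_json(orient="records")

# Hypothetical output location, used only so the sketch can run anywhere.
output_path = os.path.join(tempfile.mkdtemp(), "masked.json.gz")

# Text mode ("wt") lets us write the str directly, without .encode().
with gzip.open(output_path, "wt", encoding="utf-8") as outfile:
    outfile.write(payload)

# Read it back to confirm the file holds valid JSON.
with gzip.open(output_path, "rt", encoding="utf-8") as infile:
    round_trip = json.load(infile)

print(round_trip)
```

Since to_json() does the serialization itself, json.dumps() is not needed on the write path at all.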

Answer 2

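This error occurs because some values in your pandas DataFrame are pd.NA, which causes trouble when json.dumps() is called.

One possible solution is to first replace all missing values (None, pd.NaT, numpy.nan, and any other missing-value type) with np.nan, and then replace the latter with None: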

import numpy as np

df = df.fillna(np.nan).replace([np.nan], [None])
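How replace() treats nullable dtypes such as "Int64" may differ between pandas versions, so an alternative sketch is to do the substitution on plain Python records just before calling json.dumps(). The helper name to_jsonable and the two-row frame are made up for illustration:

```python
import json

import numpy as np
import pandas as pd


def to_jsonable(value):
    """Map a single scalar cell to something json.dumps() accepts."""
    if pd.isna(value):  # covers None, NaN, pd.NA and pd.NaT
        return None
    if isinstance(value, np.generic):  # e.g. numpy.int64 -> int
        return value.item()
    return value


# Made-up two-row frame with a nullable integer column.
df = pd.DataFrame({
    "Table": ["users", "users"],
    "ServerId": pd.array([8920, pd.NA], dtype="Int64"),
})

json_objects_list = [
    {key: to_jsonable(value) for key, value in row.items()}
    for row in df.to_dict(orient="records")
]

text = json.dumps(json_objects_list)
print(text)
```

This assumes every cell holds a scalar; cells containing lists or dicts would need their own handling.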

Answer 3

Just add your own custom JSON encoder, like this:

import json
from decimal import Decimal
from pandas._libs.missing import NAType  # private pandas module; the path may change between versions

class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Decimal):
            return str(obj)
        if isinstance(obj, NAType):
            return ""  # pd.NA is written as an empty string, not as JSON null
        # otherwise use the default behavior
        return json.JSONEncoder.default(self, obj)

Then you can pass it to json.dumps(), for example:

print(json.dumps(list(df.T.to_dict().values()), indent=4, cls=MyEncoder))
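If the output should contain null for missing values, as in the question's expected JSON, a variant of the encoder can return None instead of an empty string. This is a sketch only: the class name NullForNAEncoder is made up, and it tests against the public pd.NA singleton rather than importing NAType from a private module:

```python
import json

import numpy as np
import pandas as pd


class NullForNAEncoder(json.JSONEncoder):
    def default(self, obj):
        # Returning None makes the encoder emit JSON null.
        if obj is pd.NA or obj is pd.NaT:
            return None
        # numpy scalars such as numpy.int64 become plain Python values.
        if isinstance(obj, np.generic):
            return obj.item()
        # Anything else falls back to the default behavior (raises TypeError).
        return super().default(obj)


encoded = json.dumps(
    {"ServerId": pd.NA, "BinLogPosition": np.int64(2111)},
    cls=NullForNAEncoder,
)
print(encoded)  # {"ServerId": null, "BinLogPosition": 2111}
```

The encoder is passed to json.dumps() through cls= in the same way as MyEncoder.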