Save additional attributes in Pandas Dataframe
The approach I use is to add additional MultiIndex
levels to store the additional information I want (I use the columns, but either would work). All the columns have the same values for these additional parameters. This is also useful since I can combine multiple dataframes or split out individual columns and these values are preserved.
>>> col=pd.MultiIndex.from_product([['US/Eastern'], ['A', 'B', 'C', 'E', 'D']], names=['local_tz', 'name'])
>>> a = pd.DataFrame(data=pd.np.random.randint(0,100,(10,5)),columns=col)
>>> print(a)
local_tz US/Eastern
name A B C E D
0 38 93 63 24 55
1 21 25 84 98 62
2 4 60 78 0 5
3 26 50 82 89 23
4 32 70 80 90 1
5 6 17 8 60 59
6 95 98 69 19 76
7 71 90 45 45 40
8 94 16 44 60 16
9 53 8 30 4 72
There is an open issue regarding the storage of custom metadata in NDFrames. But due to the multitudinous ways pandas functions may return DataFrames, the _metadata
attribute is not (yet) preserved in all situations.
For the time being, you'll just have to store the metadata in an auxilliary variable.
There are multiple options for storing DataFrames + metadata to files, depending on what format you wish to use -- pickle, JSON, HDF5 are all possibilities.
Here is how you could store and load a DataFrame with metadata using HDF5. The recipe for storing the metadata comes from the Pandas Cookbook.
import numpy as np
import pandas as pd
def h5store(filename, df, **kwargs):
store = pd.HDFStore(filename)
store.put('mydata', df)
store.get_storer('mydata').attrs.metadata = kwargs
store.close()
def h5load(store):
data = store['mydata']
metadata = store.get_storer('mydata').attrs.metadata
return data, metadata
a = pd.DataFrame(
data=pd.np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))
filename = '/tmp/data.h5'
metadata = dict(local_tz='US/Eastern')
h5store(filename, a, **metadata)
with pd.HDFStore(filename) as store:
data, metadata = h5load(store)
print(data)
# A B C E D
# 0 9 20 92 43 25
# 1 2 64 54 0 63
# 2 22 42 3 83 81
# 3 3 71 17 64 53
# 4 52 10 41 22 43
# 5 48 85 96 72 88
# 6 10 47 2 10 78
# 7 30 80 3 59 16
# 8 13 52 98 79 65
# 9 6 93 55 40 3
print(metadata)
yields
{'local_tz': 'US/Eastern'}