Type Annotating Pandas DataFrames
I do this for dataframes in docstrings where it's reasonable. Sometimes it's not reasonable.
:param dataframe: pd.DataFrame [M x (3+N)]
'id': int
ID column
'value': int
Number of things
'color': str
Color of things
Remaining columns are properties; all should be float64s
There's probably a better way to do this, but I haven't found it.
I have tried @Xukrao's method. To have a summary table is really nice.
Also inspired by another question in stackoverflow, to use the csv-table
block is more convenient in terms of modification. Don't have to worry about alignment and "=". For example:
intra_edges (DataFrame): correspondence between intra-edges in
planar graph and in multilayer graph.
.. csv-table::
:header: name, dtype, definition
source_original (index), object, target in planar graph
target_original (index), object, target in planar graph
source, object, current source bus
target, object, current target bus
inter_edges (DataFrame): correspondence between inter-nodes in
planar graph and inter-edges in multilayer graph.
====== ======= ============================ ==========
name dtype definition is_index
====== ======= ============================ ==========
node object name in planar graph True
upper int64 integer index of upper layer False
lower int64 integer index of lower layer False
source object source node in supra graph False
target object target node in supra graph False
====== ======= ============================ ==========
Docstring format
I use the numpy docstring convention as a basis. If a function's input parameter or return parameter is a pandas dataframe with predetermined columns, then I add a reStructuredText-style table with column descriptions to the parameter description. As an example:
def random_dataframe(no_rows):
"""Return dataframe with random data.
Parameters
----------
no_rows : int
Desired number of data rows.
Returns
-------
pd.DataFrame
Dataframe with with randomly selected values. Data columns are as follows:
========== ==============================================================
rand_int randomly chosen whole numbers (as `int`)
rand_float randomly chosen numbers with decimal parts (as `float`)
rand_color randomly chosen colors (as `str`)
rand_bird randomly chosen birds (as `str`)
========== ==============================================================
"""
df = pd.DataFrame({
"rand_int": np.random.randint(0, 100, no_rows),
"rand_float": np.random.rand(no_rows),
"rand_color": np.random.choice(['green', 'red', 'blue', 'yellow'], no_rows),
"rand_bird": np.random.choice(['kiwi', 'duck', 'owl', 'parrot'], no_rows),
})
return df
Bonus: sphinx compatibility
The aforementioned docstring format is compatible with the sphinx autodoc documentation generator. This is how the docstring looks like in HTML documentation that was automatically generated by sphinx (using the nature theme):