Ambiguity in Pandas Dataframe / Numpy Array "axis" definition
Another way to explain:
// Not realistic but ideal for understanding the axis parameter
df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]],
columns=["idx1", "idx2", "idx3", "idx4"],
index=["idx1", "idx2", "idx3"]
)
---------------------------------------1
| idx1 idx2 idx3 idx4
| idx1 1 1 1 1
| idx2 2 2 2 2
| idx3 3 3 3 3
0
About df.drop
(axis means the position)
A: I wanna remove idx3.
B: **Which one**? // typing while waiting response: df.drop("idx3",
A: The one which is on axis 1
B: OK then it is >> df.drop("idx3", axis=1)
// Result
---------------------------------------1
| idx1 idx2 idx4
| idx1 1 1 1
| idx2 2 2 2
| idx3 3 3 3
0
About df.apply
(axis means direction)
A: I wanna apply sum.
B: Which direction? // typing while waiting response: df.apply(lambda x: x.sum(),
A: The one which is on *parallel to axis 0*
B: OK then it is >> df.apply(lambda x: x.sum(), axis=0)
// Result
idx1 6
idx2 6
idx3 6
idx4 6
There are already proper answers, but I give you another example with > 2 dimensions.
The parameter axis
means axis to be changed.
For example, consider that there is a dataframe with dimension a x b x c.
df.mean(axis=1)
returns a dataframe with dimenstion a x 1 x c.df.drop("col4", axis=1)
returns a dataframe with dimension a x (b-1) x c.
Here, axis=1
means the second axis which is b
, so b
value will be changed in these examples.
It's perhaps simplest to remember it as 0=down and 1=across.
This means:
- Use
axis=0
to apply a method down each column, or to the row labels (the index). - Use
axis=1
to apply a method across each row, or to the column labels.
Here's a picture to show the parts of a DataFrame that each axis refers to:
It's also useful to remember that Pandas follows NumPy's use of the word axis
. The usage is explained in NumPy's glossary of terms:
Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]
So, concerning the method in the question, df.mean(axis=1)
, seems to be correctly defined. It takes the mean of entries horizontally across columns, that is, along each individual row. On the other hand, df.mean(axis=0)
would be an operation acting vertically downwards across rows.
Similarly, df.drop(name, axis=1)
refers to an action on column labels, because they intuitively go across the horizontal axis. Specifying axis=0
would make the method act on rows instead.