Pandas 'describe' is not returning summary of all columns
As of pandas 0.15.0, pass include='all' to DataFrame.describe()
to get a summary of all the columns when the dataframe has mixed column types. By default, describe() summarizes only the numerical columns.
Example:
In[1]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'$a': ['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
df.describe(include='all')
Out[1]:
         $a        $b
count     5  5.000000
unique    4       NaN
top       a       NaN
freq      2       NaN
mean    NaN  2.000000
std     NaN  1.581139
min     NaN  0.000000
25%     NaN  1.000000
50%     NaN  2.000000
75%     NaN  3.000000
max     NaN  4.000000
The numerical columns will have NaNs for summary statistics pertaining to objects (strings) and vice versa.
Summarizing only numerical or object columns
- To call describe() on just the numerical columns, use describe(include=[np.number]).
- To call describe() on just the objects (strings), use describe(include=['O']).
In[2]:
df.describe(include=[np.number])
Out[2]:
             $b
count  5.000000
mean   2.000000
std    1.581139
min    0.000000
25%    1.000000
50%    2.000000
75%    3.000000
max    4.000000
In[3]:
df.describe(include=['O'])
Out[3]:
       $a
count   5
unique  4
top     a
freq    2
By default, 'describe()' on a DataFrame summarizes only the numeric columns. If you think you have a numeric variable and it doesn't show up in 'describe()', change the type with:
df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)
You could also create new columns to hold the numeric part of a mixed-type column, or convert strings to numbers using a dictionary and the map() function.
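As a rough sketch of the two conversions mentioned above: the column names, values, and the label-to-number dictionary below are all made up for illustration, so adapt them to your own data.

```python
import pandas as pd

# Hypothetical data: 'size' holds labels, 'price' holds numbers stored as strings.
df = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'small'],
    'price': ['1.5', '3.0', '2.25', '1.75'],
})

# Convert the numeric strings so describe() treats the column as numeric.
df['price'] = df['price'].astype(float)

# Map labels to numbers through a dictionary (an assumed encoding),
# writing the result to a new column and keeping the original.
size_codes = {'small': 1, 'medium': 2, 'large': 3}
df['size_code'] = df['size'].map(size_codes)

# Only the numeric columns ('price' and 'size_code') are summarized by default.
summary = df.describe()
print(summary)
```

Note that map() returns NaN for any label missing from the dictionary, so it is worth checking the count row of the summary afterwards.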
'describe()' on a non-numerical Series will give you some statistics (like count, unique and the most frequently occurring value).
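For instance, a minimal sketch using the same string values as the '$a' column from the earlier example:

```python
import pandas as pd

# A Series of strings (object data), same values as the '$a' column above.
s = pd.Series(['a', 'b', 'c', 'd', 'a'])

# For non-numeric data, describe() reports count, unique, top and freq.
stats = s.describe()
print(stats)
```

Here 'top' is the most frequent value and 'freq' is how many times it occurs.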
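pd.options.display.max_columns = DATA.shape[1]
will work.
Here DATA is the 2D DataFrame being summarized. The line above raises the column display limit to the number of columns in DATA, so describe() prints every column's statistics instead of truncating them.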
In addition to the data type issues discussed in the other answers, you might also have too many columns to display. If there are too many columns, the middle columns will be replaced with an ellipsis (...).
Other answers have pointed out that the include='all' parameter of describe can help with the data type issue. Another question asked, "How do I expand the output display to see more columns?" The solution is to modify the display.max_columns setting, which can even be done temporarily. For example, to display up to 40 columns of output from a single describe statement:
with pd.option_context('display.max_columns', 40):
    print(df.describe(include='all'))
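To check that the change really is temporary, here is a small sketch that reads the option before, inside, and after the with block. The variable names are only for illustration.

```python
import pandas as pd

# Read the current limit, whatever it happens to be in this session.
before = pd.get_option('display.max_columns')

with pd.option_context('display.max_columns', 40):
    # Inside the block the limit is 40.
    inside = pd.get_option('display.max_columns')

# After the block the previous value is restored automatically.
after = pd.get_option('display.max_columns')
print(before, inside, after)
```

If you want the change to persist for the whole session instead, pd.set_option('display.max_columns', 40) sets it without a context manager.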