How do I select an element in an array column of a DataFrame?

Storing lists as values in a Pandas DataFrame tends to be a mistake because it prevents you from taking advantage of fast NumPy or Pandas vectorized operations.

Therefore, you might be better off converting your DataFrame of lists of numbers into a wider DataFrame with native NumPy dtypes:

import numpy as np
import pandas as pd

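# dtype=object is needed for ragged (unequal-length) lists; NumPy >= 1.24 raises ValueError without it
pa = pd.DataFrame({'a': np.array([[1., 4.], [2.], [3., 4., 5.]], dtype=object)})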
df = pd.DataFrame(pa['a'].values.tolist())
#      0    1    2
# 0  1.0  4.0  NaN
# 1  2.0  NaN  NaN
# 2  3.0  4.0  5.0

Now, you could select the first column like this:

In [36]: df.iloc[:, 0]
Out[36]: 
0    1.0
1    2.0
2    3.0
Name: 0, dtype: float64

or the first row like this:

In [37]: df.iloc[0, :]
Out[37]: 
0    1.0
1    4.0
2    NaN
Name: 0, dtype: float64

If you wish to drop NaNs, use .dropna():

In [38]: df.iloc[0, :].dropna()
Out[38]: 
0    1.0
1    4.0
Name: 0, dtype: float64

and .tolist() to retrieve the values as a list:

In [39]: df.iloc[0, :].dropna().tolist()
Out[39]: [1.0, 4.0]

However, if you wish to leverage NumPy/Pandas for speed, you'll want to express your calculation as vectorized operations on df itself, without converting back to Python lists.
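As a rough illustration of staying inside df, here is a minimal sketch of a few per-row reductions computed directly on the widened frame. The particular reductions (sum, count, mean) are only assumed examples of "your calculation"; the names row_sums, row_counts and row_means are made up for this sketch.

```python
import pandas as pd

# Same widened frame as above: short rows are padded with NaN.
df = pd.DataFrame([[1., 4.], [2.], [3., 4., 5.]])

# Reductions along axis=1 work per original list and skip NaN by default.
row_sums = df.sum(axis=1)      # 5.0, 2.0, 12.0
row_counts = df.count(axis=1)  # 2, 1, 3 (number of non-NaN values per row)
row_means = df.mean(axis=1)    # 2.5, 2.0, 4.0

print(row_sums.tolist(), row_counts.tolist(), row_means.tolist())
```

Because NaN is skipped, the padding does not distort the results, and no Python-level loop over lists is involved.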

pa.loc[row] selects the row with label row.

pa.loc[row, col] selects the cells at the intersection of row and col.

pa.loc[:, col] selects all rows of the column named col. Note that although this works, it is not the idiomatic way to refer to a single column of a DataFrame. For that you should use pa[col], for example pa['a'].
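To make the three .loc forms concrete, here is a small sketch applied to the frame from the question. It builds pa from a plain Python list of lists (an assumption made to keep the snippet short), and the variable names row, cell, col and same are just illustrative.

```python
import pandas as pd

pa = pd.DataFrame({'a': [[1., 4.], [2.], [3., 4., 5.]]})

row = pa.loc[0]        # the row labelled 0, as a Series indexed by column name
cell = pa.loc[0, 'a']  # the single cell at row 0, column 'a': a Python list
col = pa.loc[:, 'a']   # every row of column 'a'
same = pa['a']         # the idiomatic way to get the same column

print(cell)            # [1.0, 4.0]
print(cell[1])         # ordinary list indexing once you hold the cell: 4.0
print(col.equals(same))
```

Once a single cell is selected, it is an ordinary list, so an individual element can be reached with normal indexing such as cell[1].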

Because the cells of your column hold lists, you can use the .str accessor, which indexes into lists as well as strings, to access the elements of those lists like so:

pa['a'].str[0]   # first value in each list
pa['a'].str[-1]  # last value in each list
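For reference, a minimal sketch of what these calls return on the example data, including a position that not every list has. The frame is rebuilt here from a plain list of lists so the snippet runs on its own; first, last and second are illustrative names.

```python
import pandas as pd

pa = pd.DataFrame({'a': [[1., 4.], [2.], [3., 4., 5.]]})

first = pa['a'].str[0]    # 1.0, 2.0, 3.0
last = pa['a'].str[-1]    # 4.0, 2.0, 5.0
second = pa['a'].str[1]   # 4.0, NaN, 4.0 (the second list has only one element)

print(first.tolist())
print(last.tolist())
print(second.tolist())
```

Lists that are too short for the requested position yield NaN rather than raising an IndexError, which is worth keeping in mind before doing arithmetic on the result.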
