How to convert a Series of arrays into a single matrix in pandas/numpy?
Another way is to extract the values of your series and use numpy.stack on them.
np.stack(s.values)
PS. I've run into similar situations often.
For pandas >= 0.24, you can also use np.stack(s.to_numpy()) or np.concatenate(s.to_numpy()), depending on your requirement.
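As a minimal sketch of how the two calls differ (the Series `s` below is made-up sample data, not taken from the question): np.stack adds a new leading axis, so you get one row per element, while np.concatenate joins the arrays end to end into a single 1-D array.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: an object-dtype Series whose elements
# are equal-length 1-D arrays.
s = pd.Series([np.array([1, 2, 3]), np.array([2, 3, 4]), np.array([3, 4, 5])])

stacked = np.stack(s.to_numpy())     # one row per element: 2-D result
flat = np.concatenate(s.to_numpy())  # arrays joined end to end: 1-D result

print(stacked.shape)
print(flat.shape)
```

So np.stack is probably what you want for a matrix, and np.concatenate only if you want everything flattened into one long vector.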
If, for some reason, you have found yourself with that abomination of a Series, getting it back into the sort of matrix or array you want is straightforward:
In [16]: s
Out[16]:
0     [1, 2, 3]
1     [2, 3, 4]
2     [3, 4, 5]
3     [2, 3, 4]
4     [3, 4, 5]
5     [2, 3, 4]
6     [3, 4, 5]
7     [2, 3, 4]
8     [3, 4, 5]
9     [2, 3, 4]
10    [3, 4, 5]
dtype: object
In [17]: sm = np.array(s.tolist())
In [18]: sm
Out[18]:
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [2, 3, 4],
       [3, 4, 5],
       [2, 3, 4],
       [3, 4, 5],
       [2, 3, 4],
       [3, 4, 5],
       [2, 3, 4],
       [3, 4, 5]])
In [19]: sm.shape
Out[19]: (11, 3)
But unless it's something you can't change, having that Series makes little sense to begin with.
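If you do control how the data is stored, one possible alternative (my assumption about what might fit, not something the question asked for) is to keep each component in its own column, so the data stays numeric instead of object dtype. A rough sketch with made-up sample data:

```python
import pandas as pd

# Hypothetical sample data: a Series holding one list per cell.
s = pd.Series([[1, 2, 3], [2, 3, 4], [3, 4, 5]])

# Expand to one column per component, keeping the original index.
df = pd.DataFrame(s.tolist(), index=s.index)

# The columns are now plain integer columns, so this is a numeric 2-D array.
matrix = df.to_numpy()

print(df.dtypes.tolist())
print(matrix.shape)
```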
I tested the above methods on 5793 100-dimensional vectors. The old method, converting to a list first, is the fastest.
%time print(np.stack(df.features.values).shape)
%time print(np.stack(df.features.to_numpy()).shape)
%time print(np.array(df.features.tolist()).shape)
%time print(np.array(list(df.features)).shape)
Result
(5793, 100)
CPU times: user 11.7 ms, sys: 3.42 ms, total: 15.1 ms
Wall time: 22.7 ms
(5793, 100)
CPU times: user 11.1 ms, sys: 137 µs, total: 11.3 ms
Wall time: 11.9 ms
(5793, 100)
CPU times: user 5.96 ms, sys: 0 ns, total: 5.96 ms
Wall time: 6.91 ms
(5793, 100)
CPU times: user 5.74 ms, sys: 0 ns, total: 5.74 ms
Wall time: 6.43 ms
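If you want to repeat this comparison yourself, here is a rough sketch using timeit instead of %time, so it also runs outside IPython. The random `features` Series is only a stand-in for df.features (same shape, made-up values), and the timings will depend on your machine and on your pandas/numpy versions, so don't expect the same numbers or necessarily the same ranking.

```python
import timeit

import numpy as np
import pandas as pd

# Stand-in for df.features: 5793 random 100-dimensional vectors,
# stored as an object-dtype Series with one array per cell.
rng = np.random.default_rng(0)
features = pd.Series(list(rng.random((5793, 100))))

candidates = {
    "np.stack(values)": lambda: np.stack(features.values),
    "np.stack(to_numpy())": lambda: np.stack(features.to_numpy()),
    "np.array(tolist())": lambda: np.array(features.tolist()),
    "np.array(list(...))": lambda: np.array(list(features)),
}

# Run each method once to check that they all build the same matrix.
results = {name: fn() for name, fn in candidates.items()}

for name, fn in candidates.items():
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{name:24s} {results[name].shape} {seconds * 1e3:.2f} ms")
```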