How to keep column names when converting from pandas to numpy
Create an example:
import pandas
import numpy
PandasTable = pandas.DataFrame( {
"AAA": [4, 5, 6, 7],
"BBB": [10, 20, 30, 40],
"CCC": [100, 50, -30, -50],
"DDD": ['asdf1', 'asdf2', 'asdf3', 'asdf4'] } )
Solve the problem noting that we are creating something called a "structured numpy array":
NumpyDtypes = list( PandasTable.dtypes.items() )
NumpyTable = PandasTable.to_numpy(copy=True)
NumpyTableRows = [ tuple(Row) for Row in NumpyTable]
NumpyTableWithHeaders = numpy.array( NumpyTableRows, dtype=NumpyDtypes )
Rewrite the solution in 1 line of code:
NumpyTableWithHeaders2 = numpy.array( [ tuple(Row) for Row in PandasTable.to_numpy(copy=True)], dtype=list( PandasTable.dtypes.items() ) )
Print out the solution results:
print ('NumpyTableWithHeaders', NumpyTableWithHeaders)
print ('NumpyTableWithHeaders.dtype', NumpyTableWithHeaders.dtype)
print ('NumpyTableWithHeaders2', NumpyTableWithHeaders2)
print ('NumpyTableWithHeaders2.dtype', NumpyTableWithHeaders2.dtype)
NumpyTableWithHeaders [(4, 10, 100, 'asdf1') (5, 20, 50, 'asdf2') (6, 30, -30, 'asdf3') (7, 40, -50, 'asdf4')] NumpyTableWithHeaders.dtype [('AAA', '<i8'), ('BBB', '<i8'), ('CCC', '<i8'), ('DDD', 'O')] NumpyTableWithHeaders2 [(4, 10, 100, 'asdf1') (5, 20, 50, 'asdf2') (6, 30, -30, 'asdf3') (7, 40, -50, 'asdf4')] NumpyTableWithHeaders2.dtype [('AAA', '<i8'), ('BBB', '<i8'), ('CCC', '<i8'), ('DDD', 'O')]
Documentation I had to read
Adding row/column headers to NumPy arrays
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
How to keep column names when converting from pandas to numpy
https://numpy.org/doc/stable/user/basics.creation.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
https://docs.scipy.org/doc/numpy-1.10.1/user/basics.rec.html
Notes and thoughts: Pandas should add a flag in their 'to_numpy' function which does this. Recent version Numpy documentation should be updated to include structured arrays, which behave differently than regular ones.
Pandas dataframe also has a handy to_records
method. Demo:
X = pd.DataFrame(dict(age=[40., 50., 60.],
sys_blood_pressure=[140.,150.,160.]))
m = X.to_records(index=False)
print repr(m)
Returns:
rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)],
dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])
This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age
in addition to m['age']
.
You can pass this to a cython function as a regular float array by constructing a view:
m_float = m.view(float).reshape(m.shape + (-1,))
print repr(m_float)
Which gives:
rec.array([[ 40., 140.],
[ 50., 150.],
[ 60., 160.]],
dtype=float64)
Note in order for this to work, the original Dataframe must have a float dtype for every column. To make sure use m = X.astype(float, copy=False).to_records(index=False)
.
Yet more methods of converting a pandas.DataFrame
to numpy.array
while preserving label/column names
This is mainly for demonstrating how to set
dtype
/column_dtypes
, because sometimes a data source iterator's output'll need some pre-normalization.
Method one inserts by column into a zeroed array of predefined height and is loosely based on a Creating Structured Arrays guide that just a bit of web-crawling turned up
import numpy
def to_tensor(dataframe, columns = [], dtypes = {}):
# Use all columns from data frame if none where listed when called
if len(columns) <= 0:
columns = dataframe.columns
# Build list of dtypes to use, updating from any `dtypes` passed when called
dtype_list = []
for column in columns:
if column not in dtypes.keys():
dtype_list.append(dataframe[column].dtype)
else:
dtype_list.append(dtypes[column])
# Build dictionary with lists of column names and formatting in the same order
dtype_dict = {
'names': columns,
'formats': dtype_list
}
# Initialize _mostly_ empty nupy array with column names and formatting
numpy_buffer = numpy.zeros(
shape = len(dataframe),
dtype = dtype_dict)
# Insert values from dataframe columns into numpy labels
for column in columns:
numpy_buffer[column] = dataframe[column].to_numpy()
# Return results of conversion
return numpy_buffer
Method two is based on user7138814's answer and will likely be more efficient as it is basically a wrapper for the built in to_records
method available to pandas.DataFrame
s
def to_tensor(dataframe, columns = [], dtypes = {}, index = False):
to_records_kwargs = {'index': index}
if not columns: # Default to all `dataframe.columns`
columns = dataframe.columns
if dtypes: # Pull in modifications only for dtypes listed in `columns`
to_records_kwargs['column_dtypes'] = {}
for column in dtypes.keys():
if column in columns:
to_records_kwargs['column_dtypes'].update({column: dtypes.get(column)})
return dataframe[columns].to_records(**to_records_kwargs)
With either of the above one could do...
X = pandas.DataFrame(dict(age = [40., 50., 60.], sys_blood_pressure = [140., 150., 160.]))
# Example of overwriting dtype for a column
X_tensor = to_tensor(X, dtypes = {'age': 'int32'})
print("Ages -> {0}".format(X_tensor['age']))
print("SBPs -> {0}".format(X_tensor['sys_blood_pressure']))
... which should output...
Ages -> array([40, 50, 60])
SBPs -> array([140., 150., 160.])
... and a full dump of X_tensor
should look like the following.
array([(40, 140.), (50, 150.), (60, 160.)],
dtype=[('age', '<i4'), ('sys_blood_pressure', '<f8')])
Some thoughts
While method two will likely be more efficient than the first, method one (with some modifications) may be more useful for merging two or more pandas.DataFrame
s into one numpy.array
.
Additionally (after swinging back through to review), method one will likely face-plant as it's written with errors about to_records_kwargs
not being a mapping if dtypes
is not defined, next time I'm feeling Pythonic I may resolve that with an else
condition.
Consider a DF
as shown below:
X = pd.DataFrame(dict(one=['Strawberry', 'Fields', 'Forever'], two=[1,2,3]))
X
Provide a list of tuples as data input to the structured array:
arr_ip = [tuple(i) for i in X.as_matrix()]
Ordered list of field names:
dtyp = np.dtype(list(zip(X.dtypes.index, X.dtypes)))
Here, X.dtypes.index
gives you the column names and X.dtypes
it's corresponding dtypes which are unified again into a list of tuples and fed as input to the dtype elements to be constructed.
arr = np.array(arr_ip, dtype=dtyp)
gives:
arr
# array([('Strawberry', 1), ('Fields', 2), ('Forever', 3)],
# dtype=[('one', 'O'), ('two', '<i8')])
and
arr.dtype.names
# ('one', 'two')