Best way to insert a Python NumPy array into a PostgreSQL database
Not sure if this is what you are after, but assuming you have read/write access to an existing postgres DB:
import numpy as np
import psycopg2 as psy
import pickle

db_connect_kwargs = {
    'dbname': '<YOUR_DBNAME>',
    'user': '<YOUR_USRNAME>',
    'password': '<YOUR_PWD>',
    'host': '<HOST>',
    'port': '<PORT>'
}

connection = psy.connect(**db_connect_kwargs)
connection.set_session(autocommit=True)  # every statement is committed immediately
cursor = connection.cursor()

# One table for all arrays: a text key plus the raw bytes.
cursor.execute(
    """
    DROP TABLE IF EXISTS numpy_arrays;
    CREATE TABLE numpy_arrays (
        uuid VARCHAR PRIMARY KEY,
        np_array_bytes BYTEA
    )
    """
)
The gist of this approach is to store any NumPy array (of arbitrary shape and data type) as a row in the numpy_arrays table, where uuid is a unique identifier that lets you retrieve the array later. The actual array is saved in the np_array_bytes column as bytes.
Inserting into the database:
some_array = np.random.rand(1500, 550)
some_array_uuid = 'some_array'

# pickle.dumps serializes the array (shape, dtype and data) into a bytes object,
# which psycopg2 adapts to BYTEA.
cursor.execute(
    """
    INSERT INTO numpy_arrays (uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    (some_array_uuid, pickle.dumps(some_array))
)
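If you would rather pass arrays as query parameters directly instead of calling pickle.dumps at every call site, psycopg2 lets you register an adapter for np.ndarray. A minimal sketch (the adapter function name is mine):

from psycopg2.extensions import register_adapter
from psycopg2 import Binary

def adapt_ndarray(arr):
    # Pickle the array and wrap it so psycopg2 sends it as BYTEA.
    return Binary(pickle.dumps(arr))

register_adapter(np.ndarray, adapt_ndarray)

# Arrays can now be used as query parameters directly:
cursor.execute(
    """
    INSERT INTO numpy_arrays (uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    ('adapted_array', np.random.rand(10, 10))
)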
Querying from the database:
uuid = 'some_array'

cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid = %s
    """,
    (uuid,)
)

# Note: fetchone() returns None if no row matches the uuid, and psycopg2 returns
# the BYTEA column as a memoryview, which pickle.loads accepts.
some_array = pickle.loads(cursor.fetchone()[0])
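If you prefer not to rely on pickle (unpickling untrusted bytes can execute arbitrary code), the same table works with np.save / np.load writing to an in-memory buffer instead. A sketch, with hypothetical helper names:

import io

def array_to_bytes(arr):
    # Serialize shape, dtype and data in the .npy format, without pickle.
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()

def bytes_to_array(raw):
    # `raw` may be a memoryview coming from psycopg2; BytesIO accepts bytes-like objects.
    return np.load(io.BytesIO(raw), allow_pickle=False)

Usage is the same as above: pass array_to_bytes(some_array) to the INSERT and run bytes_to_array on the fetched column.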
Performance?
"If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost."

I haven't benchmarked this approach in any way, so I can neither confirm nor refute this; if performance matters for your use case, you can time the round trip yourself with something like the sketch below.
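A minimal timing sketch, assuming the table from above already exists (the numbers will depend entirely on your hardware, network and array sizes):

import time

arr = np.random.rand(1500, 550)

t0 = time.perf_counter()
cursor.execute(
    "INSERT INTO numpy_arrays (uuid, np_array_bytes) VALUES (%s, %s)",
    ('bench_array', pickle.dumps(arr))
)
insert_seconds = time.perf_counter() - t0

t0 = time.perf_counter()
cursor.execute(
    "SELECT np_array_bytes FROM numpy_arrays WHERE uuid = %s",
    ('bench_array',)
)
restored = pickle.loads(cursor.fetchone()[0])
fetch_seconds = time.perf_counter() - t0

print(f"insert: {insert_seconds:.3f} s, fetch: {fetch_seconds:.3f} s")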
Disk Space?
My guess is that this approach takes roughly as much disk space as dumping the array to a file with np.save('some_array.npy', some_array). If that is a concern, consider compressing the bytes before insertion, for example as in the sketch below.
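A sketch using zlib from the standard library; whether this actually saves space depends on how compressible the data is (random floats compress poorly, while sparse or repetitive arrays compress well):

import zlib

# Compress the pickled bytes before storing them.
cursor.execute(
    """
    INSERT INTO numpy_arrays (uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    ('some_array_compressed', zlib.compress(pickle.dumps(some_array)))
)

# Decompress after fetching, then unpickle.
cursor.execute(
    "SELECT np_array_bytes FROM numpy_arrays WHERE uuid = %s",
    ('some_array_compressed',)
)
restored = pickle.loads(zlib.decompress(cursor.fetchone()[0]))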