Best way to insert a Python NumPy array into a PostgreSQL database
Not sure if this is what you are after, but assuming you have read/write access to an existing postgres DB:
import numpy as np
import psycopg2 as psy
import pickle

db_connect_kwargs = {
    'dbname': '<YOUR_DBNAME>',
    'user': '<YOUR_USRNAME>',
    'password': '<YOUR_PWD>',
    'host': '<HOST>',
    'port': '<PORT>'
}

connection = psy.connect(**db_connect_kwargs)
connection.set_session(autocommit=True)  # every statement is committed immediately
cursor = connection.cursor()

# One table for all arrays: a text key plus the raw bytes.
cursor.execute(
    """
    DROP TABLE IF EXISTS numpy_arrays;
    CREATE TABLE numpy_arrays (
        uuid VARCHAR PRIMARY KEY,
        np_array_bytes BYTEA
    )
    """
)
The gist of this approach is to store any NumPy array (of arbitrary shape and data type) as a row in the numpy_arrays table, where uuid is a unique identifier that lets you retrieve the array later. The actual array is saved in the np_array_bytes column as bytes.
Inserting into the database:
some_array = np.random.rand(1500, 550)
some_array_uuid = 'some_array'

# pickle.dumps serializes the array (shape, dtype and data) into a bytes object,
# which psycopg2 adapts to BYTEA.
cursor.execute(
    """
    INSERT INTO numpy_arrays (uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    (some_array_uuid, pickle.dumps(some_array))
)
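If you would rather pass arrays as query parameters directly instead of calling pickle.dumps at every call site, psycopg2 lets you register an adapter for np.ndarray. A minimal sketch (the adapter function name is mine):

from psycopg2.extensions import register_adapter
from psycopg2 import Binary

def adapt_ndarray(arr):
    # Pickle the array and wrap it so psycopg2 sends it as BYTEA.
    return Binary(pickle.dumps(arr))

register_adapter(np.ndarray, adapt_ndarray)

# Arrays can now be used as query parameters directly:
cursor.execute(
    """
    INSERT INTO numpy_arrays (uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    ('adapted_array', np.random.rand(10, 10))
)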
Querying from the database:
uuid = 'some_array'

cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid = %s
    """,
    (uuid,)
)

# Note: fetchone() returns None if no row matches the uuid, and psycopg2 returns
# the BYTEA column as a memoryview, which pickle.loads accepts.
some_array = pickle.loads(cursor.fetchone()[0])
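If you prefer not to rely on pickle (unpickling untrusted bytes can execute arbitrary code), the same table works with np.save / np.load writing to an in-memory buffer instead. A sketch, with hypothetical helper names:

import io

def array_to_bytes(arr):
    # Serialize shape, dtype and data in the .npy format, without pickle.
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()

def bytes_to_array(raw):
    # `raw` may be a memoryview coming from psycopg2; BytesIO accepts bytes-like objects.
    return np.load(io.BytesIO(raw), allow_pickle=False)

Usage is the same as above: pass array_to_bytes(some_array) to the INSERT and run bytes_to_array on the fetched column.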
Performance?
"If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost."

I haven't benchmarked this approach in any way, so I can neither confirm nor refute this; if performance matters for your use case, you can time the round trip yourself with something like the sketch below.
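A minimal timing sketch, assuming the table from above already exists (the numbers will depend entirely on your hardware, network and array sizes):

import time

arr = np.random.rand(1500, 550)

t0 = time.perf_counter()
cursor.execute(
    "INSERT INTO numpy_arrays (uuid, np_array_bytes) VALUES (%s, %s)",
    ('bench_array', pickle.dumps(arr))
)
insert_seconds = time.perf_counter() - t0

t0 = time.perf_counter()
cursor.execute(
    "SELECT np_array_bytes FROM numpy_arrays WHERE uuid = %s",
    ('bench_array',)
)
restored = pickle.loads(cursor.fetchone()[0])
fetch_seconds = time.perf_counter() - t0

print(f"insert: {insert_seconds:.3f} s, fetch: {fetch_seconds:.3f} s")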
Disk Space?
My guess is that this approach takes roughly as much disk space as dumping the array to a file with np.save('some_array.npy', some_array). If that is a concern, consider compressing the bytes before insertion, for example as in the sketch below.
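A sketch using zlib from the standard library; whether this actually saves space depends on how compressible the data is (random floats compress poorly, while sparse or repetitive arrays compress well):

import zlib

# Compress the pickled bytes before storing them.
cursor.execute(
    """
    INSERT INTO numpy_arrays (uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    ('some_array_compressed', zlib.compress(pickle.dumps(some_array)))
)

# Decompress after fetching, then unpickle.
cursor.execute(
    "SELECT np_array_bytes FROM numpy_arrays WHERE uuid = %s",
    ('some_array_compressed',)
)
restored = pickle.loads(zlib.decompress(cursor.fetchone()[0]))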