Speed up to_sql() when writing a Pandas DataFrame to an Oracle database using SQLAlchemy and cx_Oracle
Pandas + SQLAlchemy by default save all `object` (string) columns as CLOB in Oracle DB, which makes insertion extremely slow.
Here are some tests:
import pandas as pd
import cx_Oracle
from sqlalchemy import types, create_engine
#######################################################
### DB connection strings config
#######################################################
tns = """
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = my-db-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = my_service_name)
    )
  )
"""
usr = "test"
pwd = "my_oracle_password"
engine = create_engine('oracle+cx_oracle://%s:%s@%s' % (usr, pwd, tns))
# sample DF [shape: `(2000, 11)`]
# I took your 2-row DF and replicated it: `df = pd.concat([df] * 10**3, ignore_index=True)`
df = pd.read_csv('/path/to/file.csv')
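If you don't have a CSV at hand, here is a minimal sketch that builds a DF with the same shape and dtypes (column names and dtypes are taken from the `df.info()` output below; the values themselves are made up):

import numpy as np

df = pd.DataFrame({
    'id': [1, 2],
    'name': ['first name', 'longest name!'],
    'premium': [1.5, 2.5],
    'created_date': pd.to_datetime(['2018-01-01', '2018-01-02']),
    'init_p': [10.0, 20.0],
    'term_number': [12, 24],
    'uprate': [1.1, np.nan],    # one NaN per pair -> 1000 non-null after replication
    'value': [100, 200],
    'score': [0.5, 0.7],
    'group': [1, 2],
    'action_reason': ['approved', 'declined'],
})
df = pd.concat([df] * 10**3, ignore_index=True)    # shape: (2000, 11)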
DF info:
In [61]: df.shape
Out[61]: (2000, 11)
In [62]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
id 2000 non-null int64
name 2000 non-null object
premium 2000 non-null float64
created_date 2000 non-null datetime64[ns]
init_p 2000 non-null float64
term_number 2000 non-null int64
uprate 1000 non-null float64
value 2000 non-null int64
score 2000 non-null float64
group 2000 non-null int64
action_reason 2000 non-null object
dtypes: datetime64[ns](1), float64(4), int64(4), object(2)
memory usage: 172.0+ KB
Let's check how long it will take to store it in the Oracle DB:
In [57]: df.shape
Out[57]: (2000, 11)
In [58]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace')
1 loop, best of 1: 16 s per loop
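By the way, you can preview the DDL Pandas is going to emit without writing anything to the DB — a quick sketch using the semi-private `pd.io.sql.get_schema` helper (for this DF it should show CLOB for both `object` columns):

print(pd.io.sql.get_schema(df, 'test_table', con=engine))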
In Oracle DB (note the CLOBs):
AAA> desc test.test_table
Name Null? Type
------------------------------- -------- ------------------
ID NUMBER(19)
NAME CLOB # !!!
PREMIUM FLOAT(126)
CREATED_DATE DATE
INIT_P FLOAT(126)
TERM_NUMBER NUMBER(19)
UPRATE FLOAT(126)
VALUE NUMBER(19)
SCORE FLOAT(126)
group NUMBER(19)
ACTION_REASON CLOB # !!!
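As a side note, you can also verify the column data types from Python instead of SQL*Plus — a sketch querying Oracle's `user_tab_columns` dictionary view (this assumes the table was created in the schema you are connected as):

qry = """
    SELECT column_name, data_type, char_length
    FROM user_tab_columns
    WHERE table_name = 'TEST_TABLE'
"""
print(pd.read_sql(qry, engine))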
Now let's instruct Pandas to save all `object` columns as the VARCHAR data type:
In [59]: dtyp = {c:types.VARCHAR(df[c].str.len().max())
...: for c in df.columns[df.dtypes == 'object'].tolist()}
...:
In [60]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp)
1 loop, best of 1: 335 ms per loop
This time it was approx. 48 times faster.
Check in Oracle DB:
AAA> desc test.test_table
Name Null? Type
----------------------------- -------- ---------------------
ID NUMBER(19)
NAME VARCHAR2(13 CHAR) # !!!
PREMIUM FLOAT(126)
CREATED_DATE DATE
INIT_P FLOAT(126)
TERM_NUMBER NUMBER(19)
UPRATE FLOAT(126)
VALUE NUMBER(19)
SCORE FLOAT(126)
group NUMBER(19)
ACTION_REASON VARCHAR2(8 CHAR) # !!!
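One caveat with the `dtyp` one-liner above: if an `object` column contains NaNs, `.str.len().max()` returns a float (or NaN for an all-NaN column), which can end up as invalid DDL like `VARCHAR(8.0)`. A slightly more defensive sketch, assuming a fallback width of 255 for all-NaN columns:

dtyp = {}
for c in df.columns[df.dtypes == 'object']:
    max_len = df[c].str.len().max()
    # int() drops the float that .max() returns when NaNs are present;
    # fall back to an arbitrary default width for all-NaN columns
    dtyp[c] = types.VARCHAR(int(max_len) if pd.notna(max_len) else 255)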
Let's test it with a 200,000-row DF:
In [69]: df.shape
Out[69]: (200000, 11)
In [70]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp, chunksize=10**4)
1 loop, best of 1: 4.68 s per loop
It took ~5 seconds for a 200K-row DF in my test (not the fastest) environment; `chunksize=10**4` writes the DF in batches of 10,000 rows instead of binding the whole DF in a single statement.
Conclusion: use the following trick to explicitly specify a `dtype` for all DF columns of `object` dtype when saving DataFrames to Oracle DB. Otherwise they will be saved as the CLOB data type, which requires special treatment and makes insertion very slow:
dtyp = {c:types.VARCHAR(df[c].str.len().max())
for c in df.columns[df.dtypes == 'object'].tolist()}
df.to_sql(..., dtype=dtyp)
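If you do this often, it may be worth wrapping the trick into a small helper — a sketch (the name `to_oracle` is made up):

def to_oracle(df, table_name, engine, **kwargs):
    """df.to_sql() wrapper that maps all object columns to VARCHAR."""
    dtyp = {c: types.VARCHAR(df[c].str.len().max())
            for c in df.columns[df.dtypes == 'object']}
    df.to_sql(table_name, engine, dtype=dtyp, **kwargs)

to_oracle(df, 'test_table', engine, index=False, if_exists='replace', chunksize=10**4)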