Performance issue while reading data from Hive using Python
I tried multiprocessing and reduced the run time from 2 hours to 8-10 minutes. The scripts are below.
from multiprocessing import Pool
import pandas as pd
import datetime
from query import hivetable
from write_tosql import write_to_sql
lst=[]
#we have 351k rows, so generate the start row of each 10,000-row chunk for hivetable
for i in range(1,360000,10000):
    lst.append(i)
print 'started reading ',datetime.datetime.now()
#we have 40 cores in cluster
p = Pool(37)
#each worker reads one chunk and returns it as a DataFrame
s=p.map(hivetable, lst)
s_df=pd.concat(s)
print 'finished reading ',datetime.datetime.now()
print 'Started writing to sql server ',datetime.datetime.now()
write_to_sql(s_df)
print 'Finished writing to sql server ',datetime.datetime.now()
---------query.py file-------
import pyodbc
import pandas as pd
conn = pyodbc.connect("DSN=******",autocommit=True)
def hivetable(row):
    #read rows row..row+9999, numbered in policynumber order
    query = 'select * from (select row_number() OVER (order by policynumber) as rownum, * from dbg.tble ) tbl1 where rownum between '+str(row) +' and '+str(row+9999)+';'
    result = pd.read_sql(query,conn)
    return result
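The windowing that hivetable relies on can be sketched on its own, without a Hive connection. This is only an illustration: it assumes a total of 351,000 rows (the post says "351k") and the 10,000-row chunk size from the script, and the names row_windows and window_query are made up for the example.

```python
def row_windows(total_rows, chunk_size):
    """Return inclusive (start, end) row-number pairs covering 1..total_rows."""
    windows = []
    for start in range(1, total_rows + 1, chunk_size):
        end = min(start + chunk_size - 1, total_rows)
        windows.append((start, end))
    return windows


def window_query(start, end):
    """Build the same row_number() query as hivetable for one window."""
    return (
        "select * from ("
        "select row_number() OVER (order by policynumber) as rownum, * "
        "from dbg.tble ) tbl1 "
        "where rownum between {0} and {1};".format(start, end)
    )


# Assumed row count; the post only says "351k rows".
windows = row_windows(351000, 10000)
queries = [window_query(start, end) for start, end in windows]
```

With these assumptions there are 36 windows, the same number of start rows that range(1,360000,10000) produces, so a pool of 37 workers would leave one worker without a chunk.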
---------write_tosql.py file---------
import sqlalchemy
import urllib
import pyodbc
def write_to_sql(s_df):
    sql_conn_url = urllib.quote_plus('DRIVER={ODBC Driver 13 for SQL Server};SERVER=ser;DATABASE=db;UID=sqoop;PWD=#####;')
    sql_conn_str = "mssql+pyodbc:///?odbc_connect={0}".format(sql_conn_url)
    engine = sqlalchemy.create_engine(sql_conn_str)
    #strip the "tbl1." prefix from the column names
    s_df.rename(columns=remove_table_alias, inplace=True)
    s_df.to_sql(name='tbl2', schema='dbo', con=engine, chunksize=10000, if_exists="append", index=False)
def remove_table_alias(columnName):
    try:
        if(columnName.find(".") != -1):
            return columnName.split(".")[1]
        return columnName
    except Exception as e:
        print "ERROR in remove_table_alias ",str(e)
        #fall back to the original name so rename() never receives None
        return columnName
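What the rename step does can be seen on a few column names. This is a small sketch, assuming the driver returns names prefixed with the subquery alias (such as tbl1.policynumber); strip_table_alias is a stand-in name and 'premium' is a made-up column.

```python
def strip_table_alias(column_name):
    """Drop a leading 'alias.' prefix, e.g. 'tbl1.rownum' -> 'rownum'."""
    if "." in column_name:
        # Split on the first dot only, keeping everything after it.
        return column_name.split(".", 1)[1]
    return column_name


# Sample names; 'premium' is a hypothetical column used for illustration.
raw_columns = ["tbl1.rownum", "tbl1.policynumber", "premium"]
clean_columns = [strip_table_alias(name) for name in raw_columns]
print(clean_columns)
```

Note that split(".", 1)[1] keeps everything after the first dot, whereas split(".")[1] keeps only the second component of a name with several dots.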
Any other suggestions for reducing the time further would be helpful.
What is the best way to read the output from disk with Pandas after using cmd.get_results (e.g. from a Hive command)? For example, consider the following:
out_file = 'results.csv'
delimiter = chr(1)
....
Qubole.configure(qubole_key)
hc_params = ['--query', query]
hive_args = HiveCommand.parse(hc_params)
cmd = HiveCommand.run(**hive_args)
if HiveCommand.is_success(cmd.status):
    with open(out_file, 'wt') as writer:
        cmd.get_results(writer, delim=delimiter, inline=False)
If, after successfully running the query, I then inspect the first few bytes of results.csv, I see the following:
$ head -c 300 results.csv
b'flight_uid\twinning_price\tbid_price\timpressions_source_timestamp\n'b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n0FamrXG9AW\x01710\x01747.0\x011427243733000\n0FY6ZsrnMy\x012000\x012270.0\x011427245266000\n0FY6ZsrnMy\x012000\x012270.0\x011427245088000\n0FamrXG9AW\x01330\x01747.0\x011427243407000\n0FamrXG9AW\x01710\x01747.0\x011427243981000\n0FamrXG9AW\x01490\x01747.0\x011427245289000\n
When I try to open this in Pandas:
df = pd.read_csv('results.csv')
it obviously doesn't work (I get an empty DataFrame), since the file isn't properly formatted as a CSV. I could open results.csv and post-process it (to remove the b' markers, etc.) before loading it in Pandas, but that would be quite a hacky way to load it. Am I using the interface correctly? This is with the very latest version of qds_sdk: 1.4.2, from three hours ago.
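For reference, the post-processing route mentioned above could look roughly like this. It is a sketch of the workaround, not a statement about how qds_sdk is meant to be used. It assumes the file holds back-to-back Python bytes reprs (b'...'b'...'), that the header line is tab-separated, and that data rows use the chr(1) delimiter, as in the sample shown. The name parse_bytes_repr_dump is made up for the example.

```python
import ast
import re

# Matches one bytes literal as written by repr(), e.g. b'abc\n' or b"it's".
BYTES_LITERAL = re.compile(r"""b(?:'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*")""")


def parse_bytes_repr_dump(text, delimiter="\x01"):
    """Turn concatenated bytes reprs into (header, rows).

    Assumes the first line is a tab-separated header and that every
    following non-empty line is split on `delimiter`.
    """
    # Evaluate each b'...' literal safely and join the decoded pieces.
    raw = b"".join(
        ast.literal_eval(match.group(0)) for match in BYTES_LITERAL.finditer(text)
    )
    lines = [line for line in raw.decode("utf-8").split("\n") if line]
    header = lines[0].split("\t")
    rows = [line.split(delimiter) for line in lines[1:]]
    return header, rows


# Sample copied from the head of results.csv shown above (first two rows).
sample = (
    r"b'flight_uid\twinning_price\tbid_price\timpressions_source_timestamp\n'"
    r"b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n"
    r"0FamrXG9AW\x01710\x01747.0\x011427243733000\n'"
)
header, rows = parse_bytes_repr_dump(sample)
```

The result can then be handed to Pandas with pd.DataFrame(rows, columns=header). Every value arrives as a string, so numeric columns would still need converting.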