How to import a mysqldump into Pandas
One way is to convert the mysqldump to sqlite (e.g. run this shell script), then read the resulting sqlite file/database.

See the SQL section of the docs:

```python
pd.read_sql_table(table_name, sqlite_file)
```

Another option is simply to run `read_sql` on the MySQL database directly.
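As a minimal sketch of that route, here is `read_sql_query` run against an in-memory SQLite database standing in for the real one (for MySQL you would pass an SQLAlchemy engine instead of a DBAPI connection; the `food_min` table and its rows are made up for illustration):

```python
import sqlite3

import pandas as pd

# Hypothetical example: a small in-memory SQLite database standing in
# for the database the dump was imported into.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE food_min (Item TEXT, quantity INTEGER)')
conn.executemany('INSERT INTO food_min VALUES (?, ?)',
                 [('Pizza', 300), ('Hotdogs', 200)])
conn.commit()

# read_sql_query accepts a plain DBAPI connection for SQLite;
# for MySQL you would pass an SQLAlchemy engine instead.
df = pd.read_sql_query('SELECT * FROM food_min', conn)
print(df)
```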
I found myself in a similar situation to yours, and the answer from @firelynx was really helpful!

But since I had only limited knowledge of the tables included in the file, I extended the script by adding header generation (pandas picks the headers up automatically) and by searching for all the tables within the dump file. As a result, I ended up with the following script, which indeed works extremely fast. I switched to `io.StringIO`, and save the resulting tables as `table_name.csv` files.

P.S. I also support the advice against relying on this approach, and provide the code just for illustration purposes :)
So, first things first, we can augment the `read_dump` function like this:

```python
from io import StringIO
import re, shutil

def read_dump(dump_filename, target_table):
    sio = StringIO()

    read_mode = 0  # 0 - skip, 1 - header, 2 - data
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('insert') and target_table in line:
                read_mode = 2
            if line.lower().startswith('create table') and target_table in line:
                read_mode = 1
                continue

            if read_mode == 0:
                continue

            # Filling up the headers
            elif read_mode == 1:
                if line.lower().startswith('primary'):
                    # add more conditions here for different cases
                    # (e.g. when simply a key is defined, or no key is defined)
                    read_mode = 0
                    sio.seek(sio.tell() - 1)  # overwrite the last comma
                    sio.write('\n')
                    continue
                colheader = re.findall(r'`([\w_]+)`', line)
                for col in colheader:
                    sio.write(col.strip())
                    sio.write(',')

            # Filling up the data - same as @firelynx's code
            elif read_mode == 2:
                data = re.findall(r'\([^)]*\)', line)
                try:
                    newline = data[0]
                    newline = newline.strip(' ()')
                    newline = newline.replace('`', '')
                    sio.write(newline)
                    sio.write("\n")
                except IndexError:
                    pass
                if line.endswith(';'):
                    break

    sio.seek(0)
    with open(target_table + '.csv', 'w') as fd:
        shutil.copyfileobj(sio, fd, -1)
    return  # or simply return sio itself
To find the list of tables, we can use the following function:

```python
def find_tables(dump_filename):
    table_list = []

    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('create table'):
                table_name = re.findall(r'create table `([\w_]+)`', line.lower())
                table_list.extend(table_name)

    return table_list
```
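The regex used in `find_tables` can be exercised on a couple of sample lines (the lines here are invented, but follow the usual mysqldump shape):

```python
import re

# Made-up lines following the usual mysqldump shape.
dump_lines = [
    "CREATE TABLE `food_min` (",
    "INSERT INTO `food_min` VALUES ('Pizza','300');",
    "CREATE TABLE `food_in` (",
]

table_list = []
for line in dump_lines:
    line = line.strip()
    if line.lower().startswith('create table'):
        # Same pattern as in find_tables: grab the backquoted table name.
        table_list.extend(re.findall(r'create table `([\w_]+)`', line.lower()))

print(table_list)  # ['food_min', 'food_in']
```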
Then just combine the two, for example in a .py script that you'll run like

```shell
python this_script.py mysqldump_name.sql [table_name]
```

```python
import os.path
import sys

def main():
    try:
        if len(sys.argv) >= 2 and os.path.isfile(sys.argv[1]):
            if len(sys.argv) == 2:
                print('Table name not provided, looking for all tables...')
                table_list = find_tables(sys.argv[1])
                if len(table_list) > 0:
                    print('Found tables: ', str(table_list))
                    for table in table_list:
                        read_dump(sys.argv[1], table)
            elif len(sys.argv) == 3:
                read_dump(sys.argv[1], sys.argv[2])
    except KeyboardInterrupt:
        sys.exit(0)

if __name__ == '__main__':
    main()
```
No
Pandas has no native way of reading a mysqldump without it passing through a database.
There is a possible workaround, but it is in my opinion a very bad idea.
Workaround (Not recommended for production use)
Of course you could parse the data from the mysqldump file using a preprocessor.
MySQLdump files often contain a lot of extra data we are not interested in when loading a pandas dataframe, so we need to preprocess the file, removing the noise and even reformatting lines so that they conform to CSV.

Using `StringIO` we can read the file and process the data before it is fed to the `pandas.read_csv` function:
```python
from io import StringIO
import re

def read_dump(dump_filename, target_table):
    sio = StringIO()

    fast_forward = True
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('insert') and target_table in line:
                fast_forward = False
            if fast_forward:
                continue
            data = re.findall(r'\([^)]*\)', line)
            try:
                newline = data[0]
                newline = newline.strip(' ()')
                newline = newline.replace('`', '')
                sio.write(newline)
                sio.write("\n")
            except IndexError:
                pass
            if line.endswith(';'):
                break
    sio.seek(0)
    return sio
```
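To see what the parsing step does, the same findall/strip/replace sequence can be run on a single sample INSERT line (the line below is invented for illustration). Note that iterating over all matches, rather than taking only `data[0]`, also covers extended INSERT syntax with several row tuples per line:

```python
import re

# An invented INSERT line with two row tuples (extended INSERT syntax).
line = "INSERT INTO `food_min` VALUES ('Pizza','300'),('Hotdogs','200');"

rows = []
for match in re.findall(r'\([^)]*\)', line):
    # Same cleanup as in read_dump: drop the parentheses and backquotes.
    rows.append(match.strip(' ()').replace('`', ''))

print(rows)  # ["'Pizza','300'", "'Hotdogs','200'"]
```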
Now that we have a function that reads and formats the data to look like a CSV file, we can read it with `pandas.read_csv()`:

```python
import pandas as pd

food_min_filedata = read_dump('mysqldumpexample', 'food_min')
food_in_filedata = read_dump('mysqldumpexample', 'food_in')

df_food_min = pd.read_csv(food_min_filedata)
df_food_in = pd.read_csv(food_in_filedata)
```
Results in:

```
        Item quantity
0    'Pizza'    '300'
1  'Hotdogs'    '200'
2     'Beer'    '300'
3    'Water'    '230'
4     'Soda'    '399'
5     'Soup'    '100'
```

and

```
   ID    Cat     Item     price quantity
0   2  'Liq'   'Beer'  '2.5000'    '300'
1   7  'Liq'  'Water'  '3.5000'    '230'
2   9  'Liq'   'Soda'  '3.5000'    '399'
```
Note on Stream processing
This approach is called stream processing and uses almost no memory at all, since only one line is held at a time. In general, it is a good idea to use this approach to read CSV files more efficiently into pandas. It is the parsing of a mysqldump file that I advise against.
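pandas' own `chunksize` parameter applies the same idea to plain CSV files: `read_csv` then yields small DataFrame chunks instead of loading the whole file at once (the in-memory CSV below stands in for a large file on disk):

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a large file on disk.
csv_file = io.StringIO("Item,quantity\nPizza,300\nHotdogs,200\nBeer,300\n")

# chunksize turns read_csv into a stream processor: each chunk is a
# small DataFrame, so peak memory stays bounded by the chunk size.
total = 0
for chunk in pd.read_csv(csv_file, chunksize=2):
    total += chunk['quantity'].sum()

print(total)  # 800
```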