How to read multiple JSON files into a pandas DataFrame?
If you need to flatten the JSON, Juan Estevez’s approach won’t work as is. Here is an alternative:
import json
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        # flatten nested objects into dotted column names
        json_data = pd.json_normalize(json.loads(f.read()))
    dfs.append(json_data)
df = pd.concat(dfs, sort=False)  # or sort=True depending on your needs
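To make the flattening concrete, here is a minimal, self-contained sketch of the same loop. The sample records, field names, and temporary file names are made up for illustration; only `pd.json_normalize` and `pd.concat` come from the answer above.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical nested records, written to temporary files for the demo.
records = [
    {"id": 1, "user": {"name": "Ann", "address": {"city": "Oslo"}}},
    {"id": 2, "user": {"name": "Bob", "address": {"city": "Lima"}}},
]

with tempfile.TemporaryDirectory() as tmp:
    file_list = []
    for i, record in enumerate(records):
        path = Path(tmp) / f"record_{i}.json"
        path.write_text(json.dumps(record))
        file_list.append(path)

    dfs = []
    for file in file_list:
        with open(file) as f:
            # Nested keys become dotted column names such as "user.address.city".
            dfs.append(pd.json_normalize(json.load(f)))

flat = pd.concat(dfs, ignore_index=True, sort=False)
print(flat.columns.tolist())
```

Each nested key ends up as its own column, so the result has one row per file and no dict-valued cells.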
Or, if your JSON files are line-delimited (not tested):
import json
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines, which json.loads would reject
            json_data = pd.json_normalize(json.loads(line))
            dfs.append(json_data)
df = pd.concat(dfs, sort=False)  # or sort=True depending on your needs
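A possible variation for line-delimited files, sketched under the assumption that every line holds one JSON object: collect the parsed lines in a plain list and call `pd.json_normalize` once on that list, instead of building one small data frame per line. The file names and sample content below are hypothetical.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical line-delimited content: one JSON object per line.
sample = '{"id": 1, "meta": {"tag": "a"}}\n{"id": 2, "meta": {"tag": "b"}}\n'

with tempfile.TemporaryDirectory() as tmp:
    file_list = [Path(tmp) / "part_0.jsonl", Path(tmp) / "part_1.jsonl"]
    for path in file_list:
        path.write_text(sample)

    records = []
    for file in file_list:
        with open(file) as f:
            # Parse each non-blank line into a dict and keep it in one list.
            records.extend(json.loads(line) for line in f if line.strip())

# One json_normalize call over the list of dicts flattens every record.
df_lines = pd.json_normalize(records)
print(df_lines.shape)
```

This keeps the number of intermediate data frames at one, which tends to matter when the files contain many short lines.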
Change the last line to:
temp = temp.append(data, ignore_index=True)
This is necessary because append does not happen in place. The append method does not modify the data frame; it returns a new data frame with the result of the append operation, so you have to assign that result back to temp.
Edit:
Since writing this answer I have learned that you should never use DataFrame.append inside a loop, because it leads to quadratic copying (see this answer). DataFrame.append was also deprecated in pandas 1.4 and removed in pandas 2.0, so it is not available in current versions.
What you should do instead is first create a list of data frames and then use pd.concat to concatenate them all in a single operation, like this:
dfs = []  # an empty list to store the data frames
for file in file_list:
    data = pd.read_json(file, lines=True)  # read data frame from json file
    dfs.append(data)  # append the data frame to the list
temp = pd.concat(dfs, ignore_index=True)  # concatenate all the data frames in the list
This alternative should be considerably faster.
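As a small illustration of what `ignore_index=True` changes, and of the fact that `pd.concat` returns a new data frame rather than modifying its inputs, here is a sketch with two made-up frames `a` and `b`:

```python
import pandas as pd

# Two tiny made-up frames standing in for data read from separate files.
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

kept = pd.concat([a, b])  # each frame keeps its own index labels
renumbered = pd.concat([a, b], ignore_index=True)  # fresh 0..n-1 index

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
print(len(a))                     # 2, the inputs are left untouched
```

Without `ignore_index=True` the combined frame carries duplicate index labels, which can make later label-based lookups such as `.loc[0]` return more than one row.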
I combined Juan Estevez's answer with glob. Thanks a lot.
import glob
import pandas as pd

def readFiles(path):
    files = glob.glob(path)
    dfs = []  # an empty list to store the data frames
    for file in files:
        data = pd.read_json(file, lines=True)  # read data frame from json file
        dfs.append(data)  # append the data frame to the list
    df = pd.concat(dfs, ignore_index=True)  # concatenate all the data frames in the list
    return df
from pathlib import Path
import pandas as pd

paths = Path("/home/data").glob("*.json")
df = pd.DataFrame([pd.read_json(p, typ="series") for p in paths])
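The `typ="series"` variant fits the case where each file holds a single flat JSON object: every file is read as one Series and becomes one row. A minimal sketch, using a temporary directory and made-up records in place of `/home/data`:

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files, each holding one flat JSON object.
    samples = [{"name": "Ann", "age": 31}, {"name": "Bob", "age": 45}]
    for i, rec in enumerate(samples):
        (Path(tmp) / f"person_{i}.json").write_text(json.dumps(rec))

    # Sort the paths so the row order is deterministic.
    paths = sorted(Path(tmp).glob("*.json"))
    people = pd.DataFrame([pd.read_json(p, typ="series") for p in paths])

print(people.shape)
```

The keys of each object become the columns, and files that lack a key get a missing value in that column.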