Merge CSVs in Python with different columns
For those of us using Python 2.7, the code below adds an extra linefeed between records in "out.csv". To resolve this, change the file mode from "w" to "wb" (and drop newline="", which Python 2's open() does not accept).
The csv.DictReader and csv.DictWriter classes should work well (see Python docs). Something like this:
import csv

inputs = ["in1.csv", "in2.csv"]  # etc

# First determine the field names from the top line of each input file
# Comment 1 below
fieldnames = []
for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.reader(f_in)
        headers = next(reader)
        for h in headers:
            if h not in fieldnames:
                fieldnames.append(h)

# Then copy the data
with open("out.csv", "w", newline="") as f_out:  # Comment 2 below
    writer = csv.DictWriter(f_out, fieldnames=fieldnames)
    writer.writeheader()  # Write the combined header row once, before any data
    for filename in inputs:
        with open(filename, "r", newline="") as f_in:
            reader = csv.DictReader(f_in)  # Uses the field names in this file
            for line in reader:
                # Comment 3 below
                writer.writerow(line)
Comments from above:
- You need to specify all the possible field names in advance to DictWriter, so you need to loop through all your CSV files twice: once to find all the headers, and once to read the data. There is no better solution, because all the headers need to be known before DictWriter can write the first line. This part would be more efficient using sets instead of lists (the "in" operator on a list is comparatively slow), but it won't make much difference for a few hundred headers. Sets would also lose the deterministic ordering of a list: your columns would come out in a different order each time you ran the code.
- The above code is for Python 3, where weird things happen in the CSV module without newline="". Remove this for Python 2.
- At this point, "line" is a dict with the field names as keys, and the column data as values. You can specify what to do with blank or unknown values in the DictReader and DictWriter constructors.
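As a rough illustration of the first and third comments, here is a minimal sketch. It assumes Python 3.7+, where dict keys keep insertion order, so a dict gives fast membership checks without losing column order. The helper name collect_fieldnames, the sample headers, and the "N/A" placeholder are made up for the example; restval and extrasaction are the DictWriter constructor arguments that control missing and unknown fields.

```python
import csv
import io

def collect_fieldnames(header_rows):
    """Return the union of header names in first-seen order (hypothetical helper)."""
    seen = {}  # dict keys keep insertion order in Python 3.7+, with fast lookups
    for headers in header_rows:
        for h in headers:
            seen.setdefault(h, None)
    return list(seen)

# Sample header rows, standing in for the first line of each input file
fieldnames = collect_fieldnames([["id", "name"], ["id", "email"]])

out = io.StringIO()  # in-memory stand-in for "out.csv"
writer = csv.DictWriter(
    out,
    fieldnames=fieldnames,
    restval="N/A",            # value written when a row lacks a field
    extrasaction="ignore",    # silently drop keys not listed in fieldnames
    lineterminator="\n",
)
writer.writeheader()
writer.writerow({"id": "1", "name": "Ann"})
writer.writerow({"id": "2", "email": "bob@example.com", "unused": "x"})
result = out.getvalue()
print(result)
```

With the default extrasaction="raise", the second row would raise ValueError because of the "unused" key, which can be a useful check if unknown columns should never appear.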
This method should not run out of memory, because it never has the whole file loaded at once.
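To see the two-pass approach end to end, here is a small self-contained sketch that writes two sample files to a temporary directory and merges them. The function name merge_csvs, the file names, and the sample rows are all hypothetical; the logic follows the code above, reading only the header row in the first pass and streaming one row at a time in the second.

```python
import csv
import os
import tempfile

def merge_csvs(inputs, output):
    """Merge CSV files with differing columns (hypothetical wrapper)."""
    # Pass 1: read only the header row of each input file
    fieldnames = []
    for filename in inputs:
        with open(filename, "r", newline="") as f_in:
            for h in next(csv.reader(f_in), []):
                if h not in fieldnames:
                    fieldnames.append(h)
    # Pass 2: copy rows one at a time, so no file is held in memory whole
    with open(output, "w", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=fieldnames)
        writer.writeheader()
        for filename in inputs:
            with open(filename, "r", newline="") as f_in:
                for row in csv.DictReader(f_in):
                    writer.writerow(row)
    return fieldnames

with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.csv")
    b = os.path.join(tmp, "b.csv")
    out = os.path.join(tmp, "out.csv")
    # Sample inputs with overlapping but different columns
    with open(a, "w", newline="") as f:
        f.write("id,name\n1,Ann\n")
    with open(b, "w", newline="") as f:
        f.write("id,email\n2,bob@example.com\n")
    merged_fields = merge_csvs([a, b], out)
    with open(out, newline="") as f:
        merged_rows = list(csv.reader(f))

for row in merged_rows:
    print(row)
```

Fields missing from a given input come out as empty strings here, since DictWriter's restval defaults to "".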