Combining big data files with different columns into one big file
From an algorithm point of view I would take the following steps:
Process the headers:
- read all headers of all input files and extract all column names
- sort the column names in the order you want
- create a lookup table which returns the column-name when a field number is given (
h[n] -> "name"
)process the files: after the headers, you can reprocess the files
- read the header of the file
- create a lookup table which returns the field number when given a column name. An associative array is useful here: (
a["name"] -> field_number
)process the remainder of the file
- loop over all fields of the merged file
- get the column name with
h
- check if the column name is in
a
, if not print-
, if so print the field number corresponding witha
.
This is easily done with a GNU awk making use of the extensions nextfile
and asorti
. The nextfile
function allows us to read the header only and move to the next file without processing the full file. Since we need to process the file twice (step 1 reading the header and step 2 reading the file), we will ask awk to dynamically manipulate its argument list. Every time a file's header is processed, we add it at the end of the argument list ARGV
so it can be used for step 2
.
BEGIN { s="-" } # define symbol
BEGIN { f=ARGC-1 } # get total number of files
f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
ARGV[ARGC++] = FILENAME # add file at end of argument list
if (--f == 0) { # did we process all headers?
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
}
nextfile # end of processing headers
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
If you store the above in a file merge.awk
you can use the command:
awk -f merge.awk f1 f2 f3 f4 ... fx
A similar way, but less hastle with f
:
BEGIN { s="-" } # define symbol
BEGIN { # modify argument list from
c=ARGC; # from: arg1 arg2 ... argx
ARGV[ARGC++]="f=1" # to: arg1 arg2 ... argx f=1 arg1 arg2 ... argx
for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
nextfile
}
(f==1) && (FNR==1) { # process merged header
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
f=2
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
This method is slightly different, but allows the processing of files with different field separators as
awk -f merge.awk f1 FS="," f2 f3 FS="|" f4 ... fx
If your argument list becomes too long, you can use awk
to create it for you :
BEGIN { s="-" } # define symbol
BEGIN { # read argument list from input file:
fname=(ARGC==1 ? "-" : ARGV[1])
ARGC=1 # from: filelist or /dev/stdin
while ((getline < fname) > 0) # to: arg1 arg2 ... argx
ARGV[ARGC++]=$0
}
BEGIN { # modify argument list from
c=ARGC; # from: arg1 arg2 ... argx
ARGV[ARGC++]="f=1" # to: arg1 arg2 ... argx f=1 arg1 arg2 ... argx
for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
nextfile
}
(f==1) && (FNR==1) { # process merged header
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
f=2
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
which can be ran as:
$ awk -f merge.awk filelist
$ find . | awk -f merge.awk "-"
$ find . | awk -f merge.awk
or any similar command.
As you see, by adding only a tiny block of code, we were able to flexibly adjust to awk code to support our needs.
Miller (johnkerl/miller) is so underused when dealing with huge files. It has tons of features included from all useful file processing tools out there. Like the official documentation says
Miller is like
awk, sed, cut, join
, andsort
for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.
For this particular case, it supports a verb unsparsify, which by the documentation says
Prints records with the union of field names over all input records. For field names absent in a given record but present in others, fills in a value. This verb retains all input before producing any output.
You just need to do the following and reorder the file back with the column positions as you desire
mlr --tsvlite --opprint unsparsify then reorder -f a,b,c,d,e,f file{1..3}.dat
which produces the output in one-shot as
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
You can even customize what characters you can use to fill the empty fields with, with default being -
. For custom characters use unsparsify --fill-with '#'
A brief explanation of the fields used
- To delimit the input stream as a tab delimited content,
--tsvlite
- To pretty print the tabular data
--opprint
- And
unsparsify
like explained above does a union of all the field names over all input stream - The reordering verb
reorder
is needed because the column headers appear in random order between the files. So to define the order explicitly, use the-f
option with the column headers you want the output to appear with.
And installation of the package is so straightforward. Miller is written in portable, modern C, with zero runtime dependencies. The installation via package managers is so easy and it supports all major package managers Homebrew, MacPorts, apt-get
, apt
and yum
.
Given your updated information in comments about having about 10^5 input files (and so exceeding the shells max number of args for a non-builtin command) and wanting the output columns in the order they're seen rather than alphabetically sorted, the following will work using any awk and any find:
$ cat tst.sh
#!/bin/env bash
find . -maxdepth 1 -type f -name "$1" |
awk '
NR==FNR {
fileName = $0
ARGV[ARGC++] = fileName
if ( (getline fldList < fileName) > 0 ) {
if ( !seenList[fldList]++ ) {
numFlds = split(fldList,fldArr)
for (inFldNr=1; inFldNr<=numFlds; inFldNr++) {
fldName = fldArr[inFldNr]
if ( !seenName[fldName]++ ) {
hdr = (numOutFlds++ ? hdr OFS : "") fldName
outNr2name[numOutFlds] = fldName
}
}
}
}
close(fileName)
next
}
FNR == 1 {
if ( !doneHdr++ ) {
print hdr
}
delete name2inNr
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
fldName = $inFldNr
name2inNr[fldName] = inFldNr
}
next
}
{
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
fldName = outNr2name[outFldNr]
inFldNr = name2inNr[fldName]
fldValue = (inFldNr ? $inFldNr : "-")
printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
}
}
' -
.
$ ./tst.sh 'file*.dat'
a b c e f d g
5 7 2 - - - -
3 9 1 - - - -
2 9 - 8 3 - -
2 8 - 3 3 - -
1 0 - 3 2 - -
1 - 1 - - 5 2
Note that input to the script is now the globbing pattern you want find
to use to find the files, not the list of files.
Original answer:
If you don't mind a combined shell+awk script then this will work using any awk:
$ cat tst.sh
#!/bin/env bash
awk -v hdrs="$(head -1 -q "$@" | tr ' ' '\n' | sort -u)" '
BEGIN {
numOutFlds = split(hdrs,outNr2name)
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
fldName = outNr2name[outFldNr]
printf "%s%s", fldName, (outFldNr<numOutFlds ? OFS : ORS)
}
}
FNR == 1 {
delete name2inNr
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
fldName = $inFldNr
name2inNr[fldName] = inFldNr
}
next
}
{
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
fldName = outNr2name[outFldNr]
inFldNr = name2inNr[fldName]
fldValue = (inFldNr ? $inFldNr : "-")
printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
}
}
' "$@"
.
$ ./tst.sh file{1..3}.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
otherwise this is all awk using GNU awk for arrays of arrays, sorted_in, and ARGIND:
$ cat tst.awk
BEGIN {
for (inFileNr=1; inFileNr<ARGC; inFileNr++) {
inFileName = ARGV[inFileNr]
if ( (getline < inFileName) > 0 ) {
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
fldName = $inFldNr
name2inNr[fldName][inFileNr] = inFldNr
}
}
close(inFileName)
}
PROCINFO["sorted_in"] = "@ind_str_asc"
for (fldName in name2inNr) {
printf "%s%s", (numOutFlds++ ? OFS : ""), fldName
for (inFileNr in name2inNr[fldName]) {
outNr2inNr[numOutFlds][inFileNr] = name2inNr[fldName][inFileNr]
}
}
print ""
}
FNR > 1 {
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
inFldNr = outNr2inNr[outFldNr][ARGIND]
fldValue = (inFldNr ? $inFldNr : "-")
printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
}
}
.
$ awk -f tst.awk file{1..3}.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
For efficiency the 2nd script above does all the heavy lifting in the BEGIN section so there's as little work left to do as possible in the main body of the script that's evaluated once per input line. In the BEGIN section it creates an associative array (outNr2inNr[]
) that maps the outgoing field numbers (alphabetically sorted list of all field names across all input files) to the incoming field numbers so all that's left to do in the body is print the fields in that order.