Combining big data files with different columns into one big file

From an algorithmic point of view, I would take the following steps:

1. Process the headers:

   - read the headers of all input files and extract all column names
   - sort the column names into the order you want
   - create a lookup table which returns the column name when a field number is given (h[n] -> "name")

2. Process the files. Once the headers are known, you can reprocess the files:

   - read the header of the file
   - create a lookup table which returns the field number when given a column name. An associative array is useful here: (a["name"] -> field_number)
   - process the remainder of the file:
       - loop over all fields of the merged file
       - get the column name with h
       - check whether the column name is in a; if not, print -, otherwise print the field whose number is stored in a.

This is easily done with GNU awk, making use of the extensions nextfile and asorti. The nextfile statement allows us to read only the header and move on to the next file without processing the whole file. Since we need to process each file twice (step 1 reading the header and step 2 reading the file), we ask awk to manipulate its argument list dynamically: every time a file's header has been processed, we add the file to the end of the argument list ARGV so it can be used again for step 2.

BEGIN { s="-" }                # define symbol
BEGIN { f=ARGC-1 }             # get total number of files
f { for (i=1;i<=NF;++i) h[$i]  # read headers in associative array h[key]
    ARGV[ARGC++] = FILENAME    # add file at end of argument list
    if (--f == 0) {            # did we process all headers?
       n=asorti(h)             # sort header into h[idx] = key
       for (i=1;i<=n;++i)      # print header
           printf "%s%s", h[i], (i==n?ORS:OFS)
    }
    nextfile                   # end of processing headers
}           
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }

If you store the above in a file merge.awk you can use the command:

awk -f merge.awk f1 f2 f3 f4 ... fx
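
The ARGV trick used in the script can be tried in isolation. The following is only a minimal sketch, written in POSIX awk rather than GNU awk; the scratch directory, the file name f1 and the flag queued are made up for the demonstration and are not part of the script above.

```shell
# Work in a scratch directory so no real files are touched.
demo_dir=$(mktemp -d)
printf 'a b\n1 2\n' > "$demo_dir/f1"

# On the first line of the first pass, queue the same file again by
# appending it to ARGV. "queued" is a hypothetical flag that stops the
# second pass from queuing the file a third time.
two_pass=$(awk '
    FNR == 1 && !queued++ { ARGV[ARGC++] = FILENAME }
    { print NR ": " $0 }
' "$demo_dir/f1")

printf '%s\n' "$two_pass"
rm -rf "$demo_dir"
```

NR keeps counting across both passes while FNR restarts, which is why the scripts here test FNR==1 to detect a header line.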

A similar approach, with less hassle around f:

BEGIN { s="-" }                 # define symbol
BEGIN {                         # modify argument list from
        c=ARGC;                 #   from: arg1 arg2  ... argx
        ARGV[ARGC++]="f=1"      #   to:   arg1 arg2  ... argx f=1 arg1 arg2  ... argx
        for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i]  # read headers in associative array h[key]
     nextfile
}
(f==1) && (FNR==1) {            # process merged header
     n=asorti(h)                # sort header into h[idx] = key
     for (i=1;i<=n;++i)         # print header
        printf "%s%s", h[i], (i==n?ORS:OFS)
     f=2                         
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }

This method is slightly different, but it allows the processing of files with different field separators, as in:

awk -f merge.awk f1 FS="," f2 f3 FS="|" f4 ... fx
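
This works because awk evaluates a var=value operand at the moment it reaches it in the argument list, so it only affects the files that follow. A small sketch of that behaviour, with made-up file names and contents:

```shell
demo_dir=$(mktemp -d)
printf 'a b\n' > "$demo_dir/spaces.txt"   # whitespace-separated
printf 'c,d\n' > "$demo_dir/commas.txt"   # comma-separated

# FS="," is applied after spaces.txt has been read and before
# commas.txt is opened, so $2 is split correctly in both files.
second_fields=$(awk '{ print $2 }' "$demo_dir/spaces.txt" FS="," "$demo_dir/commas.txt")

printf '%s\n' "$second_fields"
rm -rf "$demo_dir"
```

Because merge.awk copies the whole argument list, the assignments are copied along with the file names and are applied again in the second pass.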

If your argument list becomes too long, you can let awk build it for you:

BEGIN { s="-" }                 # define symbol
BEGIN {                         # read argument list from input file:
  fname=(ARGC==1 ? "-" : ARGV[1])
  ARGC=1                        # from: filelist or /dev/stdin
  while ((getline < fname) > 0) #   to:   arg1 arg2 ... argx
     ARGV[ARGC++]=$0
}
BEGIN {                         # modify argument list from
        c=ARGC;                 #   from: arg1 arg2  ... argx
        ARGV[ARGC++]="f=1"      #   to:   arg1 arg2  ... argx f=1 arg1 arg2  ... argx
        for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i]  # read headers in associative array h[key]
     nextfile
}
(f==1) && (FNR==1) {            # process merged header
     n=asorti(h)                # sort header into h[idx] = key
     for (i=1;i<=n;++i)         # print header
        printf "%s%s", h[i], (i==n?ORS:OFS)
     f=2                         
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }

which can be run as:

$ awk -f merge.awk filelist
$ find . | awk -f merge.awk "-"
$ find . | awk -f merge.awk

or any similar command.

As you can see, by adding only a tiny block of code, we were able to flexibly adjust the awk code to support our needs.
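
The core of that extra block, filling ARGV from a list inside BEGIN, can be tried on its own. This is a simplified sketch and not the script above: it passes the list name through a hypothetical -v variable called list instead of taking it from ARGV[1] or standard input, and the file names are made up.

```shell
demo_dir=$(mktemp -d)
printf 'one\n' > "$demo_dir/f1"
printf 'two\n' > "$demo_dir/f2"
# One file name per line, as find would produce.
printf '%s\n' "$demo_dir/f1" "$demo_dir/f2" > "$demo_dir/filelist"

# No file operands are given on the command line; BEGIN appends every
# line of the list to ARGV, and awk then reads those files as usual.
from_list=$(awk -v list="$demo_dir/filelist" '
    BEGIN {
        while ((getline name < list) > 0) ARGV[ARGC++] = name
        close(list)
    }
    { print $0 }
')

printf '%s\n' "$from_list"
rm -rf "$demo_dir"
```

Since the file names never appear on the shell command line, the length of the list is no longer limited by the maximum argument size.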

Miller (johnkerl/miller) is so underused when dealing with huge files. It bundles tons of features from the common file-processing tools out there. As the official documentation says:

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.

For this particular case, it supports a verb, unsparsify, which the documentation describes as follows:

Prints records with the union of field names over all input records. For field names absent in a given record but present in others, fills in a value. This verb retains all input before producing any output.

You just need to run the following, reordering the columns into the positions you want:

mlr --tsvlite --opprint unsparsify then reorder -f a,b,c,d,e,f file{1..3}.dat

which produces the output in one shot:

a   b   c   d   e   f   g
5   7   2   -   -   -   -
3   9   1   -   -   -   -
2   9   -   -   8   3   -
2   8   -   -   3   3   -
1   0   -   -   3   2   -
1   -   1   5   -   -   2

You can even customize the value used to fill the empty fields. By default unsparsify fills them with the empty string, which the --opprint output displays as -. For a custom value use unsparsify --fill-with '#'

A brief explanation of the options used:

--tsvlite reads the input stream as tab-delimited content.
--opprint pretty-prints the tabular data on output.
unsparsify, as explained above, takes the union of all field names over the whole input stream.
reorder is needed because the column headers appear in a different order in each file. To define the order explicitly, use the -f option with the column headers in the order you want them to appear in the output.

Installation of the package is straightforward. Miller up to version 5 is written in portable, modern C, with zero runtime dependencies (Miller 6 was rewritten in Go). It can be installed through the major package managers: Homebrew, MacPorts, apt-get, apt and yum.

Given your updated information in the comments about having about 10^5 input files (and so exceeding the shell's maximum number of arguments for a non-builtin command) and wanting the output columns in the order they're seen rather than alphabetically sorted, the following will work using any awk and any find:

$ cat tst.sh
#!/usr/bin/env bash
find . -maxdepth 1 -type f -name "$1" |
awk '
NR==FNR {
    fileName = $0
    ARGV[ARGC++] = fileName
    if ( (getline fldList < fileName) > 0 ) {
        if ( !seenList[fldList]++ ) {
            numFlds = split(fldList,fldArr)
            for (inFldNr=1; inFldNr<=numFlds; inFldNr++) {
                fldName = fldArr[inFldNr]
                if ( !seenName[fldName]++ ) {
                    hdr = (numOutFlds++ ? hdr OFS : "") fldName
                    outNr2name[numOutFlds] = fldName
                }
            }
        }
    }
    close(fileName)
    next
}
FNR == 1 {
    if ( !doneHdr++ ) {
        print hdr
    }
    delete name2inNr
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        fldName = $inFldNr
        name2inNr[fldName] = inFldNr
    }
    next
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        fldName = outNr2name[outFldNr]
        inFldNr = name2inNr[fldName]
        fldValue = (inFldNr ? $inFldNr : "-")
        printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
' -

$ ./tst.sh 'file*.dat'
a b c e f d g
5 7 2 - - - -
3 9 1 - - - -
2 9 - 8 3 - -
2 8 - 3 3 - -
1 0 - 3 2 - -
1 - 1 - - 5 2

Note that input to the script is now the globbing pattern you want find to use to find the files, not the list of files.

Original answer:

If you don't mind a combined shell+awk script then this will work using any awk:

$ cat tst.sh
#!/usr/bin/env bash

awk -v hdrs="$(head -1 -q "$@" | tr ' ' '\n' | sort -u)" '
BEGIN {
    numOutFlds = split(hdrs,outNr2name)
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        fldName = outNr2name[outFldNr]
        printf "%s%s", fldName, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
FNR == 1 {
    delete name2inNr
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        fldName = $inFldNr
        name2inNr[fldName] = inFldNr
    }
    next
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        fldName = outNr2name[outFldNr]
        inFldNr = name2inNr[fldName]
        fldValue = (inFldNr ? $inFldNr : "-")
        printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
' "$@"

$ ./tst.sh file{1..3}.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2

Otherwise, this is all awk, using GNU awk for arrays of arrays, sorted_in, and ARGIND:

$ cat tst.awk
BEGIN {
    for (inFileNr=1; inFileNr<ARGC; inFileNr++) {
        inFileName = ARGV[inFileNr]
        if ( (getline < inFileName) > 0 ) {
            for (inFldNr=1; inFldNr<=NF; inFldNr++) {
                fldName = $inFldNr
                name2inNr[fldName][inFileNr] = inFldNr
            }
        }
        close(inFileName)
    }

    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (fldName in name2inNr) {
        printf "%s%s", (numOutFlds++ ? OFS : ""), fldName
        for (inFileNr in name2inNr[fldName]) {
            outNr2inNr[numOutFlds][inFileNr] = name2inNr[fldName][inFileNr]
        }
    }
    print ""
}

FNR > 1 {
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = outNr2inNr[outFldNr][ARGIND]
        fldValue = (inFldNr ? $inFldNr : "-")
        printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
    }
}

$ awk -f tst.awk file{1..3}.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2

For efficiency the 2nd script above does all the heavy lifting in the BEGIN section so there's as little work left to do as possible in the main body of the script that's evaluated once per input line. In the BEGIN section it creates an associative array (outNr2inNr[]) that maps the outgoing field numbers (alphabetically sorted list of all field names across all input files) to the incoming field numbers so all that's left to do in the body is print the fields in that order.
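
The same precomputed mapping can be sketched without the GNU extensions. What follows is an illustration under stated assumptions, not a drop-in replacement for tst.awk: it keeps the columns in first-seen order instead of sorting them, emulates the two-dimensional array with SUBSEP keys, counts files with a hypothetical curFile counter instead of ARGIND (so every operand must be a non-empty file, with no var=value operands), and uses sample files whose contents are reconstructed from the outputs shown above.

```shell
demo_dir=$(mktemp -d)
printf 'a b c\n5 7 2\n3 9 1\n'                > "$demo_dir/file1.dat"
printf 'a b e f\n2 9 8 3\n2 8 3 3\n1 0 3 2\n' > "$demo_dir/file2.dat"
printf 'a c d g\n1 1 5 2\n'                   > "$demo_dir/file3.dat"

merged=$(awk '
BEGIN {
    # Read only the header of each file and build
    # map[outNr SUBSEP fileNr] = inNr before any data line is processed.
    for (fileNr = 1; fileNr < ARGC; fileNr++) {
        file = ARGV[fileNr]
        if ((getline line < file) > 0) {
            numFlds = split(line, names)
            for (inNr = 1; inNr <= numFlds; inNr++) {
                name = names[inNr]
                if (!(name in outNr)) {      # first time this column is seen
                    outNr[name] = ++numOut
                    hdr = hdr (numOut > 1 ? OFS : "") name
                }
                map[outNr[name] SUBSEP fileNr] = inNr
            }
        }
        close(file)
    }
    print hdr
}
FNR == 1 { curFile++; next }                 # skip header, count the file
{
    # Per data line, only lookups and prints are left to do.
    for (o = 1; o <= numOut; o++) {
        key = o SUBSEP curFile
        printf "%s%s", (key in map ? $(map[key]) : "-"), (o < numOut ? OFS : ORS)
    }
}
' "$demo_dir/file1.dat" "$demo_dir/file2.dat" "$demo_dir/file3.dat")

printf '%s\n' "$merged"
rm -rf "$demo_dir"
```

Testing key in map rather than reading map[key] directly avoids creating empty entries for columns that a file does not have.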
