How do I break up a file like split to stdout for piping to a command?

I think the easiest way to do this is:

while IFS= read -r line; do
  { printf '%s\n' "$line"; head -n 99; } |
  other_commands
done <database_file

You need read for the first line of each section because it is what stops the loop at end of file: read fails when nothing is left, whereas head succeeds even on empty input, so there appears to be no other way to detect the end. For more information see:

Check if pipe is empty and run a command on the data if it isn't
How to pipe output from one process to another but only execute if the first has output?
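
As a sketch, the same loop can be wrapped in a function that takes the chunk size as a parameter. The name chunk_lines and its argument order are made up for illustration. It assumes the input is a regular (seekable) file and a head, such as GNU head, that leaves the file offset just past the lines it printed; on a pipe, head may read ahead and swallow lines belonging to the next chunk.

```shell
#!/bin/sh
# chunk_lines N FILE CMD [ARGS...]
# Hypothetical helper (not part of the original answer): runs CMD once per
# N-line chunk of FILE, with the chunk on CMD's stdin.
# Assumes FILE is a regular file so read and head share one file offset.
chunk_lines() {
    n=$1 file=$2
    shift 2
    # read takes the first line of each chunk and fails at end of file,
    # which ends the loop; head then copies up to N-1 more lines.
    while IFS= read -r line; do
        { printf '%s\n' "$line"; head -n "$((n - 1))"; } | "$@"
    done <"$file"
}

# Usage: print the last line of every 3-line chunk of a 7-line file.
tmp=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 >"$tmp"
chunk_lines 3 "$tmp" tail -n 1
rm -f "$tmp"
```

Any command that reads stdin can replace tail -n 1 here, just as other_commands does in the loop above.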

Basically, I'm looking for split that will output to stdout, not files.

If you have access to GNU split, the --filter option does exactly that:

‘--filter=command’

    With this option, rather than simply writing to each output file, write
    through a pipe to the specified shell command for each output file.

So in your case, you could either use those commands with --filter, e.g.

split -l 100 --filter='{ cat Header.sql; cat; } | sqlcmd; printf %s\\n DONE' infile

or write a script, e.g. myscript:

#!/bin/sh

{ cat Header.sql; cat; } | sqlcmd
printf %s\\n '--- PROCESSED ---'

and then simply run

split -l 100 --filter=./myscript infile
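
To see the chunk boundaries without sqlcmd, here is a small sketch that only prints a header before each chunk. The wrapper name split_to_stdout is hypothetical; it assumes GNU split from coreutils, which sets $FILE for the filter to the name the chunk file would otherwise have had.

```shell
#!/bin/sh
# split_to_stdout N: hypothetical wrapper around GNU split (coreutils).
# Reads stdin and, for each N-line chunk, prints a header and then the chunk.
split_to_stdout() {
    # Single quotes keep $FILE unexpanded here; split exports it to the
    # filter as the would-be output file name (xaa, xab, ...).
    split -l "$1" --filter='echo "== $FILE"; cat' -
}

# Usage: seven input lines in chunks of three.
printf '%s\n' 1 2 3 4 5 6 7 | split_to_stdout 3
```

The filter's stdout is split's stdout, so the whole thing can be piped onward like any other command.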

_linc() ( ${sh-da}sh ${dbg+-vx} 4<&0 <&3 ) 3<<-ARGS 3<<\CMD
        set -- $( [ $((i=${1%%*[!0-9]*}-1)) -gt 1 ] && {
                shift && echo "\${inc=$i}" ; }
        unset cmd ; [ $# -gt 0 ] || cmd='echo incr "#$((i=i+1))" ; cat'
        printf '%s ' 'me=$$ ;' \
        '_cmd() {' '${dbg+set -vx ;}' "$@" "$cmd" '
        }' )
        ARGS
        s= ; sed -f - <<-INC /dev/fd/4 | . /dev/stdin
                i_cmd <<"${s:=${me}SPLIT${me}}"
                ${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
                a$s
        INC
CMD

The function above uses sed to apply its argument list, as a command string, to each chunk of an arbitrary number of lines. The commands you give on the command line are sourced into a temporary shell function, which receives each chunk of lines as a here-document on stdin.

You use it like this:

time printf 'this is line #%d\n' `seq 1000` |
_linc 193 sed -e \$= -e r \- \| tail -n2
    #output
193
this is line #193
193
this is line #386
193
this is line #579
193
this is line #772
193
this is line #965
35
this is line #1000
printf 'this is line #%d\n' `seq 1000`  0.00s user 0.00s system 0% cpu 0.004 total

The mechanism here is very simple:

i_cmd <<"${s:=${me}SPLIT${me}}"
${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
a$s

That's the sed script. printf reuses its format for every argument, and %.0b consumes an argument while printing nothing, so we just print $!n once per argument. The function subtracts one from the increment first, so if you set your increment to 100, printf writes a sed script of 99 lines that say only $!n, plus one insert line for the top end of the here-doc and one append for the bottom line. That's it. Most of the rest just handles options.

The n (next) command tells sed to print the current line and replace it with the next line of input. The $! address restricts it to every line but the last, so on the final line sed falls through to the closing append instead of trying to read more input.
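
The same mechanism can be tried in isolation. The sketch below is not the author's function: wrap_chunks is a hypothetical name, the literal BEGIN and END markers stand in for the here-doc opener and terminator, and it uses the portable %.0s where the original uses %.0b. It assumes a chunk size of at least 2 and was written with GNU sed in mind.

```shell
#!/bin/sh
# wrap_chunks N: hypothetical demo of the generated sed script.
# Wraps every N lines of stdin between a BEGIN and an END line.
# Assumes N >= 2 (with N = 1, printf would get no arguments and still
# print its format once).
wrap_chunks() {
    steps=$(( $1 - 1 ))
    script=$(
        printf 'i\\\nBEGIN\n'               # insert a line before the chunk
        printf '$!n\n%.0s' $(seq "$steps")  # one "$!n" per remaining line
        printf 'a\\\nEND\n'                 # append a line after the chunk
    )
    sed -e "$script"
}

# Usage: seven input lines in chunks of three.
printf '%s\n' 1 2 3 4 5 6 7 | wrap_chunks 3
```

Replacing BEGIN with a command and a here-doc opener, and END with the matching terminator, turns this output into a shell script, which is what the function does.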

Given only an increment and no command, it prints a counter before each chunk:

printf 'this is line #%d\n' `seq 10` |
_linc 3
    #output
incr #1
this is line #1
this is line #2
this is line #3
incr #2
this is line #4
this is line #5
this is line #6
incr #3
this is line #7
this is line #8
this is line #9
incr #4
this is line #10

Behind the scenes, when the function gets no command string it falls back to echoing a counter and running cat on its input. Typed out on the command line, one increment would look like:

{ echo "incr #$((i=i+1))" ; cat ; } <<HEREDOC
this is line #7
this is line #8
this is line #9
HEREDOC

It executes one of these for every increment. Look:

printf 'this is line #%d\n' `seq 10` |
dbg= _linc 3
    #output
set -- ${inc=2}
+ set -- 2
me=$$ ; _cmd() { ${dbg+set -vx ;} echo incr "#$((i=i+1))" ; cat
}
+ me=19396
        s= ; sed -f - <<-INC /dev/fd/4 | . /dev/stdin
                i_cmd <<"${s:=${me}SPLIT${me}}"
                ${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
                a$s
        INC
+ s=
+ . /dev/stdin
+ seq 2
+ printf $!n\n%.0b 1 2
+ sed -f - /dev/fd/4
_cmd <<"19396SPLIT19396"
this is line #1
this is line #2
this is line #3
19396SPLIT19396
+ _cmd
+ set -vx ; echo incr #1
+ cat
this is line #1
this is line #2
this is line #3
_cmd <<"19396SPLIT19396"
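
The trace shows the shape of the generated script: one _cmd call per chunk, each fed by a quoted here-document. A hand-written sketch of that shape, piped into a child shell, behaves the same way. The wrapper name run_generated and the fixed SPLIT delimiter are assumptions for illustration; the function builds its delimiter from the shell's PID, which makes a clash with a data line less likely.

```shell
#!/bin/sh
# run_generated: hypothetical demo. Feeds a child shell a script shaped
# like the one sed generates: a function definition, then one call per
# chunk with the chunk as a here-document.
run_generated() {
    sh <<'SCRIPT'
_cmd() { echo "incr #$((i=i+1))"; cat; }
_cmd <<"SPLIT"
this is line #1
this is line #2
SPLIT
_cmd <<"SPLIT"
this is line #3
SPLIT
SCRIPT
}

run_generated
```

Because _cmd runs in the child shell itself rather than in a subshell, the counter i carries over from one chunk to the next.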

REALLY FAST

time yes | sed = | sed -n 'p;n' |
_linc 4000 'printf "current line and char count\n"
    sed "1w /dev/fd/2" | wc -c
    [ $((i=i+1)) -ge 5000 ] && kill "$me" || echo "$i"'

    #OUTPUT

current line and char count
19992001
36000
4999
current line and char count
19996001
36000
current line and char count
[2]    17113 terminated  yes |
       17114 terminated  sed = |
       17115 terminated  sed -n 'p;n'
yes  0.86s user 0.06s system 5% cpu 16.994 total
sed =  9.06s user 0.30s system 55% cpu 16.993 total
sed -n 'p;n'  7.68s user 0.38s system 47% cpu 16.992 total

Above I tell it to increment every 4000 lines. 17s later I've processed 20 million lines. Of course the logic isn't serious there: we only read each line twice and count all of their characters, but the possibilities are pretty open. Also, if you look closely, you might notice that the filters providing the input seem to take most of the time anyway.
