How do I break up a file like split to stdout for piping to a command?
I think the easiest way to do this is:
while IFS= read -r line; do
    { printf '%s\n' "$line"; head -n 99; } |
        other_commands
done <database_file
You need to use read for the first line in each section, as there appears to be no other way to stop when the end of the file is reached. For more information see:
- Check if pipe is empty and run a command on the data if it isn't
- How to pipe output from one process to another but only execute if the first has output?
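As a rough, self-contained illustration of the loop above, the sketch below uses a throwaway file of ten lines, chunks of three, and paste standing in for other_commands. The function name chunk_demo and the temporary file are made up for the demonstration. It assumes the input is a regular file and that head leaves the file offset just past the lines it printed, which GNU head does; with a pipe as input, head may swallow more than its share.

```shell
chunk_demo() {
    tmp=$(mktemp) || return 1
    seq 10 >"$tmp"                      # stand-in for database_file
    while IFS= read -r line; do
        # read took the first line of the chunk; head supplies up to 2 more
        { printf '%s\n' "$line"; head -n 2; } |
            paste -sd, -                # stand-in for other_commands
    done <"$tmp"
    rm -f "$tmp"
}
chunk_demo
```

With GNU coreutils this should print 1,2,3 then 4,5,6 then 7,8,9 then 10, one chunk per line, so the final short chunk is still handed to the command.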
Basically, I'm looking for split that will output to stdout, not files.
If you have access to GNU split, the --filter option does exactly that:
‘--filter=command’
With this option, rather than simply writing to each output file, write
through a pipe to the specified shell command for each output file.
So in your case, you could either use those commands with --filter, e.g.
split -l 100 --filter='{ cat Header.sql; cat; } | sqlcmd; printf %s\\n DONE' infile
or write a script, e.g. myscript:
#!/bin/sh
{ cat Header.sql; cat; } | sqlcmd
printf %s\\n '--- PROCESSED ---'
and then simply run
split -l 100 --filter=./myscript infile
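To try --filter without a database at hand, here is a minimal sketch that splits ten numbered lines into chunks of three. It assumes GNU split; the function name split_demo is made up, and paste merely stands in for sqlcmd. GNU split exports $FILE to the filter, holding the name the chunk's output file would otherwise have had.

```shell
split_demo() {
    # GNU split only: each 3-line chunk is piped to the filter command
    # instead of being written to a file; the trailing - means "read stdin".
    seq 10 | split -l 3 --filter='printf "%s: " "$FILE"; paste -sd, -' -
}
split_demo
```

With the default prefix this should print xaa: 1,2,3 then xab: 4,5,6 then xac: 7,8,9 then xad: 10, one chunk per line.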
_linc() ( ${sh-da}sh ${dbg+-vx} 4<&0 <&3 ) 3<<-ARGS 3<<\CMD
set -- $( [ $((i=${1%%*[!0-9]*}-1)) -gt 1 ] && {
shift && echo "\${inc=$i}" ; }
unset cmd ; [ $# -gt 0 ] || cmd='echo incr "#$((i=i+1))" ; cat'
printf '%s ' 'me=$$ ;' \
'_cmd() {' '${dbg+set -vx ;}' "$@" "$cmd" '
}' )
ARGS
s= ; sed -f - <<-INC /dev/fd/4 | . /dev/stdin
i_cmd <<"${s:=${me}SPLIT${me}}"
${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
a$s
INC
CMD
The above function uses sed to apply its argument list as a command string at an arbitrary line increment. The commands you specify on the command line are sourced into a temporary shell function which is fed, once per increment, a here-document on stdin containing that increment's worth of lines.
You use it like this:
time printf 'this is line #%d\n' `seq 1000` |
_linc 193 sed -e \$= -e r \- \| tail -n2
#output
193
this is line #193
193
this is line #386
193
this is line #579
193
this is line #772
193
this is line #965
35
this is line #1000
printf 'this is line #%d\n' `seq 1000` 0.00s user 0.00s system 0% cpu 0.004 total
The mechanism here is very simple:
i_cmd <<"${s:=${me}SPLIT${me}}"
${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
a$s
That's the sed script. Basically we just printf the line $!n once for every step of the increment but the first. So if you set your increment to 100, printf will write you a sed script consisting of 99 lines that say only $!n (sed's own cycle reads the first line of each chunk), one insert line for the top end of the here-doc, and one append for the bottom line - that's it. Most of the rest just handles options.
The n (next) command tells sed to print the current line, delete it, and pull in the next one. The $! address specifies that it should only try this on any line but the last.
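To see the effect in isolation, here is a hand-written version of what the generated script looks like for an increment of 3, run on its own so that the shell text it emits is visible. This is only a sketch: END_OF_CHUNK is a placeholder delimiter (the function itself derives one from the shell's PID), wrap3 and wrap_demo are made-up names, and the portable multi-line i\ and a\ forms are used in place of the one-line forms.

```shell
# sed script for chunks of 3 lines: insert a here-doc header before the
# first line, step past two more lines with $!n, then append the delimiter.
wrap3='i\
_cmd <<"END_OF_CHUNK"
$!n
$!n
a\
END_OF_CHUNK'

wrap_demo() { seq 7 | sed -e "$wrap3"; }
wrap_demo
```

Each group of three input lines comes out wrapped between a _cmd <<"END_OF_CHUNK" line and an END_OF_CHUNK line, and the single leftover line 7 gets its own wrapper because $!n is skipped on the last line.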
Provided only an incrementer it will:
printf 'this is line #%d\n' `seq 10` |
_linc 3
#output
incr #1
this is line #1
this is line #2
this is line #3
incr #2
this is line #4
this is line #5
this is line #6
incr #3
this is line #7
this is line #8
this is line #9
incr #4
this is line #10
So what's happening behind the scenes here is the function is set to echo a counter and cat its input if not provided a command string. If you saw it on the command line it would look like:
{ echo "incr #$((i=i+1))" ; cat ; } <<HEREDOC
this is line #7
this is line #8
this is line #9
HEREDOC
It executes one of these for every increment. Look:
printf 'this is line #%d\n' `seq 10` |
dbg= _linc 3
#output
set -- ${inc=2}
+ set -- 2
me=$$ ; _cmd() { ${dbg+set -vx ;} echo incr "#$((i=i+1))" ; cat
}
+ me=19396
s= ; sed -f - <<-INC /dev/fd/4 | . /dev/stdin
i_cmd <<"${s:=${me}SPLIT${me}}"
${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
a$s
INC
+ s=
+ . /dev/stdin
+ seq 2
+ printf $!n\n%.0b 1 2
+ sed -f - /dev/fd/4
_cmd <<"19396SPLIT19396"
this is line #1
this is line #2
this is line #3
19396SPLIT19396
+ _cmd
+ set -vx ; echo incr #1
+ cat
this is line #1
this is line #2
this is line #3
_cmd <<"19396SPLIT19396"
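The trace boils down to two steps: define a function, then feed it one here-document per chunk. A stripped-down sketch of that flow, without the option handling or the file-descriptor juggling, might look like the following. The name linc_sketch and the fixed END_OF_CHUNK delimiter are made up for this sketch; unlike the PID-based delimiter above, a fixed one would misbehave if an input line happened to equal it.

```shell
# linc_sketch N 'COMMANDS': run COMMANDS once per N lines of stdin.
# N is assumed to be a positive integer.
linc_sketch() {
    n=$1 body=$2
    script=$(
        printf 'i\\\n_cmd <<"END_OF_CHUNK"\n'
        i=1
        while [ "$i" -lt "$n" ]; do printf '$!n\n'; i=$((i + 1)); done
        printf 'a\\\nEND_OF_CHUNK\n'
    )
    {
        printf '_cmd() {\n%s\n}\n' "$body"   # define the per-chunk function
        sed -e "$script"                     # then emit one here-doc per chunk
    } | sh
}
seq 7 | linc_sketch 3 'paste -sd, -'
```

Here the child sh reads the function definition first and then executes each _cmd call as sed produces it, so the example should print 1,2,3 then 4,5,6 then 7.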
REALLY FAST
time yes | sed = | sed -n 'p;n' |
_linc 4000 'printf "current line and char count\n"
sed "1w /dev/fd/2" | wc -c
[ $((i=i+1)) -ge 5000 ] && kill "$me" || echo "$i"'
#OUTPUT
current line and char count
19992001
36000
4999
current line and char count
19996001
36000
current line and char count
[2] 17113 terminated yes |
17114 terminated sed = |
17115 terminated sed -n 'p;n'
yes 0.86s user 0.06s system 5% cpu 16.994 total
sed = 9.06s user 0.30s system 55% cpu 16.993 total
sed -n 'p;n' 7.68s user 0.38s system 47% cpu 16.992 total
Above I tell it to increment every 4000 lines. 17s later I've processed 20 million lines. Of course the logic isn't serious there - we only read each line twice and count all of their characters - but the possibilities are pretty open. Also, if you look closely you might notice that the filters providing the input seem to be taking the majority of the time anyway.