How to grep-inverse-match and exclude "before" and "after" lines
You could use GNU `grep` with `-A` and `-B` to print exactly the parts of the file you want to exclude, but add the `-n` switch to also print the line numbers, and then format the output and pass it as a command script to `sed` to delete those lines:
grep -n -A1 -B2 PATTERN infile | \
sed -n 's/^\([0-9]\{1,\}\).*/\1d/p' | \
sed -f - infile
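To see what the first two stages feed to the final `sed -f -`, here is a sketch with a hypothetical sample: `seq 10` as infile and `5` as PATTERN:

```shell
# Hypothetical demo: infile is just seq 10, and the pattern is 5.
seq 10 > infile
grep -n -A1 -B2 5 infile |
sed -n 's/^\([0-9]\{1,\}\).*/\1d/p'
# grep -n marks the match as "5:5" and the context lines as "3-3",
# "4-4", "6-6"; the sed stage keeps only the leading number and
# appends "d", emitting the script:
#   3d
#   4d
#   5d
#   6d
# which the final "sed -f - infile" applies, leaving 1 2 7 8 9 10.
```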
This should also work with files of patterns passed to `grep` via `-f`, e.g.:
grep -n -A1 -B2 -f patterns infile | \
sed -n 's/^\([0-9]\{1,\}\).*/\1d/p' | \
sed -f - infile
I think this could be slightly optimized by collapsing any run of three or more consecutive line numbers into a range, so as to have e.g. `2,6d` instead of `2d;3d;4d;5d;6d`... though if the input has only a few matches it's not worth doing.
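That optimization could be sketched by inserting a small `awk` filter between the two `sed` stages; the run-collapsing logic below is my own illustration, again using a hypothetical `seq 10` infile with `5` as the pattern:

```shell
# Sketch: collapse consecutive line numbers into sed ranges before deleting.
# Hypothetical sample input; the pattern 5 matches line 5, so with -A1 -B2
# grep reports lines 3,4,5,6.
seq 10 > infile
grep -n -A1 -B2 5 infile |
sed -n 's/^\([0-9]\{1,\}\).*/\1/p' |
awk 'NR == 1     { s = p = $1; next }    # open a run at the first number
     $1 == p + 1 { p = $1; next }        # extend the run while consecutive
                 { print (s == p ? s "d" : s "," p "d"); s = p = $1 }
     END         { if (NR) print (s == p ? s "d" : s "," p "d") }' |
sed -f - infile
# here awk emits the single command "3,6d" instead of 3d;4d;5d;6d
```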
Other ways that don't preserve line order and are most likely slower:

with `comm`:
comm -13 <(grep PATTERN -A1 -B2 <(nl -ba -nrz -s: infile) | sort) \
<(nl -ba -nrz -s: infile | sort) | cut -d: -f2-
`comm` requires sorted input, which means the line order would not be preserved in the final output (unless your file is already sorted), so `nl` is used to number the lines before sorting, `comm -13` prints only lines unique to the 2nd FILE, and then `cut` removes the part that was added by `nl` (that is, the first field and the delimiter `:`).
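Here is a worked example of that pipeline, rewritten with temporary files instead of process substitution so it runs under plain `sh` (the `seq 10` infile and pattern `5` are hypothetical):

```shell
seq 10 > infile                            # hypothetical sample input; pattern is 5
nl -ba -nrz -s: infile > numbered          # 000001:1 ... 000010:10 - zero-padding keeps sorted order == file order
grep 5 -A1 -B2 numbered | sort > excluded  # the numbered lines to drop (here 000003: through 000006:)
sort numbered | comm -13 excluded - | cut -d: -f2-
# prints the survivors 1 2 7 8 9 10, one per line, in original order
```

Note that because the pattern is matched against the `nl`-numbered lines, it can in principle also match the number prefix itself; the same caveat applies to the process-substitution versions above.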
with `join`:
join -t: -j1 -v1 <(nl -ba -nrz -s: infile | sort) \
<(grep PATTERN -A1 -B2 <(nl -ba -nrz -s: infile) | sort) | cut -d: -f2-
don's might be better in most cases, but just in case the file is really big and you can't get `sed` to handle a script file that large (which can happen at around 5000+ lines of script), here it is with plain `sed`:
sed -ne:t -e"/\n.*$match/D" \
-e'$!N;//D;/'"$match/{" \
-e"s/\n/&/$A;t" \
-e'$q;bt' -e\} \
-e's/\n/&/'"$B;tP" \
-e'$!bt' -e:P -e'P;D'
This is an example of what is called a sliding window on input. It works by building a look-ahead buffer of `$B`-count lines before ever attempting to print anything.
And actually, I should probably clarify my previous point: the primary performance limiter for both this solution and don's is directly related to the interval. This solution slows with larger interval sizes, whereas don's slows with higher interval frequencies. In other words, even if the input file is very large, if the actual interval occurrence is still very infrequent then his solution is probably the way to go. However, if the interval size is relatively manageable and it is likely to occur often, then this is the solution you should choose.
So here's the workflow:
- If `$match` is found in pattern space preceded by a `\n`ewline, `sed` will recursively `D`elete every `\n`ewline that precedes it.
    - I was clearing `$match`'s pattern space out completely before - but to easily handle overlap, leaving a landmark seems to work far better.
    - I also tried `s/.*\n.*\($match\)/\1/` to try to get it in one go and dodge the loop, but when `$A`/`$B` are large, the `D`elete loop proves considerably faster.
- Then we pull in the `N`ext line of input preceded by a `\n`ewline delimiter and try once again to `D`elete a `/\n.*$match/` by referring to our most recently used regular expression with `//`.
- If pattern space matches `$match` then it can only do so with `$match` at the head of the line - all `$B`efore lines have been cleared.
    - So we start looping over `$A`fter.
    - Each run of this loop we'll attempt to `s///`ubstitute for `&` itself the `$A`th `\n`ewline character in pattern space and, if successful, `t`est will branch us - and our whole `$A`fter buffer - out of the script entirely to start the script over from the top with the next input line, if any.
    - If the `t`est is not successful we'll `b`ranch back to the `:t`op label and recurse for another line of input - possibly starting the loop over if `$match` occurs while gathering `$A`fter.
- If we get past a `$match` function loop, then we'll try to `p`rint the `$`last line if this is it, and if `!`not, try to `s///`ubstitute for `&` itself the `$B`th `\n`ewline character in pattern space.
    - We'll `t`est this, too, and if it is successful we'll branch to the `:P`rint label.
    - If not we'll branch back to the `:t`op label and get another input line appended to the buffer.
- If we make it to `:P`rint we'll `P`rint then `D`elete up to the first `\n`ewline in pattern space and rerun the script from the top with what remains.
And so this time, if we were doing `A=2 B=2 match=5; seq 5 | sed ...`, the pattern space for the first iteration at `:P`rint would look like:
^1\n2\n3$
And that's how `sed` gathers its `$B`efore buffer. And so `sed` prints to output `$B`-count lines behind the input it has gathered. This means that, given our previous example, `sed` would `P`rint `1` to output, and then `D`elete that and send back to the top of the script a pattern space which looks like:
^2\n3$
...and at the top of the script the `N`ext input line is retrieved, so the next iteration looks like:
^2\n3\n4$
And so when we find the first occurrence of `5` in input, the pattern space actually looks like:
^3\n4\n5$
Then the `D`elete loop kicks in, and when it's through it looks like:
^5$
And when the `N`ext input line is pulled, `sed` hits EOF and quits. By that time it has only ever `P`rinted lines 1 and 2.
Here's an example run:
A=8 B=7 match='[24689]0'
seq 100 |
sed -ne:t -e"/\n.*$match/D" \
-e'$!N;//D;/'"$match/{" \
-e"s/\n/&/$A;t" \
-e'$q;bt' -e\} \
-e's/\n/&/'"$B;tP" \
-e'$!bt' -e:P -e'P;D'
That prints:
1
2
3
4
5
6
7
8
9
10
11
12
29
30
31
32
49
50
51
52
69
70
71
72
99
100
If you don't mind using `vim`:
$ export PAT=fff A=1 B=2
$ vim -Nes "+g/${PAT}/.-${B},.+${A}d" '+w !tee' '+q!' foo
aaa
bbb
ccc
hhh
iii
- `-Nes` turns on non-compatible, silent ex mode. Useful for scripting.
- `+{command}` tells vim to run `{command}` on the file.
- `g/${PAT}/` - on all lines matching `/fff/`. This gets tricky if the pattern contains regular expression special characters that you didn't intend to treat that way.
- `.-${B}` - from 2 lines above this one.
- `.+${A}` - to 1 line below this one (see `:he cmdline-ranges` for these two).
- `d` - delete the lines.
- `+w !tee` then writes to standard output.
- `+q!` quits without saving changes.
You can skip the variables and use the pattern and numbers directly. I used them just for clarity of purpose.