Deleting a block of lines that starts at a specific word and runs until the next similar block (the next "section header")
awk is usually easier to read and understand.
Here is a simple program that prints by default. It sets a flag, weprint, to 0 (= off, we will not print) when it sees a line whose first word is "gene", and sets it back to 1 when it sees a line whose first word is "CDS" or "mRNA":
awk '
BEGIN { weprint=1 }
( $1 == "gene" ) { weprint=0 }
( $1 == "CDS" ) || ( $1 == "mRNA" ) { weprint=1 }
( weprint == 1) { print $0 ;}
' file_to_read
BEGIN is done before any lines are read.
The other ( test ) { action if test successful } rules are evaluated, in order, for each line of input (unless an action contains next, which skips the remaining rules and fetches the next line of input).
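This prints only the "CDS" and "mRNA" sections and drops the "gene" sections. Any lines before the first "gene" line are printed too, because weprint starts at 1.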
This could be "golfed": for example, the default action for a successful test is to print $0, so the last line could be just ( weprint == 1 ). But that would be harder to grasp, imo.
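To see the toggle in action, here is a small sketch on made-up input. The sample lines, the file name sample.txt and the filter_genes wrapper are assumptions for illustration, not taken from the question:

```shell
# Hypothetical sample in a GenBank-like feature layout (not the original data).
cat > sample.txt <<'EOF'
     gene            1..100
                     /gene="abc"
     CDS             1..100
                     /product="x"
     gene            200..300
                     /gene="def"
     mRNA            200..300
                     /product="y"
EOF

# Hypothetical wrapper around the same awk program, so it can be reused.
filter_genes() {
  awk '
    BEGIN { weprint = 1 }                          # print by default
    $1 == "gene" { weprint = 0 }                   # a gene header switches printing off
    $1 == "CDS" || $1 == "mRNA" { weprint = 1 }    # a CDS or mRNA header switches it back on
    weprint == 1 { print $0 }                      # print while the flag is on
  ' "$1"
}

filter_genes sample.txt
```

With this input, only the CDS and mRNA header lines and their /product lines should come out; both gene headers and their /gene lines are dropped.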
sed -e '
/^ *gene /!b # print non-gene block begin lines
:a
$d; N # do-while loop accumulates lines for gene block
s/\n *\///;ta
D # clip the gene block
' yourfile
You need to realize that the sed model is to read a file line by line, and the sed commands in the -e section are applied in sequence to the line as it gets transformed, unless branching instructions are involved. The basic syntax of sed is address command, where command can be any valid sed command and address can be any of these: linenum, $ (= last line), regex, a range of addresses, or nothing at all, meaning the command gets applied to ALL lines. Note that the current line is stored in a register called the pattern space.
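As a quick illustration of those address forms, here is a minimal sketch using the p command with -n (which suppresses the automatic print). The four-line input and the file name lines.txt are made up for the example:

```shell
# Hypothetical four-line input used only for this illustration.
printf 'one\ntwo\nthree\nfour\n' > lines.txt

sed -n '2p' lines.txt       # line number address: prints "two"
sed -n '$p' lines.txt       # $ is the last line: prints "four"
sed -n '/^t/p' lines.txt    # regex address: prints "two" and "three"
sed -n '2,3p' lines.txt     # range of addresses: prints "two" and "three"
sed -n 'p' lines.txt        # no address: the command applies to every line
sed -n '/^t/!p' lines.txt   # ! negates the address: prints "one" and "four"
```

The last form, an address followed by !, is the one the script above relies on in its first line.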
So with that basic stuff out of the way, we go to the actual sed -e code at hand:
b => branch to the end of the sed code and print the pattern space. This means we keep printing any line that does NOT (the ! after the address pattern) have the string gene as its first field.
When we finally hit a line with gene in the first field, we set up a do-while loop (:a sets a mark to be jumped to) that keeps accumulating lines into the pattern space (N appends the next line; the s command removes \n *\/, which is the line break followed by spaces and a /) until one of two things happens.
Either we hit the eof, in which case we delete the pattern space ($d => delete pattern space if we are at the last line), since this is a gene block that runs to the eof and must go.
Or we hit the beginning of the next block: as long as s finds and removes the said pattern, t jumps back to :a; otherwise (a new block, so the pattern was not found) we continue. Now the pattern space holds the whole gene block, joined into one line, followed by the first line of the next block. We promptly delete the gene block and, with the beginning of the next block still in the pattern space, go back to the top of the sed code (that's what the D command does).
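The walkthrough can be checked on a small made-up input. This is only a sketch: the sample lines and the file name features.txt are assumptions, and the inline comments are dropped with each command on its own line, since a sed that follows POSIX strictly may read text after b as part of the label:

```shell
# Hypothetical sample in a GenBank-like feature layout (not the original data).
cat > features.txt <<'EOF'
     gene            1..100
                     /gene="abc"
     CDS             1..100
                     /product="x"
     gene            200..300
                     /gene="def"
     mRNA            200..300
                     /product="y"
EOF

# Same commands as the script above, one per line and without comments.
sed -e '
/^ *gene /!b
:a
$d
N
s/\n *\///
ta
D
' features.txt
```

On this input the two gene blocks (header plus /gene line) should disappear, leaving the CDS and mRNA headers and their /product lines.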