Is there a way to remove only the nested brackets, not all of them?
bracket.awk:
BEGIN{quote=1}
{
for(i=1;i<=length;i++){
ch=substr($0,i,1)
pr=1
if(ch=="\""){quote=!quote}
else if(ch=="[" && quote){brk++;pr=brk<2}
else if(ch=="]" && quote){brk--;pr=brk<1}
if(pr){printf "%s",ch}
}
print ""
}
$ awk -f bracket.awk file
["q", "0", "R", "L"], ["q", "1", "[", "]"], ["q", "2", "L", "R"], ["q", "3", "R", "L"]
The idea behind it:
Initialize quote=1. Read the file char-wise. Whenever a quote is found, invert the quote variable (if 1, it becomes 0, and vice-versa).
Then, brackets are only counted if quote is set to 1, and excess brackets are not printed, according to the brk counter.
The print "" statement is just to add a newline, as the printf above does not do it.
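To watch the two state variables at work, here is a tracing variant of the script above, run on a short made-up input. The trace function name and the sample string are only illustrative, they are not from the question. For each character it prints the values of quote and brk after that character was processed, and whether the character is kept:

```shell
# trace: hypothetical helper, same logic as bracket.awk but it reports
# the state after each character instead of printing the cleaned line
trace() {
  awk 'BEGIN{quote=1}
  {
    for(i=1;i<=length($0);i++){
      ch=substr($0,i,1)
      pr=1
      if(ch=="\""){quote=!quote}
      else if(ch=="[" && quote){brk++;pr=brk<2}
      else if(ch=="]" && quote){brk--;pr=brk<1}
      printf "%s quote=%d brk=%d %s\n", ch, quote, brk, (pr ? "keep" : "drop")
    }
  }'
}

# made-up sample: one nested pair around a quoted "["
printf '%s\n' '[["["]]' | trace
```

On that sample the second [ and the first unquoted ] are reported as dropped, and the kept characters spell ["["]. The [ inside the quotes never touches brk because quote is 0 at that point.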
With perl:
perl -pe '
s{([^]["]+|"[^"]*")|\[(?0)*\]}
{$1 // "[". ($& =~ s/("[^"]*"|[^]["]+)|./$1/gr) . "]"}ge'
That makes use of perl's recursive regexp.
The outer s{regex}{replacement-code}ge tokenises the input into either:
- any sequence of characters other than [, ] or "
- a quoted string
- a [...] group (using recursion in the regexp to find the matching ])
Then, we replace that token with itself if it's in the first two categories ($1). If not, we replace it with the token stripped of its non-quoted [ and ] (using the same tokenising technique in the inner substitution) and wrapped in a single [...] pair again.
To handle escaped quotes and backslashes within quotes (like "foo\"bar\\"), replace [^"] with (?:[^\\"]|\\.).
With sed
If your sed supports the -E or -r options to work with extended regexps instead of basic ones, you could do it with a loop, replacing the innermost [...]s first:
LC_ALL=C sed -E '
:1
s/^(("[^"]*"|[^"])*\[("[^"]*"|[^]"])*)\[(("[^"]*"|[^]["])*)\]/\1\4/
t1'
(using LC_ALL=C to speed it up and make it equivalent to the perl one which also ignores the user's locale when it comes to interpreting bytes as characters).
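To see what the loop does on each round, the same substitution can be applied one pass at a time from the shell. This is only a sketch: one_pass is a made-up name and the sample line is hypothetical.

```shell
# one_pass: the substitution from above, without the ":1 ... t1" loop,
# so each call removes at most one nested [...] pair
one_pass() {
  LC_ALL=C sed -E 's/^(("[^"]*"|[^"])*\[("[^"]*"|[^]"])*)\[(("[^"]*"|[^]["])*)\]/\1\4/'
}

line='[[["a", "]"]], ["b"]]'   # made-up sample input
while :; do
  printf '%s\n' "$line"
  next=$(printf '%s\n' "$line" | one_pass)
  [ "$next" = "$line" ] && break   # nothing left to remove
  line=$next
done
```

With that sample the successive lines should be:

[[["a", "]"]], ["b"]]
[["a", "]"], ["b"]]
["a", "]", ["b"]]
["a", "]", "b"]

Each pass removes one pair of brackets that sits inside a still-open [ and contains no unquoted brackets itself. The quoted "]" is never touched.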
POSIXly, it should still be doable with something like:
LC_ALL=C sed '
:1
s/^\(\(\("[^"]*"\)*[^"]*\)*\[\(\("[^"]*"\)*[^]"]*\)*\)\[\(\(\("[^"]*"\)*[^]["]*\)*\)\]/\1\6/
t1'
Here using \(\(a\)*\(b\)*\)* in place of (a|b)* as basic regexps don't have an alternation operator (the BREs of some sed implementations have \| for that, but that's not POSIX/portable).
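A quick way to convince yourself that the two forms accept the same strings is to compare them with grep -x on a few made-up inputs. The bre_ab and ere_ab function names and the test strings are only for illustration:

```shell
# bre_ab / ere_ab: hypothetical helpers, succeed if the whole input line
# consists only of a's and b's
bre_ab() { grep -qx '\(\(a\)*\(b\)*\)*'; }   # BRE, no alternation operator
ere_ab() { grep -qxE '(a|b)*'; }             # ERE, with alternation

for s in abba bab abca; do
  if printf '%s\n' "$s" | bre_ab; then bre=match; else bre='no match'; fi
  if printf '%s\n' "$s" | ere_ab; then ere=match; else ere='no match'; fi
  printf '%s: BRE %s, ERE %s\n' "$s" "$bre" "$ere"
done
```

Both forms should match abba and bab, and reject abca because of the c.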
This gawk is inelegant to say the least, and it will break if you even look at it too long, so you don't need to tell me... just have a quiet and self-satisfied chuckle that you can do better.
But as it more or less works (on Wednesdays and Fridays during months with a J in them) and consumed 20 minutes of my life, I am posting it anyway.
Schroedinger's awk
(Thx @edmorton)
awk -F"\\\], \\\[" '
{printf "[";
for (i=1; i<=NF; i++) {
cs=split($i,c,",");
for (j=1; j<=cs; j++){
sub("^ *\\[+","",c[j]); sub("\\]+$","",c[j]);
t=(j==cs)?"]"((i<(NF-1))?", [":""):",";
printf "%s%s", c[j], t
}}print ""}' file
["q", "0", "R", "L"], ["q","1", "[", "]"], ["q","2", "L", "R"], ["q","3","R", "L"]
Walkthrough
Split the fields (-F) on ], [ which needs to be escaped to hell and back in order to get your final element groups in the fields.
Then split on , to get the elements and consume any leading ^[ or trailing ]$ from each element, then re-aggregate the split with , as a separator and finally re-aggregate the fields using a conditional combination of ] and , [.
Heisenberg's sed
If you pipe to sed it's slightly tidier:
awk 'BEGIN{FS="\\], \\["}{for (i=1; i<=NF; i++) print $i}' file |
sed -E "s/(^| |,)\[+(\")/\1\2/g ;s/\]+(,|$)/\1/g" |
awk 'BEGIN{RS=""; FS="\n";OFS="], ["}{$1=$1; print "["$0"]"}'
["q", "0", "R", "L"], ["q", "1", "[", "]"], ["q", "2", "L", "R"], ["q", "3", "R", "L"]
Does the same job as the first version: the first awk splits out the fields as before, sed loses the excess [ and ] and the final awk recomposes the elements by redefining RS, FS and OFS.
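To make the three steps visible, the pipeline can be cut into its stages and run on a short made-up line. The stage1, stage2 and stage3 names and the sample input are only for illustration. The commands themselves are the ones from the pipeline above.

```shell
# the three stages of the pipeline above, as hypothetical helper functions
stage1() { awk 'BEGIN{FS="\\], \\["}{for (i=1; i<=NF; i++) print $i}'; }
stage2() { sed -E "s/(^| |,)\[+(\")/\1\2/g ;s/\]+(,|$)/\1/g"; }
stage3() { awk 'BEGIN{RS=""; FS="\n";OFS="], ["}{$1=$1; print "["$0"]"}'; }

in='[["q", "0"]], [[["q", "[", "]"]]]'   # made-up sample line

echo 'after the first awk:'
printf '%s\n' "$in" | stage1
echo 'after sed:'
printf '%s\n' "$in" | stage1 | stage2
echo 'after the final awk:'
printf '%s\n' "$in" | stage1 | stage2 | stage3
```

On that sample the first awk should give two lines, [["q", "0"] and [["q", "[", "]"]]], sed should reduce them to "q", "0" and "q", "[", "]", and the final awk should join them back into ["q", "0"], ["q", "[", "]"].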