Is there a variable that stores the exact text matched by the field separator FS when FS is a regular expression, equivalent to RT for RS?
Is there a way to "repack" the fields using the specific separator that split each one of them?
Using GNU awk, split() takes an optional 4th parameter: an array that receives the delimiter text matched by the supplied regex:
s="hello;how|are you"
awk 'split($0, flds, /[;|]/, seps) {for (i=1; i in seps; i++) printf "%s%s", flds[i], seps[i]; print flds[i]}' <<< "$s"
hello;how|are you
A more readable version:
s="hello;how|are you"
awk 'split($0, flds, /[;|]/, seps) {
    for (i=1; i in seps; i++)
        printf "%s%s", flds[i], seps[i]
    print flds[i]
}' <<< "$s"
Take note of the 4th parameter of split(), seps: it is an array that stores the text matched by the regular expression given as the 3rd parameter, i.e. /[;|]/.
Of course, the split() approach is not as short and simple as the record-level equivalent with RS, ORS and RT, which can be written as:
awk -v RS='[;|]' '{ORS = RT} 1' <<< "$s"
As @anubhava mentions, gawk has split() (and patsplit(), which is to FPAT as split() is to FS; see https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions) to do what you want. If you want the same functionality with a POSIX awk then:
$ cat tst.awk
function getFldsSeps(str,flds,fs,seps,    nf) {
    delete flds
    delete seps
    if ( fs == " " ) {
        # Default FS: split on runs of white space, keeping any leading run in seps[0]
        fs = "[[:space:]]+"
        if ( match(str,"^"fs) ) {
            seps[0] = substr(str,RSTART,RLENGTH)
            str = substr(str,RSTART+RLENGTH)
        }
    }
    # Peel off one field and the separator that follows it per iteration
    while ( match(str,fs) ) {
        flds[++nf] = substr(str,1,RSTART-1)
        seps[nf] = substr(str,RSTART,RLENGTH)
        str = substr(str,RSTART+RLENGTH)
    }
    # Whatever is left after the last separator is the final field
    if ( str != "" ) {
        flds[++nf] = str
    }
    return nf
}
{
    print
    nf = getFldsSeps($0,flds,FS,seps)
    for (i=0; i<=nf; i++) {
        printf "{%d:[%s]<%s>}%s", i, flds[i], seps[i], (i<nf ? "" : ORS)
    }
}
Note the specific handling above of the case where the field separator is " ", because that means 2 things different from all other field separator values:
- Fields are actually separated by chains of any white space, and
- Leading white space is to be ignored when populating $1 (or flds[1] in this case) and so that white space, if it exists, must be captured in seps[0] for our purposes since every seps[N] is associated with the flds[N] that precedes it.
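As a quick check of those two points, this small sketch compares the default FS with a regex FS on the same line. The sample line and the variable names are made up for illustration, and it should work in any POSIX awk.

```shell
line='  hello   how'
# Default FS (" "): leading white space is skipped and runs of blanks act as one separator.
default_fs=$(printf '%s\n' "$line" | awk '{ printf "%d:[%s]", NF, $1 }')
# Regex FS: the leading blanks now count as a separator, so $1 is empty.
regex_fs=$(printf '%s\n' "$line" | awk -F'[ ]+' '{ printf "%d:[%s]", NF, $1 }')
echo "$default_fs"   # 2:[hello]
echo "$regex_fs"     # 3:[]
```

The extra empty first field in the regex case is the reason the leading white space needs its own slot, seps[0], when FS is " ".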
For example, running tst.awk on these 3 input files:
$ head file{1..3}
==> file1 <==
hello;how|are you
==> file2 <==
hello how are_you
==> file3 <==
hello how are_you
we'd get the following output, where each field is displayed as the field number, then the field value within [...], then the separator within <...>, all within {...} (note that seps[0] is populated if and only if the FS is " " and the record starts with white space):
$ awk -F'[,|]' -f tst.awk file1
hello;how|are you
{0:[]<>}{1:[hello;how]<|>}{2:[are you]<>}
$ awk -f tst.awk file2
hello how are_you
{0:[]<>}{1:[hello]< >}{2:[how]< >}{3:[are_you]<>}
$ awk -f tst.awk file3
hello how are_you
{0:[]< >}{1:[hello]< >}{2:[how]< >}{3:[are_you]<>}
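Once you have both arrays, rebuilding the record is just a matter of printing seps[0] and then each flds[i] followed by seps[i]. Below is a rough, self-contained sketch of that idea. The shell wrapper repack and the awk function splitKeep are hypothetical names of mine (splitKeep is modelled on getFldsSeps above), upper-casing the second field is only an example edit, and I'm assuming "[ \t\n]+" as a portable stand-in for the default white-space separator.

```shell
# repack FS : read records on stdin, upper-case field 2, print with the original separators
repack() {
    awk -v FS="$1" '
    # Fills flds[1..nf] and seps[0..nf]; seps[i] is the separator that follows flds[i].
    function splitKeep(str, flds, fs, seps,    nf) {
        split("", flds); split("", seps)          # portable way to empty both arrays
        nf = 0
        if (fs == " ") {                          # default FS: runs of white space
            fs = "[ \t\n]+"
            if (match(str, "^" fs)) {             # leading white space goes into seps[0]
                seps[0] = substr(str, RSTART, RLENGTH)
                str = substr(str, RSTART + RLENGTH)
            }
        }
        while (match(str, fs)) {
            flds[++nf] = substr(str, 1, RSTART - 1)
            seps[nf]   = substr(str, RSTART, RLENGTH)
            str = substr(str, RSTART + RLENGTH)
        }
        if (str != "") flds[++nf] = str
        return nf
    }
    {
        nf = splitKeep($0, flds, FS, seps)
        flds[2] = toupper(flds[2])                # example edit on one field
        out = seps[0]
        for (i = 1; i <= nf; i++) out = out flds[i] seps[i]
        print out
    }'
}

printf 'hello;how|are you\n' | repack '[;|]'      # hello;HOW|are you
printf '  hello   how are_you\n' | repack ' '     # "  hello   HOW are_you", spacing preserved
```

Editing flds[] rather than $2 is deliberate here: assigning to a field makes awk rebuild $0 with OFS, which would throw away the original separators.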
An alternative to split() is to use match() to find the field separators and store them in an array:
awk -F'[;|]' '{
    str = $0                                       # Work on a copy of the line
    cnt = 0                                        # Reset the separator count for each record
    while (match(str, FS)) {                       # Loop through each match of the field separator
        map[++cnt] = substr(str, RSTART, RLENGTH)  # Store the matched separator in the array map
        str = substr(str, RSTART + RLENGTH)        # Set str to the rest of the string after the match
    }
    for (i = 1; i <= NF; i++) {                    # Loop through each field
        printf "%s%s", $i, (i <= cnt ? map[i] : "")  # Print the field followed by its separator, if any
    }
    printf "\n"
}' <<< "hello;how|are you"