Extract number of length n from field and return string

If I understand correctly, you want the 5th column to be replaced with the space-separated list of all the 6-digit numbers it contains.

Maybe:

perl -F'\t' -lape '
   $F[4] = join " ", grep {length == 6} ($F[4] =~ /\d+/g);
   $_ = join "\t", @F' < file

Or, reusing your negative lookaround operators:

perl -F'\t' -lape '
   $F[4] = join " ", ($F[4] =~ /(?<!\d)\d{6}(?!\d)/g);
   $_ = join "\t", @F' < file

With awk:

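awk -F'\t' -v OFS='\t' '
  {
    repl = sep = ""
    # Consume $5 one run of digits at a time.
    while (match($5, /[0-9]+/)) {
      if (RLENGTH == 6) {
        # Keep runs of exactly 6 digits, space-separated.
        repl = repl sep substr($5, RSTART, RLENGTH)
        sep = " "
      }
      # Drop everything up to and including the run just examined.
      $5 = substr($5, RSTART+RLENGTH)
    }
    $5 = repl
    print
  }' < file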
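
As a quick check, the awk approach can be wrapped in a function and run on made-up input. The function name keep6 and the sample lines are assumptions for illustration only, not part of your data:

```shell
# keep6 is a hypothetical wrapper name; it reads tab-separated input on stdin.
keep6() {
  awk -F'\t' -v OFS='\t' '
    {
      repl = sep = ""
      while (match($5, /[0-9]+/)) {
        if (RLENGTH == 6) {
          repl = repl sep substr($5, RSTART, RLENGTH)
          sep = " "
        }
        $5 = substr($5, RSTART+RLENGTH)
      }
      $5 = repl
      print
    }'
}

# Made-up sample: line 1 has two 6-digit numbers among others, line 2 has none.
out=$(printf 'a\tb\tc\td\tfoo 12345 123456 1234567,654321\tf\n1\t2\t3\t4\tnone\t6\n' | keep6)
printf '%s\n' "$out"
```

Note that a 5th field without any 6-digit number ends up empty, as on the second sample line.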

grep itself is not well suited to the task: it is meant to print the lines that match a pattern. Some implementations, such as GNU grep, ast-open grep, or pcregrep, can extract the matching strings from those lines, but that ability is quite limited.
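
For illustration, this is roughly what extraction with the -o option looks like on made-up input. It assumes a grep that supports -o (GNU grep among others; it is not a POSIX option). Each match comes out on its own line, so the information about which input line a number came from is lost:

```shell
# Made-up input: two lines, holding two and one 6-digit numbers respectively.
out=$(printf '%s\n' 'x 111111 222222' 'y 333333' | grep -oE '[0-9]{6}')

# Three output lines for two input lines: the per-line grouping is gone.
printf '%s\n' "$out"
```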

The only cut+grep+paste approach I can think of that could work, with some restrictions, relies on the pcregrep implementation of grep:

n='(?:.*?((?1)))?'
paste <(< file cut -f1-4) <(< file cut -f5 |
  pcregrep --om-separator=" " -o1 -o2 -o3 -o4 -o5 -o6 -o7 -o8 -o9 \
    "((?<!\d)\d{6}(?!\d))$n$n$n$n$n$n$n$n"
  ) <(< file cut -f6-)

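That assumes that every line of input has at least 6 fields and that the 5th field of each contains between one and nine 6-digit numbers: for a field with none, pcregrep would print no line at all and paste would misalign the columns, and any number past the ninth would be dropped since the pattern only has nine capture groups.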

