Join lines of text with repeated beginning

This is a standard job for awk:

awk '
{
  k=$2
  for (i=3;i<=NF;i++)
    k=k " " $i
  if (!($1 in a))   # "in" tests for the key itself, so stored text of "0" or "" is not overwritten
    a[$1]=k
  else
    a[$1]=a[$1] "<br>" k
}
END{
  for (i in a)
    print i "\t" a[i]
}' long.text.file
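
Note that for (i in a) visits the keys in an unspecified order, so the output lines may come out in a different order than the words first appeared. A minimal sketch of one way to keep first-seen order, assuming a second array is acceptable (the function name join_keep_order and the array name order are made up for this illustration):

```shell
# join_keep_order: hypothetical helper name, not part of the answer above.
# Joins lines that share a first field and prints keys in first-seen order.
join_keep_order() {
  awk '
  {
    k = $2
    for (i = 3; i <= NF; i++)
      k = k " " $i
    if ($1 in a)
      a[$1] = a[$1] "<br>" k
    else {
      a[$1] = k
      order[++n] = $1     # remember the order in which keys first appear
    }
  }
  END {
    for (j = 1; j <= n; j++)
      print order[j] "\t" a[order[j]]
  }' "$@"
}

# Small demonstration with tab-separated input generated by printf.
printf 'b\tone\na\ttwo\nb\tthree\n' | join_keep_order
```

Called as join_keep_order long.text.file it reads the file instead of standard input.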

If the file is sorted by the first word of each line (so that equal words are adjacent), the script can be simpler:

awk '
{
  if($1==k)
    printf("%s","<br>")
  else {
    if(NR!=1)
      print ""
    printf("%s\t",$1)
  }
  for(i=2;i<NF;i++)
    printf("%s ",$i)
  printf("%s",$NF)
  k=$1
}
END{
print ""
}' long.text.file
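
If the file is not already grouped, one option is to sort it on the first field before it reaches the script. The following is only a sketch under some assumptions: the wrapper name group_then_join is made up, and the -s (stable) option is an extension found in GNU and BSD sort rather than something POSIX requires.

```shell
# group_then_join: hypothetical wrapper name, not part of the answer above.
group_then_join() {
  # LC_ALL=C compares keys byte by byte, so only identical words end up adjacent.
  # -s (assumed available: GNU/BSD extension) keeps lines with equal keys
  # in their original relative order.
  LC_ALL=C sort -s -k1,1 "$@" |
  awk '
  {
    if ($1 == k)
      printf("%s", "<br>")
    else {
      if (NR != 1)
        print ""
      printf("%s\t", $1)
    }
    for (i = 2; i < NF; i++)
      printf("%s ", $i)
    printf("%s", $NF)
    k = $1
  }
  END {
    print ""
  }'
}

# Small demonstration with unsorted, tab-separated input.
printf 'b\tone\na\ttwo\nb\tthree\n' | group_then_join
```

The price is that the output follows sort order rather than the order of the input file.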

Or just bash

unset n last
while read -r word definition
do
    if [ "$last" = "$word" ]
    then
        printf "<br>%s" "$definition"
    else 
        if [ "$n" ]
        then
            echo
        else
            n=1
        fi
        printf "%s\t%s" "$word" "$definition"
        last="$word"
    fi
done < long.text.file
echo
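
One thing to watch with a while read loop is a final line that lacks a trailing newline: read returns non-zero for it, so the loop body is skipped for that line. A hedged sketch of the same loop with the common `|| [ -n "$word" ]` guard, wrapped in a function that reads standard input (the names join_lines and started are made up here):

```shell
# join_lines: hypothetical function name, not part of the answer above.
join_lines() {
  started=
  last=
  # read fails on an unterminated last line but still fills "word",
  # so the extra test lets the loop body run for that line too.
  while read -r word definition || [ -n "$word" ]
  do
    if [ -n "$started" ] && [ "$word" = "$last" ]
    then
      printf '<br>%s' "$definition"
    else
      if [ -n "$started" ]
      then
        echo
      fi
      printf '%s\t%s' "$word" "$definition"
      last=$word
      started=1
    fi
  done
  # Terminate the final output line only if something was printed.
  if [ -n "$started" ]
  then
    echo
  fi
}

# Demonstration: the last input line has no trailing newline.
printf 'a\tx\na\ty\nb\tz' | join_lines
```

Usage with a file would be join_lines < long.text.file.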

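With perl, reading the whole input as a single record (-0); pass the file name as an argument or feed it on standard input:

perl -p0E 'while(s/^((.+?)\t.*)\n\2\t/$1<br>/gm){}'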

(It takes 2 s to process a 23 MB, 1.5-million-line dictionary on my 6-year-old laptop.)

With sed:

sed '$!N;/^\([^\t]*\t\)\(.*\)\(\n\)\1/!P;s//\3\1\2<br>/;D' <<\IN
word1  some text
word1  some other text
word1  some other other text
word2  more text
word3  even more
word3  and still more
IN

(Note: many seds do not recognise the \t escape used above; with those, put a literal <tab> character in its place.)
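
One way to get a literal tab into the script without typing it is to let printf produce it and splice it into the sed expression. This is a sketch only; the function name join_sed and the variable name tab are made up for the illustration.

```shell
# join_sed: hypothetical wrapper name, not part of the answer above.
join_sed() {
  # Hold a literal tab character; printf understands \t even where sed does not.
  tab=$(printf '\t')
  # Same script as above, with each \t replaced by the literal tab in $tab.
  sed '$!N;/^\([^'"$tab"']*'"$tab"'\)\(.*\)\(\n\)\1/!P;s//\3\1\2<br>/;D' "$@"
}

# Demonstration with tab-separated input generated by printf.
printf 'word1\tsome text\nword1\tsome other text\nword2\tmore text\n' | join_sed
```

The single-quoted pieces keep $!N and the backslashes away from the shell, while the double-quoted "$tab" is the only part the shell expands.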

And if you have GNU sed you can write it a little more simply:

sed -E '$!N;/^(\S+\t)(.*)\n\1/!P;s//\n\1\2<br>/;D' <infile

It works by gradually stacking input as it is read. If two consecutive lines do not begin with the same non-space string, the first of them is Printed. Otherwise the newline between them is moved to the head of the line, and the repeated string that followed it (including the tab) is replaced with the string <br>.

Note that the stacking method used here could have performance implications if the line that sed assembles grows very long. If it grows longer than 8192 bytes, it exceeds the minimum pattern space size that POSIX requires a sed to support.

Regardless of which of the two cases occurred, sed finally Deletes up to and including the first newline in pattern space and starts over with what remains. So when two consecutive lines do not begin with identical strings, the first is printed and then deleted; otherwise the substitution is performed and the Delete removes only the newline that formerly separated them.
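
To gauge whether the joined lines come anywhere near the POSIX minimum noted above, one rough check is to measure the longest line of the output in bytes. A minimal sketch, with a made-up helper name longest_line; it measures finished output lines, which is only an approximation of what the pattern space held while joining:

```shell
# longest_line: hypothetical helper name, not part of the answer above.
longest_line() {
  # LC_ALL=C makes length() count bytes rather than multibyte characters.
  LC_ALL=C awk '{ n = length($0); if (n > max) max = n } END { print max + 0 }' "$@"
}

# Demonstration on three short lines.
printf 'ab\nabcde\nabc\n' | longest_line
```

Piping the sed command's output into longest_line gives a figure to compare against that minimum.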

And so the command above prints:

word1  some text<br>some other text<br>some other other text
word2  more text
word3  even more<br>and still more

I used a <<\HERE_DOC for input above, but you should probably drop everything from <<\IN on and use </path/to/infile instead.
