Join lines of text with repeated beginning
This is a standard job for awk:
awk '
{
k=$2
for (i=3;i<=NF;i++)
k=k " " $i
if (!($1 in a))  # not !a[$1]: a stored "0" or empty string would test false
a[$1]=k
else
a[$1]=a[$1] "<br>" k
}
END{
for (i in a)
print i "\t" a[i]
}' long.text.file
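One caveat: awk walks the keys in for (i in a) in an unspecified order, so the output lines can come out in a different order from the input. If the order matters, here is a minimal sketch that remembers when each first word was first seen. The function name join_keep_order and the arrays text and order are just names picked for illustration; they are not part of the script above.

```shell
# Sketch: same joining as above, but output follows first-seen order.
join_keep_order() {
  awk '
  {
    key = $1
    rest = $2
    for (i = 3; i <= NF; i++)
      rest = rest " " $i
    if (key in text) {
      text[key] = text[key] "<br>" rest
    } else {
      order[++n] = key        # remember the position of a new key
      text[key] = rest
    }
  }
  END {
    for (j = 1; j <= n; j++)
      print order[j] "\t" text[order[j]]
  }' "$@"
}

# Demo: reads stdin when no file is given.
printf 'w2 b\nw1 a\nw2 c d\n' | join_keep_order
```

Call it as join_keep_order long.text.file, or pipe input into it as in the demo line.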
If the file is sorted by the first word of each line, the script can be simpler:
awk '
{
if($1==k)
printf("%s","<br>")
else {
if(NR!=1)
print ""
printf("%s\t",$1)
}
for(i=2;i<NF;i++)
printf("%s ",$i)
printf("%s",$NF)
k=$1
}
END{
print ""
}' long.text.file
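If the file is not sorted yet, a stable sort on the first field can go in front. This is only a sketch under a few assumptions: the first word is followed by a tab, sort supports the -s (stable) option (GNU and BSD sort do, POSIX does not require it), and join_sorted is a made-up name. Unlike the script above, it leaves the text after the tab untouched instead of re-joining the fields with single spaces.

```shell
tab=$(printf '\t')    # literal tab for sort -t

join_sorted() {
  # LC_ALL=C makes sort compare bytes, so equal keys are exactly equal strings;
  # -s keeps lines with the same key in their original relative order.
  LC_ALL=C sort -s -t "$tab" -k1,1 "$@" |
  awk -F '\t' '
    NR > 1 && $1 == prev {
      # same key as the previous line: append the text after the tab
      printf "<br>%s", substr($0, length($1) + 2)
      next
    }
    {
      if (NR > 1) print ""    # finish the previous output line
      printf "%s", $0
      prev = $1
    }
    END { if (NR > 0) print "" }'
}

# Demo with unsorted input.
printf 'b\tx\na\ty\nb\tz\n' | join_sorted
```

The stable sort matters here: without -s, the definitions of one word could be shuffled among themselves.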
Or just bash
unset n last
while read -r word definition
do
if [ "$last" = "$word" ]
then
printf "<br>%s" "$definition"
else
if [ "$n" ]
then
echo
else
n=1
fi
printf "%s\t%s" "$word" "$definition"
last="$word"
fi
done < long.text.file
echo
perl -p0E 'while(s/^((.+?)\t.*)\n\2\t/$1<br>/gm){}'
(It takes 2 s to process a 23 MB, 1.5-million-line dictionary on my 6-year-old laptop.)
With sed:
sed '$!N;/^\([^\t]*\t\)\(.*\)\(\n\)\1/!P;s//\3\1\2<br>/;D' <<\IN
word1 some text
word1 some other text
word1 some other other text
word2 more text
word3 even more
word3 and still more
IN
(Note: with many seds the above \t escape is invalid, and a literal <tab> character should be used in its place.)
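For a sed without the \t escape, one way to get the literal tab in is to let the shell produce it. A sketch, assuming a POSIX shell; join_dups, tab and script are names made up here:

```shell
tab=$(printf '\t')    # a literal tab character, produced by the shell

# Same sed script as above, with each \t replaced by the literal tab.
script='$!N;/^\([^'"$tab"']*'"$tab"'\)\(.*\)\(\n\)\1/!P;s//\3\1\2<br>/;D'

join_dups() {
  sed "$script" "$@"
}

# Demo: two lines share the key word1 and get joined with <br>.
printf 'word1\tsome text\nword1\tsome other text\nword2\tmore text\n' | join_dups
```

The script is built from single-quoted pieces so that the shell only expands $tab and leaves $!N and the backslashes alone.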
And if you have GNU sed, you can write it a little more simply:
sed -E '$!N;/^(\S+\t)(.*)\n\1/!P;s//\n\1\2<br>/;D' <infile
It works by gradually stacking input as it is read. If two consecutive lines do not begin with the same not-space string, then the first of these is Printed (the P command). Else the intervening newline is relocated to the head of the line, and the matched string immediately following it (to include the tab) is replaced with the string <br>.
Note that the stacking method used here could have performance implications if the line that sed assembles grows very long. If it grows beyond 8192 bytes, it exceeds the minimum pattern space size that POSIX requires a sed to support, so it is no longer guaranteed to work everywhere.
Regardless of which of the two possibilities occurred, last of all sed Deletes (the D command) up to the first occurring newline (\n) character in pattern space and starts over with what remains. And so when two consecutive lines do not begin with identical strings, the first is printed and deleted; else the substitution is performed and the Delete only deletes the newline which formerly separated them.
And so the command above prints:
word1 some text<br>some other text<br>some other other text
word2 more text
word3 even more<br>and still more
I used a <<\HERE_DOC for input above, but you should probably drop everything from <<\IN on and use </path/to/infile instead.