What difference does it make matching a word with/without a trailing whitespace?
It's a cheap and error-prone way of doing word matching.
Note that "the " with a space after it does not match the word "thereby", so matching with a space after "the" avoids matching that string at the start of words. However, it still does match "bathe " (if followed by a space), and it does not match "the" at the end of a line.
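As a quick illustration of those failure modes, here is a minimal sketch. The helper name trailing_space is made up for this example; it simply wraps the trailing-space substitution discussed above.

```shell
# Hypothetical helper: the "match with a trailing space" approach.
trailing_space() {
  sed 's/the /this /g'
}

# "thereby" is left alone, as intended.
printf '%s\n' 'thereby' | trailing_space        # prints: thereby

# But "bathe " is wrongly rewritten, and the final "the" is missed
# because nothing follows it on the line.
printf '%s\n' 'bathe in the' | trailing_space   # prints: bathis in the
```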
To match the word "the" properly (or any other word), you should not use spaces around the word, as that would prevent you from matching it at the start or end of a line, or when it is flanked by any other non-word character, such as punctuation or a tab character.
Instead, use a zero-width word boundary pattern:
sed 's/\<the\>/this/'
The \< and \> match the boundaries before and after the word, i.e. the zero-width positions between a word character and a non-word character (or the start or end of the line). A word character is generally any character matching [[:alnum:]_] (or [A-Za-z0-9_] in the POSIX locale).
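A small sketch of how the boundary pattern behaves on the tricky inputs, assuming a sed that supports \< and \> (GNU sed does; they are not required by POSIX). The helper name replace_word is hypothetical.

```shell
# Hypothetical helper: replace only the whole word "the".
# Assumes a sed with \< and \> support, such as GNU sed.
replace_word() {
  sed 's/\<the\>/this/g'
}

# Start of line, inside other words, and end of line.
printf '%s\n' 'the thereby bathe the' | replace_word   # prints: this thereby bathe this

# Punctuation also counts as a non-word character.
printf '%s\n' '(the) the,' | replace_word              # prints: (this) this,
```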
With GNU sed, you could also use \b in place of \< and \>:
sed 's/\bthe\b/this/'
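For example, the \b form also copes with a tab after the word, which a literal space in the pattern would miss. This is only a sketch that assumes GNU sed; replace_word_b is a made-up helper name.

```shell
# Hypothetical helper using the GNU sed \b word-boundary escape.
replace_word_b() {
  sed 's/\bthe\b/this/g'
}

# The word is followed by a tab, not a space; it is still replaced.
printf 'the\tman\n' | replace_word_b

# The word ends the line; it is still replaced.
printf '%s\n' 'meet the' | replace_word_b   # prints: meet this
```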
The difference is whether there is a space after "the" in the input text.
For instance:
With a sentence without a space, no replacement:
$ echo 'theman' | sed 's/the /this /'
theman
With a sentence with a space, works as expected:
$ echo 'the man' | sed 's/the /this /'
this man
With a sentence with another whitespace character, no replacement will occur:
$ echo -e 'the\tman' | sed 's/the /this /'
the man
sed works with regular expressions.
Using sed 's/the /this /', you make the space after "the" part of the matched pattern.
Using sed 's/the/this/', you replace "the" with "this" whether or not a space follows it. Without the g flag, only the first occurrence on each line is replaced.
In the HackerRank exercise, the result is the same because "the" is an article, which by the rules of grammar is normally followed by a space.
You can see the difference if you try, for example, to capitalize "the" in the phrase "the theater":
echo 'the theater' |sed 's/the /THE /g'
THE theater
# "theater" is left unchanged because its "the" is not followed by a space
echo 'the theater' |sed 's/the/THE/g'
THE THEater
# both occurrences of "the" are capitalized, including the one inside "theater"