Unable to make the mentioned regular expression work in a sed command
After working on it for a long time, I got my sed command to work. Below is the command that worked.
sed -E 's@^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))[^>]?$@<\1>@gm;t' websites.txt > output.txt
You can find a sample implementation of the command here.
The regex already fulfills the requirements of the person I'm writing this for, so I only needed help with the command syntax (although any improvements are heartily welcome). I want the command to work with the same regular expression pattern.
Things I was previously unaware of and have now learned:
I didn't know anything about the -E flag. Now I know that -E uses POSIX "extended" syntax ("ERE"). Thanks to @GordonDavisson and @Sundeep. Further reading.
I didn't know for certain that sed doesn't support look-around. Now I know it doesn't. Thanks to @dmitri-chubarov. Further reading.
I didn't know that sed doesn't support non-capturing groups either. Thanks to @Sundeep for solving this part. Further reading.
I didn't know about GNU sed as a specific command-line tool. Thanks to @oguzismail for this. Further reading.
With respect to the command in your answer:
sed -E 's@^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))[^>]?$@<\1>@gm;t'
Here are a few notes:
Your posted sample input has 1 URL per line so AFAIK the gm;t at the end of your sed command is doing nothing useful, so either your input is inadequate or your script is wrong.
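FWIW, one rough way to check that is to run the substitution with and without the trailing gm;t on one-URL-per-line input and compare the results. This is only a sketch: the two input lines are made up for illustration (they are not your websites.txt), and \b and the m flag assume GNU sed.

```shell
# Hypothetical one-URL-per-line input, standing in for websites.txt.
input='https://www.example.com
http://foo.bar'

# The regexp from the question, unchanged.
re='^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))[^>]?$'

# Original form: g and m flags followed by a bare "t".
with_flags=$(printf '%s\n' "$input" | sed -E "s@$re@<\1>@gm;t")

# Same substitution with the flags and the "t" removed.
without_flags=$(printf '%s\n' "$input" | sed -E "s@$re@<\1>@")

if [ "$with_flags" = "$without_flags" ]; then
    echo "identical output"
else
    echo "output differs"
fi
```

If your real input ever produces "output differs", that would be a case worth adding to the question.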
The hard-coded ranges a-z, A-Z, and 0-9 include different characters in different locales. If you meant to include all (and only) lower case letters, upper case letters, and digits then you should replace a-zA-Z0-9 with the POSIX character class [:alnum:]. So either change to use a locale-independent character class or specify the locale you need on your command line, depending on your requirements for which characters to match in your regexp.
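As a minimal sketch of those two options (using a deliberately tiny regexp and made-up input rather than your full URL pattern), either put the character class inside the bracket expression or pin the locale for the one command:

```shell
# Option 1: a POSIX character class inside the bracket expression.
out_class=$(printf '%s\n' 'abc123' 'ab_c' | sed -E 's/^[[:alnum:]]+$/<&>/')

# Option 2: keep the explicit ranges but pin the locale for this command only.
out_c=$(printf '%s\n' 'abc123' 'ab_c' | LC_ALL=C sed -E 's/^[a-zA-Z0-9]+$/<&>/')

# Each variable holds two lines: <abc123> and the unchanged ab_c
# (the underscore is not alphanumeric, so that line does not match).
printf '%s\n' "$out_class" "$out_c"
```

For plain ASCII input like this both give the same result. They can differ once the input contains non-ASCII letters, which is exactly the case where you need to decide which behavior you want.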
Like most characters, the character + is literal inside a bracket expression so it shouldn't be escaped - change \+ to just +.
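To illustrate why the escape matters, here is a small sketch with made-up input. In POSIX bracket expressions the backslash is literal too, so [\+] should match either a backslash or a plus sign, which is probably not what you intended:

```shell
# [\+] matches a backslash OR a plus sign, so both lines are changed.
escaped=$(printf '%s\n' 'a\b' 'a+b' | sed -E 's/[\+]/X/')

# [+] matches only the plus sign, so the backslash line is left alone.
plain=$(printf '%s\n' 'a\b' 'a+b' | sed -E 's/[+]/X/')

printf 'with [\\+]:\n%s\n' "$escaped"   # aXb and aXb
printf 'with [+]:\n%s\n' "$plain"       # a\b and aXb
```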
The bracket expression [^<]? means "1 or 0 occurrences of any character that is not a <" and similarly for [^>]? so if your "url" contained random characters at the start/end it'd be accepted, e.g.:
echo 'xhttp://foo.bar%' | sed -E 's@^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))[^>]?$@<\1>@gm;t'
<http://foo.bar%>
I think you meant to use <? and >? instead of [^<]? and [^>]?.
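A sketch of that change, assuming GNU sed for \b and leaving the rest of your regexp exactly as posted (only the first and last pieces differ), could look like this:

```shell
# Same regexp as in the question, but with <? and >? as the optional
# delimiters instead of [^<]? and [^>]?.
re='^<?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))>?$'

out=$(printf '%s\n' 'xhttp://foo.bar%' 'http://foo.bar' '<http://foo.bar>' |
      sed -E "s@$re@<\1>@")

printf '%s\n' "$out"
# xhttp://foo.bar%     (leading x is no longer accepted, line left unchanged)
# <http://foo.bar>     (bare URL gets wrapped)
# <http://foo.bar>     (already-wrapped URL stays wrapped once)
```

Note that a trailing % on an otherwise matching line would still be accepted, since % is in the final bracket expression of your regexp.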
Your regexp would allow a "url" that has no letters:
echo 'http://=.9' | gsed -E 's@^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))[^>]?$@<\1>@gm;t'
<http://=.9>
If you edit your question to provide more truly representative sample input and expected output (including cases you do not want to match) then we can help you, BUT based on a quick google of what a valid URL is, it looks like there are several valid URLs that'd be disallowed by your regexp and several invalid ones that'd be allowed, so you might want to ask about that in a question tagged with url or similar (with the tags you currently have we can help you implement your regexp, but there may be better people to help with defining your regexp).
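One possible way to lay out such sample input and expected output is a small check script like the sketch below. The check function is a hypothetical helper, and the expected values are my guesses at what you want (in particular that the two odd inputs from the notes above should be left unchanged), so adjust them to your real requirements. It assumes GNU sed for \b.

```shell
# The regexp from the question, unchanged.
re='^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))[^>]?$'

# check INPUT EXPECTED: run the substitution on INPUT and compare with EXPECTED.
check() {
    actual=$(printf '%s\n' "$1" | sed -E "s@$re@<\1>@")
    if [ "$actual" = "$2" ]; then
        echo "PASS: $1"
    else
        echo "FAIL: $1 -> $actual (expected $2)"
    fi
}

# Assumed expectations: wrap a plain URL, leave the questionable inputs alone.
check 'http://foo.bar'   '<http://foo.bar>'
check 'xhttp://foo.bar%' 'xhttp://foo.bar%'
check 'http://=.9'       'http://=.9'
```

With the regexp as posted, the first case passes and the other two fail, which matches the behavior shown in the notes above.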