How to use grep and cut in script to obtain website URLs from an HTML file
As I said in my comment, it's generally not a good idea to parse HTML with Regular Expressions, but you can sometimes get away with it if the HTML you're parsing is well-behaved.
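For reference, if you can use a real HTML parser instead, it sidesteps those caveats; a minimal sketch, assuming xmllint from libxml2 is installed:

# list every href attribute of every <a> element via a real parser
xmllint --html --xpath '//a/@href' source.html 2>/dev/null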
In order to only get URLs that are in the href attribute of <a> elements, I find it easiest to do it in multiple stages. From your comments, it looks like you only want the top-level URL (just the scheme and host), not the full URL. In that case you can use something like this:
grep -Eoi '<a [^>]+>' source.html |
grep -Eo 'href="[^"]+"' |
grep -Eo '(http|https)://[^/"]+'
where source.html is the file containing the HTML code to parse.

This code will print all top-level URLs that occur as the href attribute of any <a> elements in each line. The -i option to the first grep command ensures that it works on both <a> and <A> elements. I guess you could also give -i to the 2nd grep to capture upper-case HREF attributes; OTOH, I'd prefer to ignore such broken HTML. :)
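Since the question asks about doing this in a script, here's a minimal sketch that wraps the pipeline in a reusable function; the extract_domains name and the file argument are illustrative, not part of the answer itself:

#!/bin/bash
# Sketch: wrap the three-stage pipeline in a function.
# Pass the HTML file to parse as the first argument.
extract_domains() {
    grep -Eoi '<a [^>]+>' "$1" |
    grep -Eo 'href="[^"]+"' |
    grep -Eo '(http|https)://[^/"]+'
}

extract_domains source.html

Add -i to the second grep inside the function if you do want to catch upper-case HREF attributes.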
To process the contents of http://google.com/, pipe the output of wget through the same filters:
wget -qO- http://google.com/ |
grep -Eoi '<a [^>]+>' |
grep -Eo 'href="[^"]+"' |
grep -Eo '(http|https)://[^/"]+'
Output:
http://www.google.com.au
http://maps.google.com.au
https://play.google.com
http://www.youtube.com
http://news.google.com.au
https://mail.google.com
https://drive.google.com
http://www.google.com.au
http://www.google.com.au
https://accounts.google.com
http://www.google.com.au
https://www.google.com
https://plus.google.com
http://www.google.com.au
My output is a little different from the other examples as I get redirected to the Australian Google page.
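Since the question title also mentions cut, here's a hedged variant of the same pipeline that uses cut instead of a final grep; the field numbers assume double-quoted href values:

# keep the URL between the quotes, then keep only scheme://host
grep -Eoi '<a [^>]+>' source.html |
grep -Eo 'href="[^"]+"' |
cut -d'"' -f2 |
cut -d/ -f1-3

Note that relative links (no scheme) pass through unchanged, so you may want to filter with grep '^http' first if you only care about absolute URLs.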
Not sure if you are limited on tools, and regex might not be the best way to go as mentioned, but here is an example that I put together:
grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*" urls.html | sort -u
grep -E : is the same as egrep
grep -o : only outputs what has been grepped
(http|https) : is an either / or
a-z : is all lower case
A-Z : is all upper case
. : is dot
/ : is the slash
? : is ?
= : is equal sign
_ : is underscore
% : is percentage sign
: : is colon
- : is dash
* : is repeat the [...] group
sort -u : will sort & remove any duplicates
Output:
bob@bob-NE722:~s$ wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...
You can also add \d to catch digits, but note that \d is a Perl regex escape: it needs grep -P, and with grep -E the 0-9 already in the bracket expression covers digits.
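For example, a rough -P equivalent of the command above (this assumes your grep was built with PCRE support; \w covers letters, digits and underscore):

grep -Po 'https?://[\w.?=%:/-]*' urls.html | sort -u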
If your grep supports Perl regexes:
grep -Po '(?<=href=")[^"]*(?=")'
(?<=href=")
and(?=")
are lookaround expressions for thehref
attribute. This needs the-P
option.-o
prints the matching text.
For example:
$ curl -sL https://www.google.com | grep -Po '(?<=href=")[^"]*(?=")'
/search?
https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=IN&tab=w1
https://news.google.co.in/nwshp?hl=en&tab=wn
...
As usual, there's no guarantee that these are valid URIs, or that the HTML you're parsing will be valid.
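If you want some assurance, here's a minimal sketch that probes each extracted absolute URL and prints its HTTP status code; the page.html file name is illustrative:

grep -Po '(?<=href=")[^"]*(?=")' page.html |
grep -E '^https?://' |
while read -r url; do
    printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "$url")" "$url"
done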