Regular expression for parsing links from a webpage?
From the RegexBuddy library:
URL: Find in full text
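The final character class makes sure that if a URL is part of some text, punctuation such as a comma or full stop after the URL is not interpreted as part of the URL. The character classes list only uppercase letters, so apply the pattern with case-insensitive matching turned on.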
\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]
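To see how the final character class behaves, here is a minimal Python sketch. The sample text and the names URL_PATTERN and find_urls are my own, not part of RegexBuddy; the only thing taken from above is the pattern itself, compiled with re.IGNORECASE because it lists uppercase letters only.

```python
import re

# Pattern copied from the answer above; IGNORECASE is needed because the
# character classes contain A-Z only.
URL_PATTERN = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]",
    re.IGNORECASE,
)


def find_urls(text):
    """Return the full text of every URL match in `text` (hypothetical helper)."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


# Made-up sample: one URL is followed by a comma, the other by a full stop.
sample = "See http://example.com/page?id=1, or ftp://files.example.org/pub."
print(find_urls(sample))
```

The comma and the full stop are allowed inside a URL by the first class but not as its last character, so both are left out of the matches.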
With Html Agility Pack, you can use:
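HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when nothing matches, so check before iterating.
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        Response.Write(link.Attributes["href"].Value);
    }
}
// Saving is only needed if you changed the document.
doc.Save("file.htm");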
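If you are not on .NET, the same parser-based idea (walk the a elements and read href, rather than regex-matching raw markup) can be sketched with Python's standard html.parser module. This is not Html Agility Pack, just an analogous approach; LinkCollector, extract_links and the sample markup are names and data I made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all href values of <a> tags in `html`, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up sample: the second anchor has no href and is skipped.
page = (
    '<p><a href="/docs">Docs</a> <a name="top">no href</a> '
    '<A HREF="http://example.com/">Home</A></p>'
)
print(extract_links(page))
```

Unlike the full-text regex, this also picks up relative links such as /docs, since it reads the attribute instead of looking for a scheme.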
((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)
I took this from regexlib.com
[editor's note: the {1} has no real function in this regex; see this post]
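A quick way to check the editor's note is to run the pattern with and without the {1} and compare. The sketch below does that in Python; the sample text and the names WITH_QUANTIFIER, WITHOUT_QUANTIFIER and matches are assumptions of mine, not from regexlib.com.

```python
import re

# The pattern as posted, and the same pattern with the {1} removed.
WITH_QUANTIFIER = re.compile(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)")
WITHOUT_QUANTIFIER = re.compile(r"((mailto\:|(news|(ht|f)tp(s?))\://)\S+)")


def matches(pattern, text):
    """Return the full text of every match of `pattern` in `text`."""
    return [m.group(0) for m in pattern.finditer(text)]


# Made-up sample covering three of the schemes the pattern accepts.
sample = "Mail mailto:me@example.com or visit https://example.com/a, then ftp://host/x"

with_q = matches(WITH_QUANTIFIER, sample)
without_q = matches(WITHOUT_QUANTIFIER, sample)
print(with_q)
print(with_q == without_q)
```

Both versions return the same matches on this input. Note also that \S+ runs to the next whitespace, so the comma after https://example.com/a ends up inside the match.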