Regular expression to remove HTML tags

To turn this:

'<td>mamma</td><td><strong>papa</strong></td>'

into this:

'mamma papa'

You need to replace the tags with spaces:

.replace(/<[^>]*>/g, ' ')

and reduce any duplicate spaces into single spaces:

.replace(/\s{2,}/g, ' ')

then trim away leading and trailing spaces with:

.trim();

Putting it together, the removeTags function looks like this:

function removeTags(string){
  return string.replace(/<[^>]*>/g, ' ')
               .replace(/\s{2,}/g, ' ')
               .trim();
}
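As a quick check, the function can be run on the sample input from the top of this answer. This is only a usage sketch; the variable names `input` and `output` are made up for the illustration.

```javascript
// Same function as above, repeated so the snippet runs on its own.
function removeTags(string) {
  return string.replace(/<[^>]*>/g, ' ')   // each tag becomes a space
               .replace(/\s{2,}/g, ' ')    // runs of whitespace collapse to one space
               .trim();                    // drop leading/trailing whitespace
}

const input = '<td>mamma</td><td><strong>papa</strong></td>';
const output = removeTags(input);
console.log(output); // mamma papa
```

Replacing tags with a space rather than an empty string is what keeps "mamma" and "papa" from being glued together.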

Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.
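One concrete pitfall, as a small sketch: a `>` inside a quoted attribute value is legal HTML, but a pattern such as `/<[^>]*>/g` treats it as the end of the tag. The function name `naiveStrip` and the sample markup are invented for this illustration.

```javascript
// Naive tag stripper: everything from '<' to the next '>' counts as a tag.
function naiveStrip(html) {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s{2,}/g, ' ').trim();
}

// The '>' inside the title attribute ends the match too early,
// so part of the attribute leaks into the result.
const leaked = naiveStrip('<a title="1 > 0">link</a>');
console.log(leaked); // 0">link
```

A parser would read the attribute value as a unit and return just "link" as the element's text.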

Here is a link to a blog post I wrote a while back that goes into more detail about this problem.

  • http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

That being said, here's a solution that should fix this particular problem, although it is by no means a perfect one.

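// Requires: using System.Text.RegularExpressions;
var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if ( m.Success ) {
  // Text that follows the first <img> or <a> tag, up to the next '<'.
  sResult = m.Groups["content"].Value;
}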
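For comparison, roughly the same match can be sketched in JavaScript, which also supports named capture groups (since ES2018). The helper name `firstContent` and the sample strings are assumptions made for the illustration, not part of the original question.

```javascript
// Capture the text that directly follows the first <img ...> or <a ...> tag.
const contentPattern = /<(img|a)[^>]*>(?<content>[^<]*)</;

function firstContent(summary) {
  const m = contentPattern.exec(summary);
  // Return null when nothing matches, mirroring the m.Success check.
  return m ? m.groups.content : null;
}

console.log(firstContent('<a href="/home">Home page</a>')); // Home page
console.log(firstContent('no tags here'));                  // null
```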

Strip off HTML Elements (with/without attributes)

/<\/?[\w\s]*>|<.+[\W]>/g

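This strips the HTML elements and leaves the text behind. It also copes with malformed HTML (i.e. elements that are missing closing tags). Note that the .+ in the second alternative is greedy, so when several tags with attributes sit on one line it can also swallow the text between them.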
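A small sketch of the pattern in use, in JavaScript. The sample strings and the name `stripElements` are made up for the illustration. The second call is included because the `.+` in the second alternative is greedy, so it is worth testing the pattern against input that has several attribute-bearing tags on one line.

```javascript
const elementPattern = /<\/?[\w\s]*>|<.+[\W]>/g;

function stripElements(html) {
  return html.replace(elementPattern, '');
}

// Plain tags without attributes are removed one by one.
console.log(stripElements('<p>Hello <b>world</b></p>')); // Hello world

// With two attribute-bearing tags on one line, the greedy ".+" spans from the
// first '<' to the '>' that closes the second opening tag, taking "one" with it.
console.log(stripElements('<a href="x">one</a> <a href="y">two</a>')); // two
```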

Reference and example (Ex.10)


To also remove the whitespace between tags, you can use the following method, which combines a regex with a trim of the whitespace at the start and end of the input HTML:

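    // Requires: using System.Net; using System.Text.RegularExpressions;
    public static string StripHtml(string inputHTML)
    {
        // A tag followed by whitespace and another tag is removed together
        // with that whitespace; any other tag is removed on its own.
        const string HTML_MARKUP_REGEX_PATTERN = @"<[^>]+>\s+(?=<)|<[^>]+>";

        // Decode entities and trim the outer whitespace. Note that decoding
        // first means escaped markup such as &lt;b&gt; is stripped as well.
        inputHTML = WebUtility.HtmlDecode(inputHTML).Trim();

        string noHTML = Regex.Replace(inputHTML, HTML_MARKUP_REGEX_PATTERN, string.Empty);

        return noHTML;
    }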

So for the following input:

      <p>     <strong>  <em><span style="text-decoration:underline;background-color:#cc6600;"></span><span style="text-decoration:underline;background-color:#cc6600;color:#663333;"><del>   test text  </del></span></em></strong></p><p><strong><span style="background-color:#999900;"> test 1 </span></strong></p><p><strong><em><span style="background-color:#333366;"> test 2 </span></em></strong></p><p><strong><em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p>      

The output is just the text, with no whitespace left over from between the HTML tags or from before or after the HTML: "   test text   test 1  test 2  test 3 ".

Note that the spaces around "test text" come from inside the <del>   test text  </del> element, and the space after "test 3" comes from inside the <span style="text-decoration:underline;background-color:#006600;"> test 3 </span> element. Whitespace inside an element next to text is kept; only whitespace between two tags is removed.
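The tag-stripping part of this method can be sketched in JavaScript as follows. This is a partial port under stated assumptions: it omits the `WebUtility.HtmlDecode` step, since core JavaScript has no built-in HTML entity decoder, and the function name `stripHtml` and the shortened sample input are invented for the illustration.

```javascript
// A tag followed by whitespace and another tag is removed together with that
// whitespace; any other tag is removed on its own.
const markupPattern = /<[^>]+>\s+(?=<)|<[^>]+>/g;

function stripHtml(inputHtml) {
  // Trim first, as the C# method does, then drop the markup.
  return inputHtml.trim().replace(markupPattern, '');
}

const sample = '  <p>   <strong> <em><del>   test text  </del></em></strong></p>' +
               '<p><span> test 1 </span></p>  ';
console.log(JSON.stringify(stripHtml(sample))); // "   test text   test 1 "
```

As in the C# example, the whitespace inside `<del>` and `<span>` survives because it sits next to text rather than between two tags.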

Tags: C#, .NET, Regex