Strip all HTML attributes except for src

You usually should not parse HTML using regular expressions.

Instead, you should call DOMDocument::loadHTML.
You can then recurse through the elements in the document and call removeAttribute.


Alright, here's what I used that seems to be working well:

<([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>

Feel free to poke any holes in it.


This might work for your needs:

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text);

// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

The RegExp broken down:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]         # Match 'a' through 'z'
  [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 (?:           # Start Non-Capture Group
  [^>]*         # Match anything other than '>', Zero or More Times
  (             # Start Capture Group $2 - ' src="...."'
   \s            # Match one whitespace
   src=          # Match 'src='
   ['"]          # Match ' or "
   [^'"]*        # Match anything other than ' or " 
   ['"]          # Match ' or "
  )             # End Capture Group 2
 )?            # End Non-Capture Group, match group zero or one time
 [^>]*?        # Match anything other than '>', Zero or More times, not-greedy (wont eat the /)
 (\/?)         # Capture Group $3 - '/' if it is there
 >             # Match '>'
/i            # End Pattern - Case Insensitive

Add some quoting, and use the replacement text <$1$2$3> it should strip any non src= properties from well-formed HTML tags.

Please Note This isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp people are so cleverly noting below. There are a few fallbacks, most notably <p style=">"> would end up <p>"> and a few other broken issues... I would recommend looking at Zend_Filter_StripTags as a full proof tags/attributes filter in PHP