SQL - Remove all HTML tags in a string
This is the simplest way.
DECLARE @str VARCHAR(299)
SELECT @str = '<html><head></head><body><p>My text.</p></body></html>'
SELECT cast(@str AS XML).query('.').value('.', 'varchar(200)')
Firstly create a user defined function that strips the HTML out like so:
CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT;
DECLARE @End INT;
DECLARE @Length INT;
SET @Start = CHARINDEX('<', @HTMLText);
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
SET @Length = (@End - @Start) + 1;
WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '');
SET @Start = CHARINDEX('<', @HTMLText);
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
SET @Length = (@End - @Start) + 1;
END;
RETURN LTRIM(RTRIM(@HTMLText));
END;
GO
When you're trying to select it:
SELECT dbo.udf_StripHTML([column]) FROM SOMETABLE
This should lead to you avoiding to have to use several nested replace statements.
Credit and further info: http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/
If the HTML is well formed then there's no need to use replace to parse XML.
Just cast or convert it to an XML type and get the value(s).
Here's an example to output the text from all tags:
declare @htmlData nvarchar(100) = '<html>
<head>
</head>
<body>
<p>My text.</p>
<p>My other text.</p>
</body>
</html>';
select convert(XML,@htmlData,1).value('.', 'nvarchar(max)');
select cast(@htmlData as XML).value('.', 'nvarchar(max)');
Note that there's a difference in the output of whitespace between cast and convert.
To only get content from a specific node, the XQuery syntax is used. (XQuery is based on the XPath syntax)
For example:
select cast(@htmlData as XML).value('(//body/p/node())[1]', 'nvarchar(max)');
select convert(XML,@htmlData,1).value('(//body/p/node())[1]', 'nvarchar(max)');
Result : My text.
Of course, this still assumes a valid XML.
If for example, a closing tag is missing then this would raise an XML parsing
error.
If the HTML isn't well formed as an XML, then one could use PATINDEX & SUBSTRING to get the first p tag. And then cast that to an XML type to get the value.
select cast(SUBSTRING(@htmlData,patindex('%<p>%',@htmlData),patindex('%</p>%',@htmlData) - patindex('%<p>%',@htmlData)+4) as xml).value('.','nvarchar(max)');
or via a funky recursive way:
declare @xmlData nvarchar(100);
WITH Lines(n, x, y) AS (
SELECT 1, 1, CHARINDEX(char(13), @htmlData)
UNION ALL
SELECT n+1, y+1, CHARINDEX(char(13), @htmlData, y+1) FROM Lines
WHERE y > 0
)
SELECT @xmlData = concat(@xmlData,SUBSTRING(@htmlData,x,IIF(y>0,y-x,8)))
FROM Lines
where PATINDEX('%<p>%</p>%', SUBSTRING(@htmlData,x,IIF(y>0,y-x,10))) > 0
order by n;
select
@xmlData as xmlData,
convert(XML,@xmlData,1).value('(/p/node())[1]', 'nvarchar(max)') as FirstP;
DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>'
SELECT t.c.value('.', 'NVARCHAR(MAX)')
FROM @x.nodes('*') t(c)
Update - For strings with unclosed tags:
DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>'
SELECT x.value('.', 'NVARCHAR(MAX)')
FROM (
SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)
) r