Removing html tags from a string in R

Another approach, using tm.plugin.webmining, which uses XML internally.

> library(tm.plugin.webmining)
> extractHTMLStrip("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"

This can be achieved simply through regular expressions and the grep family:

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

This also works when there are multiple HTML tags in the same string.

It finds every instance of the pattern <.*?> in htmlString and replaces it with the empty string "". The ? in .*? makes the match non-greedy, so if you have multiple tags (e.g., <a> junk </a>) it matches <a> and </a> separately instead of matching the whole string at once.
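To illustrate the non-greedy behavior, here is cleanFun applied to a couple of inputs (expected output shown as comments):

```r
cleanFun <- function(htmlString) {
  # Non-greedy match: each <...> tag is removed individually
  gsub("<.*?>", "", htmlString)
}

cleanFun("<a> junk </a>")
# [1] " junk "

cleanFun("junk junk<a href=\"/wiki/abstraction_(mathematics)\"> junk junk")
# [1] "junk junk junk junk"
```

Note that the tags themselves are deleted but the surrounding whitespace is kept, so you may want to trim or squish the result afterwards.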


You can also do this with two functions in the rvest package:

library(rvest)

strip_html <- function(s) {
    html_text(read_html(s))
}

Example output:

> strip_html("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"

Note that, in general, you should not use regexes to parse HTML: tags can legally contain > inside quoted attribute values, and regexes cannot handle nested or malformed markup, so a real parser such as rvest is more robust.
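A small sketch of why the regex can fail (the example string is contrived for illustration): a literal > inside a quoted attribute value makes the non-greedy pattern stop too early, while an HTML parser reads the tag correctly.

```r
library(rvest)

s <- '<a title="x > y">link</a>'

# The non-greedy regex stops at the first >, which here sits
# inside the attribute value, so the output is mangled:
gsub("<.*?>", "", s)
# [1] " y\">link"

# A real parser handles the quoted attribute and returns the text:
html_text(read_html(s))
# [1] "link"
```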

Tags:

String

R