How to avoid surrounding html head tags in Jsoup parse
You can try using the XML parser, but this doesn't always work because HTML is not always well-formed XML; it often has unterminated tags like <img> and <br>. It's better to stick with the HTML parser. You can rely on the <html>, <head>, and <body> tags being there, and they are easy to discard. Just get your fragment of HTML by selecting the body tag and asking for its HTML.
Document doc = Jsoup.parseBodyFragment(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc.select("body").html());
To get the expected output it would actually be:
final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
Document doc = Jsoup.parseBodyFragment(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc.body().html());
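The caveat about unterminated tags such as <br> is easiest to judge by running both parsers on the same fragment. The following is only a minimal sketch, assuming jsoup is on the classpath; the class name ParserComparison and its two helper methods are made up for illustration and are not part of jsoup. The exact XML-parser output for <br> can vary between jsoup versions, so no expected output is shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical helper class, for illustration only.
public class ParserComparison {

    // HTML parser: parse as a body fragment, then return only the inner HTML of <body>,
    // which discards the html/head/body shell.
    static String viaHtmlParser(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    // XML parser: no html/head/body shell is added, but the input is treated as XML,
    // so HTML rules for tags like <br> are not guaranteed to apply.
    static String viaXmlParser(String fragment) {
        Document doc = Jsoup.parse(fragment, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false);
        return doc.html();
    }

    public static void main(String[] args) {
        String fragment = "<p>One<br>Two <i>three</i></p>";
        System.out.println("HTML parser: " + viaHtmlParser(fragment));
        System.out.println("XML parser:  " + viaXmlParser(fragment));
    }
}
```

If the two lines differ for your input, the fragment is not well-formed XML and the body-based approach is the safer choice.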
The cause:
parseBodyFragment() and all the other parse() methods use an HTML parser by default, and that parser always adds the HTML shell (<html>…</html>, <head>…</head>, etc.).
The Solution:
Just don't use an HTML parser; use an XML parser instead ;-)
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Replace that single line and your problem is solved.
Example:
final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
Document docHtml = Jsoup.parse(html);
Document docXml = Jsoup.parse(html, "", Parser.xmlParser());
System.out.println("******* HTML *******\n" + docHtml);
System.out.println();
System.out.println("******* XML *******\n" + docXml);
Output:
******* HTML *******
<html>
<head></head>
<body>
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
</body>
</html>
******* XML *******
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>