Reading entire html file to String?
You should use a StringBuilder:
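StringBuilder contentBuilder = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new FileReader("mypage.html"))) {
    String str;
    while ((str = in.readLine()) != null) {
        // readLine() strips the line terminator, so add one back
        contentBuilder.append(str).append('\n');
    }
} catch (IOException e) {
    // don't swallow the failure silently
    e.printStackTrace();
}
String content = contentBuilder.toString();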
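For reference, the same StringBuilder loop can be wrapped in a self-contained helper. This is only a minimal sketch, not part of the original answer: the class name HtmlFileReader and the method readFile are made up for illustration, and it assumes the file is UTF-8 encoded (Files.newBufferedReader lets you name the charset explicitly, which FileReader with a single file-name argument does not).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlFileReader {

    // Hypothetical helper: reads the whole file into a String, line by line.
    static String readFile(Path path) throws IOException {
        StringBuilder contentBuilder = new StringBuilder();
        // UTF-8 is an assumption here; pass the charset your file actually uses.
        try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String str;
            while ((str = in.readLine()) != null) {
                // readLine() drops the line terminator, so one is appended back.
                contentBuilder.append(str).append('\n');
            }
        }
        return contentBuilder.toString();
    }

    public static void main(String[] args) throws IOException {
        // Demo with a throwaway file so the example runs anywhere.
        Path tmp = Files.createTempFile("mypage", ".html");
        Files.write(tmp, "<html>\n<body>Hello</body>\n</html>\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(readFile(tmp));
        Files.delete(tmp);
    }
}
```

Letting the IOException propagate to the caller, rather than catching it inside the helper, is a design choice in this sketch; the caller usually knows better what to do when the page cannot be read.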
As Jean mentioned, using a StringBuilder instead of += would be better. But if you're looking for something simpler, Guava, IOUtils, and Jsoup are all good options.
Example with Guava:
String content = Files.asCharSource(new File("/path/to/mypage.html"), StandardCharsets.UTF_8).read();
Example with IOUtils:
// new URL(...) needs a scheme; a bare path throws MalformedURLException
InputStream in = new URL("file:///path/to/mypage.html").openStream();
String content;
try {
content = IOUtils.toString(in, StandardCharsets.UTF_8);
} finally {
IOUtils.closeQuietly(in);
}
Example with Jsoup:
String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").toString();
or
String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").outerHtml();
NOTES:
Files.readLines() and Files.toString() are deprecated as of Guava release version 22.0 (May 22, 2017). Files.asCharSource() should be used instead, as seen in the example above. (version 22.0 release diffs)
IOUtils.toString(InputStream) and Charsets.UTF_8 are deprecated as of Apache Commons-IO version 2.5 (May 6, 2016). IOUtils.toString should now be passed the InputStream and the Charset, as seen in the example above. Java 7's StandardCharsets should be used instead of Charsets, as seen in the example above. (deprecated Charsets.UTF_8)
There's the IOUtils.toString(..) utility from Apache Commons.
If you're using Guava, there's also Files.readLines(..) and Files.toString(..).
You can use Jsoup. It's a very capable HTML parser for Java.