Cheerio: Extract Text from HTML with separators
This seems to do the trick:
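// Collect the child nodes of every element and keep only the text nodes.
var t = $('html *').contents().map(function() {
    // Non-text nodes (tags, comments) contribute an empty string.
    return (this.type === 'text') ? $(this).text() : '';
}).get().join(' ');
console.log(t);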
Result:
One Two
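I improved my solution a little: appending the space to each text node, rather than joining on a space, keeps non-text nodes from adding stray separators: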
var t = $('html *').contents().map(function() {
    return (this.type === 'text') ? $(this).text() + ' ' : '';
}).get().join('');
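To see how the two versions differ, here is a minimal sketch that uses a plain array in place of the Cheerio node list. The `nodes` array and its `data` field are made-up stand-ins for illustration, and the sketch assumes that the empty strings returned for non-text nodes are kept in the mapped array:

```javascript
// Hypothetical stand-in for the nodes returned by $('html *').contents().
var nodes = [
    { type: 'tag' },
    { type: 'text', data: 'One' },
    { type: 'tag' },
    { type: 'text', data: 'Two' }
];

// First version: non-text nodes become '' and everything is joined with ' '.
var joinedWithSpace = nodes.map(function (node) {
    return node.type === 'text' ? node.data : '';
}).join(' ');

// Improved version: each text node carries its own trailing space.
var spaceAppended = nodes.map(function (node) {
    return node.type === 'text' ? node.data + ' ' : '';
}).join('');

console.log(JSON.stringify(joinedWithSpace)); // " One  Two"
console.log(JSON.stringify(spaceAppended));   // "One Two "
```

The first form picks up an extra space for every non-text node, while the second only leaves a single trailing space, which a final `.trim()` would remove.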
You can use the TextVersionJS package to generate a plain-text version of an HTML string. It works both in the browser and in Node.js.
var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
Install it from npm; for browser use you can bundle it with Browserify, for example.
You can use the following function to extract the text from an HTML string, with the text nodes separated by a single space:
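import * as cheerio from 'cheerio';

// Note: CheerioStatic is the type from older @types/cheerio; newer Cheerio releases export CheerioAPI instead.
function extractTextFromHtml(html: string): string {
    const cheerioStatic: CheerioStatic = cheerio.load(html || '');
    // Visit the child nodes of every element and keep only the text nodes.
    return cheerioStatic('html *').contents().toArray()
        .map(element => element.type === 'text' ? cheerioStatic(element).text().trim() : null)
        .filter(text => text) // drop nulls and whitespace-only text
        .join(' ');
}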
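The trim-then-filter step is what keeps whitespace-only text nodes (such as the newlines and indentation between tags) out of the result. A small sketch of that step on its own, using a made-up `texts` array in place of the real text nodes:

```javascript
// Hypothetical text-node contents, including whitespace-only nodes.
var texts = ['\n  ', 'One', '\n  ', ' Two ', '\n'];

var result = texts
    .map(function (text) { return text.trim(); }) // whitespace-only nodes become ''
    .filter(function (text) { return text; })     // '' is falsy, so it is dropped
    .join(' ');

console.log(JSON.stringify(result)); // "One Two"
```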