Node.js Cheerio parser breaks UTF-8 encoding
I ran into an issue earlier today when I tried to load a page with cheerio that contained special characters such as ç, á, é, etc.
By default, cheerio decodes entities when it parses the page and, on output, presents each non-ASCII character as the numeric HTML entity for its Unicode code point. For example, instead of ç it would give us &#xE7;.
To sort that out, I just had to turn this behavior off by passing decodeEntities: false as a cheerio load option:
const $ = cheerio.load(body, { decodeEntities: false });
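To make the encoding described above concrete, here is a minimal sketch that mimics the shape of that output. The helper name encodeNonAscii is hypothetical and is not part of cheerio's API; it only assumes that every code point above the ASCII range is written as a hexadecimal numeric entity.

```javascript
// Hypothetical helper, NOT part of cheerio: replaces every non-ASCII
// code point with a hexadecimal numeric character reference.
function encodeNonAscii(text) {
  // Array.from iterates by code point, so astral characters stay intact.
  return Array.from(text, (ch) => {
    const codePoint = ch.codePointAt(0);
    return codePoint > 0x7f
      ? `&#x${codePoint.toString(16).toUpperCase()};`
      : ch;
  }).join('');
}

console.log(encodeNonAscii('\u00e7')); // &#xE7;
```

ASCII text passes through unchanged, which is why only the accented characters in the page looked "broken".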
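Cheerio hasn't broken anything. It's outputting HTML entities, which any browser will render exactly the same as the original HTML input. The two lines below, one entity-encoded and one not, display identically in a browser:
<h1><span style="font-size: 16px;">&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY - &#x43F;&#x43E;&#x43F;&#x440;&#x43E;&#x431;&#x443;&#x439;&#x442;&#x435; &#x43D;&#x430;&#x439;&#x442;&#x438; &#x43B;&#x443;&#x447;&#x448;&#x435;</span></h1>
<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>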
&#x423;, for example, is the character У encoded as an HTML entity, in the same way the entity &gt; represents >.
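A numeric entity is just the character's Unicode code point written in hexadecimal (or decimal), so it can be mapped back mechanically. The sketch below illustrates that; decodeNumericEntities is a hypothetical helper written for this answer, not a cheerio function, and it deliberately leaves named entities such as &gt; untouched.

```javascript
// Hypothetical helper, NOT part of cheerio: turns numeric character
// references (hex like &#x423; or decimal like &#1059;) back into characters.
// Note: String.fromCodePoint throws a RangeError for values above 0x10FFFF.
function decodeNumericEntities(html) {
  return html.replace(/&#(?:x([0-9a-f]+)|([0-9]+));/gi, (match, hex, dec) =>
    String.fromCodePoint(
      hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10)
    )
  );
}

// Hex and decimal references for the same code point decode identically.
console.log(decodeNumericEntities('&#x423;') === decodeNumericEntities('&#1059;')); // true
console.log(decodeNumericEntities('&#x423;') === '\u0423'); // true
```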
However, if you want to get the unencoded text, you can set the decodeEntities option to false:
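const cheerio = require('cheerio');

// With decodeEntities disabled, the text is returned as-is instead of as numeric entities.
const $ = cheerio.load(
  `<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>`,
  { decodeEntities: false }
);
console.log($('span').html());
// => Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше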