Nokogiri, open-uri, and Unicode Characters
Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(...).read
and pass the resulting string to Nokogiri.
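In short (url below is a placeholder for whatever page you're fetching; the full walkthrough follows):
doc = Nokogiri::HTML(open(url))       # IO object: text may come out garbled
doc = Nokogiri::HTML(open(url).read)  # String: keeps its UTF-8 encoding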
Analysis:
If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8, and the file content includes valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment in the Ruby file and setting the doc encoding, it's no good:
# encoding: UTF-8
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'))
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1]
puts h52.text, h52.text.encoding
#=> Genealogà a de Jesucristo
#=> UTF-8
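As an aside, the charset that curl reports can be confirmed from plain Ruby as well; a minimal sketch using Net::HTTP from the standard library (nothing Nokogiri-specific is involved):
require 'net/http'
require 'uri'

uri = URI('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
response = Net::HTTP.get_response(uri)
# The charset arrives in the Content-Type header, e.g. "text/html; charset=UTF-8"
puts response['Content-Type']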
We can see that this is not the fault of open-uri:
html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
gene = html.read[/Gene\S+/]
puts gene, gene.encoding
#=> Genealogía
#=> UTF-8
This seems to be a Nokogiri issue when dealing with open-uri: handed an IO object rather than a string, Nokogiri has to guess the encoding from the raw bytes, and setting doc.encoding after parsing doesn't re-decode text that was already read. It can be worked around by passing the HTML to Nokogiri as a raw string:
# encoding: UTF-8
require 'nokogiri'
require 'open-uri'
html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
doc = Nokogiri::HTML(html.read)
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1].text
puts h52, h52.encoding, h52 == "Genealogía de Jesucristo"
#=> Genealogía de Jesucristo
#=> UTF-8
#=> true
I was having the same problem, and the Iconv approach wasn't working. Nokogiri::HTML is an alias for Nokogiri::HTML.parse(thing, url, encoding, options).
So, you just need to do:
doc = Nokogiri::HTML(open(link).read, nil, 'utf-8')
and it'll convert the page encoding properly to UTF-8. You'll see Ragù instead of Rag\303\271.
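Putting the pieces together, a minimal end-to-end sketch (using the URL and h5 selector from the question above):
# encoding: UTF-8
require 'nokogiri'
require 'open-uri'

link = 'http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'
# Passing the encoding as the third argument tells the parser up front,
# rather than letting it guess from the raw bytes.
doc = Nokogiri::HTML(open(link).read, nil, 'utf-8')
puts doc.css('h5')[1].text
#=> Genealogía de Jesucristo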