Python correct encoding of Website (Beautiful Soup)
You are making two mistakes: you are mishandling encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.
First of all, don't use `response.text`! It is not BeautifulSoup at fault here; you are re-encoding a Mojibake. The `requests` library defaults to Latin-1 encoding for `text/*` content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.
See the Encoding section of the Advanced documentation:
> The only time Requests will not do this is if **no explicit charset is present in the HTTP headers and the `Content-Type` header contains `text`**. In this situation, RFC 2616 specifies that the default charset must be `ISO-8859-1`. Requests follows the specification in this case. If you require a different encoding, you can manually set the `Response.encoding` property, or use the raw `Response.content`.
Bold emphasis mine.
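To see what that fallback does, here is a minimal sketch of the Mojibake with no network involved, assuming the server actually sent UTF-8 bytes without declaring a charset:

    raw = 'café'.encode('utf-8')      # what the server actually sent: b'caf\xc3\xa9'
    text = raw.decode('ISO-8859-1')   # what the requests fallback produces
    print(text)                       # cafÃ© -- the Mojibake
    print(raw.decode('utf-8'))        # café -- the correct decode

Re-encoding that Mojibake, as your code does, just bakes the damage in.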
Pass in the `response.content` raw data instead:
    soup = BeautifulSoup(r.content)
I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the `beautifulsoup4` project, and use `from bs4 import BeautifulSoup`.
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML `<meta>` tag or statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it on to BeautifulSoup from the response, but do test first whether `requests` used a default:
    encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
    parser = 'html.parser'  # or 'lxml' or 'html5lib'
    soup = BeautifulSoup(r.content, parser, from_encoding=encoding)
Last but not least, with BeautifulSoup 4, you can extract all text from a page using `soup.get_text()`:
    text = soup.get_text()
    print(text)
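Putting those pieces together, a minimal end-to-end sketch (the URL is a placeholder):

    import requests
    from bs4 import BeautifulSoup

    r = requests.get('https://example.com/')  # placeholder URL
    encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
    soup = BeautifulSoup(r.content, 'html.parser', from_encoding=encoding)
    print(soup.get_text())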
You are instead converting a result list (the return value of `soup.findAll()`) to a string. This can never work, because containers in Python use `repr()` on each element in the list to produce a debugging string, and for byte strings that means you get escape sequences for anything that is not a printable ASCII character.
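A quick illustration of that `repr()` behaviour, with a throwaway list of byte strings standing in for your data:

    items = ['café'.encode('utf-8')]  # a list containing a byte string
    print(str(items))                 # [b'caf\xc3\xa9'] -- repr() escapes, not text
    for item in items:
        print(item.decode('utf-8'))   # café -- decode each element instead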
It's not BeautifulSoup's fault. You can see this by printing out `encodedText` before you ever use BeautifulSoup: the non-ASCII characters are already gibberish.
The problem here is that you are mixing up bytes and characters. For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
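In code, that definition is just a round trip:

    s = 'café'                      # characters (a str)
    b = s.encode('utf-8')           # encoding: characters -> bytes, b'caf\xc3\xa9'
    assert b.decode('utf-8') == s   # decoding: bytes -> the same characters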
A look at the `requests` documentation shows that `r.text` is made of characters, not bytes. You shouldn't be encoding it. If you try to do so, you will make a byte string, and when you try to treat that as characters, bad things will happen.
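For example, assuming `r` is a `requests` response for a page that really is UTF-8, encoding `r.text` yourself only gets you bytes again, whose printed form is escape sequences rather than text:

    print(type(r.text))           # <class 'str'>   -- already characters
    bad = r.text.encode('utf-8')  # now you have bytes again
    print(type(bad))              # <class 'bytes'>
    print(bad)                    # b'...caf\xc3\xa9...' -- escapes, not text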
There are two ways to get around this:
- Use the raw undecoded bytes, which are stored in `r.content`, as Martijn suggested. Then you can decode them yourself to turn them into characters.
- Let `requests` do the decoding, but make sure it uses the right codec. Since you know that's UTF-8 in this case, you can set `r.encoding = 'utf-8'`. If you do this before you access `r.text`, then when you do access it, it will have been properly decoded and you get a character string. You don't need to mess with character encodings at all; both options are sketched below.
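A short sketch of both options, assuming the page really is UTF-8 (the URL is a stand-in):

    import requests

    r = requests.get('https://example.com/')  # stand-in URL

    # Option 1: decode the raw bytes yourself
    text = r.content.decode('utf-8')

    # Option 2: tell requests the right codec before touching r.text
    r.encoding = 'utf-8'
    text = r.text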
Incidentally, Python 3 makes it somewhat easier to maintain the difference between character strings and byte strings, because it requires you to use different types of objects to represent them.
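For example, mixing the two types in Python 3 fails loudly instead of silently producing Mojibake:

    >>> 'café' + b' bytes'
    Traceback (most recent call last):
      ...
    TypeError: can only concatenate str (not "bytes") to str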