Python correct encoding of Website (Beautiful Soup)
You are making two mistakes: you are mishandling encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.
First of all, don't use `response.text`! It is not BeautifulSoup at fault here; you are re-encoding a Mojibake. The `requests` library defaults to Latin-1 encoding for `text/*` content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.
See the Encoding section of the Advanced documentation:
> The only time Requests will not do this is if **no explicit charset is present in the HTTP headers and the `Content-Type` header contains `text`**. In this situation, RFC 2616 specifies that the default charset must be `ISO-8859-1`. Requests follows the specification in this case. If you require a different encoding, you can manually set the `Response.encoding` property, or use the raw `Response.content`.
Bold emphasis mine.
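To see what that fallback does, here is a minimal sketch of the Mojibake with no network involved, assuming the server actually sent UTF-8 bytes without declaring a charset:

    raw = 'café'.encode('utf-8')      # what the server actually sent: b'caf\xc3\xa9'
    text = raw.decode('ISO-8859-1')   # what the requests fallback produces
    print(text)                       # cafÃ© -- the Mojibake
    print(raw.decode('utf-8'))        # café -- the correct decode

Re-encoding that Mojibake, as your code does, just bakes the damage in.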
Pass in the `response.content` raw data instead:
    soup = BeautifulSoup(r.content)
I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the `beautifulsoup4` project, and use `from bs4 import BeautifulSoup`.
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML `<meta>` tag or statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it on to BeautifulSoup from the response, but do test first whether `requests` used a default:
    encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
    parser = 'html.parser'  # or 'lxml' or 'html5lib'
    soup = BeautifulSoup(r.content, parser, from_encoding=encoding)
Last but not least, with BeautifulSoup 4, you can extract all text from a page using `soup.get_text()`:
    text = soup.get_text()
    print(text)
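Putting those pieces together, a minimal end-to-end sketch (the URL is a placeholder):

    import requests
    from bs4 import BeautifulSoup

    r = requests.get('https://example.com/')  # placeholder URL
    encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
    soup = BeautifulSoup(r.content, 'html.parser', from_encoding=encoding)
    print(soup.get_text())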
You are instead converting a result list (the return value of `soup.findAll()`) to a string. This can never work, because containers in Python use `repr()` on each element in the list to produce a debugging string, and for byte strings that means you get escape sequences for anything that is not a printable ASCII character.
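A quick illustration of that `repr()` behaviour, with a throwaway list of byte strings standing in for your data:

    items = ['café'.encode('utf-8')]  # a list containing a byte string
    print(str(items))                 # [b'caf\xc3\xa9'] -- repr() escapes, not text
    for item in items:
        print(item.decode('utf-8'))   # café -- decode each element instead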
It's not BeautifulSoup's fault. You can see this by printing out `encodedText` before you ever use BeautifulSoup: the non-ASCII characters are already gibberish.
The problem here is that you are mixing up bytes and characters. For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
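In code, that definition is just a round trip:

    s = 'café'                      # characters (a str)
    b = s.encode('utf-8')           # encoding: characters -> bytes, b'caf\xc3\xa9'
    assert b.decode('utf-8') == s   # decoding: bytes -> the same characters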
A look at the `requests` documentation shows that `r.text` is made of characters, not bytes. You shouldn't be encoding it. If you try to do so, you will make a byte string, and when you try to treat that as characters, bad things will happen.
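For example, assuming `r` is a `requests` response for a page that really is UTF-8, encoding `r.text` yourself only gets you bytes again, whose printed form is escape sequences rather than text:

    print(type(r.text))           # <class 'str'>   -- already characters
    bad = r.text.encode('utf-8')  # now you have bytes again
    print(type(bad))              # <class 'bytes'>
    print(bad)                    # b'...caf\xc3\xa9...' -- escapes, not text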
There are two ways to get around this:
- Use the raw undecoded bytes, which are stored in `r.content`, as Martijn suggested. Then you can decode them yourself to turn them into characters.
- Let `requests` do the decoding, but make sure it uses the right codec. Since you know that's UTF-8 in this case, you can set `r.encoding = 'utf-8'`. If you do this before you access `r.text`, then when you do access it, it will have been properly decoded and you get a character string. You don't need to mess with character encodings at all; both options are sketched below.
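A short sketch of both options, assuming the page really is UTF-8 (the URL is a stand-in):

    import requests

    r = requests.get('https://example.com/')  # stand-in URL

    # Option 1: decode the raw bytes yourself
    text = r.content.decode('utf-8')

    # Option 2: tell requests the right codec before touching r.text
    r.encoding = 'utf-8'
    text = r.text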
Incidentally, Python 3 makes it somewhat easier to maintain the difference between character strings and byte strings, because it requires you to use different types of objects to represent them.
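For example, mixing the two types in Python 3 fails loudly instead of silently producing Mojibake:

    >>> 'café' + b' bytes'
    Traceback (most recent call last):
      ...
    TypeError: can only concatenate str (not "bytes") to str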