IncompleteRead using httplib

At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib which is where the exception is being thrown.

Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:

Click to copy

>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
 'Content-Type: text/xml; charset=utf-8\r\n',
 'Server: Microsoft-IIS/7.5\r\n',
 'X-AspNet-Version: 4.0.30319\r\n',
 'X-Powered-By: ASP.NET\r\n',
 'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
 'Via: 1.1 BC1-ACLD\r\n',
 'Transfer-Encoding: chunked\r\n',
 'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)

So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes it works:

Click to copy

>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'

Obviously, this is only useful if we always know the exact length ahead of time. We can use the fact the partial read is returned as an attribute on the exception to capture the entire contents:

Click to copy

>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> print contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'

This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:

Click to copy

import httplib

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial

    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)

I applied the patch and then feedparser worked:

Click to copy

>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
 'encoding': 'utf-8',
 'entries': ...
 'status': 200,
 'version': 'rss20'}

This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocols to say for sure whether the server is doing things wrong, or whether httplib is mis-handling an edge case.

I find out in my case, send a HTTP/1.0 request , fix the problem, just adding this to the code:

Click to copy

import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

after I do the request :

Click to copy

req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()

after I back to http 1.1 with (for connections that support 1.1) :

Click to copy

httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'

IncompleteRead using httplib

Tags:

Python

Feedparser

Httplib

Related

Recent Posts