python requests.get always get 404
One thing to note: I was using requests.get() to do some web scraping off of links I was reading from a file. What I didn't realise was that each link still had a trailing newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
    import requests

    # Open for reading ('w' would truncate the file); splitlines() drops \r and \n.
    with open("filepath", "r") as file:
        links = file.read().splitlines()

    for link in links:
        response = requests.get(link)
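A slightly more defensive variant of the same idea, as a minimal sketch: it also drops blank lines and stray spaces, which splitlines() alone keeps. The helper names clean_links and read_links are made up for this example.

```python
from pathlib import Path


def clean_links(lines):
    # strip() removes leading/trailing whitespace, including \r and \n;
    # lines that are empty after stripping are skipped.
    return [line.strip() for line in lines if line.strip()]


def read_links(path):
    # `path` is a placeholder for a text file with one URL per line.
    with Path(path).open("r", encoding="utf-8") as handle:
        return clean_links(handle)


print(clean_links(["https://example.com/a\n", "\r\n", " https://example.com/b\r\n"]))
# ['https://example.com/a', 'https://example.com/b']
```

Each cleaned link can then be passed to requests.get() as before.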
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
- Record all aspects of the working request
- Record all aspects of the failing request
- Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
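One way to do the comparison, as a rough sketch: collect the headers each client sent into a dict and diff them. The diff_headers helper and the sample header values below are invented for illustration; only the approach comes from the text above.

```python
def diff_headers(working, failing):
    """Return {name: (working_value, failing_value)} for headers that differ.

    Names are lower-cased first, because HTTP header names are case-insensitive.
    """
    w = {name.lower(): value for name, value in working.items()}
    f = {name.lower(): value for name, value in failing.items()}
    return {
        name: (w.get(name), f.get(name))
        for name in sorted(w.keys() | f.keys())
        if w.get(name) != f.get(name)
    }


# Hypothetical captures: one from the working client, one from Python.
browser_headers = {"User-Agent": "Mozilla/5.0", "Accept": "*/*"}
python_headers = {"user-agent": "python-requests/2.x", "Accept": "*/*"}
print(diff_headers(browser_headers, python_headers))
# {'user-agent': ('Mozilla/5.0', 'python-requests/2.x')}
```

To get the Python side, something like requests.get("http://httpbin.org/headers").json()["headers"] should return the headers httpbin received, assuming the service is reachable and still responds in that shape.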
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
- Host; this must be set to the hostname you are contacting, so that the server can properly multi-host different sites. requests sets this one.
- Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
- Connection: leave this to the client to manage.
- Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
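To see what requests derives from your arguments without sending anything, you can prepare a request and inspect its headers. This is only a sketch: example.com and the field names are placeholders.

```python
import requests

# prepare() builds the request in memory; nothing is sent over the network.
form = requests.Request(
    "POST", "https://example.com/submit", data={"q": "python"}
).prepare()
print(form.headers["Content-Type"])    # application/x-www-form-urlencoded
print(form.headers["Content-Length"])  # 8  (the body is "q=python")

# Passing files= switches to multipart; requests generates the boundary itself.
upload = requests.Request(
    "POST", "https://example.com/submit", files={"file": ("a.txt", b"hello")}
).prepare()
print(upload.headers["Content-Type"])  # multipart/form-data; boundary=<generated>
```

If the Content-Type the browser sends differs from what you see here, changing data=, json= or files= is usually a better fix than overriding the header by hand.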
Everything else is fair game, but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting the header to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
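If you make more than one request, it may be more convenient to set the header once on a session. A minimal sketch, where make_session is a made-up helper name and "Custom" is simply the value used above:

```python
import requests


def make_session(user_agent="Custom"):
    # Headers set on the session are merged into every request it sends,
    # and the session also keeps cookies between requests.
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session()
# prepare_request() shows what would be sent without contacting the site.
prepared = session.prepare_request(requests.Request("GET", "https://rent.591.com.tw"))
print(prepared.headers["User-Agent"])  # Custom
```

After that, session.get('https://rent.591.com.tw') sends the custom value automatically.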
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1. Take that into account if you are trying to scrape data from this site.
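As a sketch, the same AJAX URL can be built from a base URL plus a params dict, which makes it easier to vary the query later. Whether the endpoint also expects cookies or extra headers from the initial page load is not something I have verified, so treat this as a starting point only.

```python
import requests

AJAX_URL = "https://rent.591.com.tw/home/search/rsList"
AJAX_PARAMS = {"is_new_list": 1, "type": 1, "kind": 0, "searchtype": 1, "region": 1}


def build_ajax_request(session):
    # Build (but do not send) the request, so the final URL can be checked first.
    request = requests.Request("GET", AJAX_URL, params=AJAX_PARAMS)
    return session.prepare_request(request)


session = requests.Session()
session.headers.update({"User-Agent": "Custom"})
prepared = build_ajax_request(session)
print(prepared.url)
# response = session.send(prepared)  # uncomment to actually send it
```

Using the same session for the initial page and the AJAX call means any cookies set by the first response are passed along to the second.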
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
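The GET-then-POST order could look roughly like this. It is a sketch under assumptions: the token is assumed to live in a hidden form input, and the helper names and URLs are hypothetical. Real sites may put the token in a cookie, a meta tag or a header instead.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def submit_form(session, form_url, action_url, data):
    # GET first: this sets cookies on the session and yields the hidden token.
    page = session.get(form_url)
    page.raise_for_status()
    # Send the hidden fields back together with our own form data.
    payload = {**hidden_fields(page.text), **data}
    return session.post(action_url, data=payload)
```

Here session is expected to be a requests.Session(), so that the cookies from the GET are sent along with the POST.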
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least fixing that changed the status code from 404 to 500, which, I think, is progress :)