Legal or ethical pitfalls for a web crawler?
Ethical: you should honour the robots.txt protocol so that you respect the site owners' wishes. The Python standard library includes the robotparser module (urllib.robotparser in Python 3) for exactly this purpose.
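A minimal sketch of how that check looks, assuming Python 3; the site and bot name are placeholders:

```python
from urllib import robotparser

# Load the target site's robots.txt (example.com is a placeholder).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether our User-Agent token may fetch a path before requesting it.
if rp.can_fetch("MyCrawler", "https://example.com/some/page.html"):
    print("Allowed: fetch the page")
else:
    print("Disallowed by robots.txt: skip this URL")
```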
Yes, you should (expect to be IP-banned for screen-scraping for unauthorised syndication, that is). Moreover, the less scrupulous, more creative site owners will, instead of blocking your robot, either attempt to crash or confuse it by sending it malformed data, or deliberately send it false data.
If your business model is based on unauthorised screen-scraping, it will fail.
Normally it is in the site owners' interests to allow you to screen-scrape, so you can ask for permission (they are unlikely to build a stable API for you, though, unless you pay them a lot of money to do so).
If they don't give you permission, you probably shouldn't scrape.
Some tips:
- Give the admins of sites you are authorised to syndicate a mechanism to ask you to stop scraping their site, in case your bot causes them operational problems. This could be an email address, but please monitor it.
- If you cannot contact the site owner to get permission, make sure it is easy for them to contact you should the need arise: put a URL or email address in the robot's User-Agent string (see the sketch after this list).
- Make it clear what the purpose of your screen-scraping is, and what your retention and other policies are.
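A hedged sketch of an identifiable User-Agent, using only the standard library; the bot name, contact URL, and email address are placeholders you would replace with your own:

```python
import urllib.request

# Hypothetical bot identity; substitute your own info page and address.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info; bot-admin@example.com)"

req = urllib.request.Request(
    "https://example.com/page.html",
    headers={"User-Agent": USER_AGENT},
)
with urllib.request.urlopen(req) as resp:
    html = resp.read()
```

An admin who spots ExampleBot in their access logs can then reach you before reaching for the ban hammer.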
If you do it all in good faith, transparently, you are unlikely to be blocked by a human unless they decide what you're doing is fundamentally against their business model.
If you behave in an underhand, cloak-and-dagger way, you can expect hostility.
Also note that some data is proprietary and is considered by its owners to be intellectual property. Sites such as currency-exchange sites, search engines, and stock-market trackers particularly dislike their data being crawled, since their business is essentially selling the very data you're crawling.
That being said, in the US you cannot copyright the data itself, just how you format it. So according to US law it's OK to keep crawled data as long as you don't store it in its original formatting (HTML).
But in many European countries the data itself can be copyrighted. And the web is a global beast: people from Europe can visit your site, which according to the law of some countries means that you are doing business in those countries. So even if you are legally protected in the US, it doesn't mean you won't get sued elsewhere in the world.
My advice is to go through the site and read its usage policy. If the site explicitly disallows crawling, then you shouldn't do it. And, as Jim mentioned, respect robots.txt.
Then again, there is ample legal precedent from courts around the world that makes search engines legal. And search engines are themselves voracious web crawlers. On the other hand it looks like almost every year at least one news agency sues or tries to sue Google for web crawling.
With all the above in mind, be very careful what you do with crawled data. I would say private use is OK as long as you don't overload the servers. I do it regularly myself to get TV programming schedules and the like.
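As for not overloading the servers, the simplest polite behaviour is a fixed delay between requests. This is only a sketch under assumed values; the bot name, delay, and URLs are placeholders, and a fuller crawler would also honour any Crawl-delay directive in robots.txt:

```python
import time
import urllib.request

# Placeholder identity and a conservative fixed delay between requests.
USER_AGENT = "ExampleBot/1.0 (bot-admin@example.com)"
CRAWL_DELAY_SECONDS = 5.0

def polite_fetch(urls):
    """Fetch each URL in turn, pausing between requests to spare the server."""
    pages = []
    for url in urls:
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            pages.append(resp.read())
        time.sleep(CRAWL_DELAY_SECONDS)  # throttle; don't hammer the site
    return pages
```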