Normal usage vs. denial-of-service? How many requests are needed to talk about a denial of service?
Enough to cause the service to be denied to someone. It might be one unexpected malicious request that causes excessive load on the server. It might be several million expected requests from a TV advert that got a really good response.
There isn't a specific value, since all servers fail at different levels - serving static content is a lot easier on the server than generating highly customised content for each user, so authenticated services will generally have a lower "problem" threshold than unauthenticated ones. Servers sending the same file to multiple users may be able to handle more traffic than servers sending distinct files to each user, since they can hold the file in memory. A server with a fast connection to the internet will usually be able to handle more traffic than one with a slow connection, although that matters less if the generated traffic is CPU bound rather than bandwidth bound.
I've seen systems that fail at 3 requests per second. I've also seen systems which handle everything up to 30,000 requests per second without breaking a sweat. What would be a DoS to the first, would be a low traffic period to the second...
In response to the updated question:
How do firewall providers determine when traffic is causing a denial of service?
Usually, they watch the response times from the server and throttle traffic if they go above a pre-set limit (which can be chosen on a technical basis, or on a marketing basis: waiting x seconds causes people to leave), or if the server's responses change from successful (200) to server failure (5xx).
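As a rough sketch of that kind of check (the probe URL, the 3-second limit, and the decision to throttle on any 5xx are illustrative assumptions, not how any particular vendor actually does it):

```python
import time
import urllib.request
from urllib.error import HTTPError, URLError

# Illustrative thresholds - a real appliance would make these configurable.
MAX_RESPONSE_SECONDS = 3.0                 # "waiting x seconds causes people to leave"
PROBE_URL = "https://example.com/health"   # hypothetical health-check URL

def origin_looks_overloaded() -> bool:
    """Return True if the origin is slow to answer or is returning 5xx errors."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=10):
            return time.monotonic() - start > MAX_RESPONSE_SECONDS
    except HTTPError as err:
        return 500 <= err.code < 600   # server failure, not a client mistake
    except URLError:
        return True                    # no answer at all - treat as overloaded

if origin_looks_overloaded():
    print("Origin is slow or erroring: start throttling new connections.")
```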
What is a legal definition of "denial of service"?
Same as the original one I gave - it's not denial of service if service has not been denied. It might be abusive, but that wouldn't be quite the same thing.
When you download (or scrape) a website, you are basically sending a GET request for each URL in the target website.
This is an example of a GET request to the World Wide Web Consortium website:
GET /pub/WWW/TheProject.html HTTP/1.1
Host: www.w3.org
As you can see, the main issue is not the request but the response of the webserver, which sends you the whole resource identified by the given URL.
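As a small illustration of that point, here is a sketch that sends essentially the same request with Python's standard library and reports how large the single response is (the host and path are just the ones from the example above, assuming the page is still served there):

```python
import http.client

# Send the same request shown above and measure the size of the response body.
conn = http.client.HTTPSConnection("www.w3.org")
conn.request("GET", "/pub/WWW/TheProject.html")
response = conn.getresponse()
body = response.read()
conn.close()

# The request itself was a few dozen bytes; the response body is what costs
# the server its bandwidth.
print(f"Status: {response.status}, body size: {len(body)} bytes")
```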
Therefore, we can say that
Max number of requests per second = (Factor of Safety * Bandwidth) / (Max size of a webpage)
Judging from a quick Google search, the average size of a webpage is about 2 MB (roughly 16 megabits), and the bandwidth of a web server can range from a few Mbps to a few Tbps.
The factor of safety reflects the fact that, in order to cause a DoS, you may not need to send requests corresponding to 100% of the bandwidth. For example, if the web server has a 100 Mbps connection and 50% of it is already being used by other users at a given instant, it is enough to send requests corresponding to 50%, or even less, of the bandwidth.
50% of 100 Mbps = 50 Mbps, which corresponds to roughly 3 average-sized pages per second (50 Mbps / 16 Mb ≈ 3).
On the other hand, if no one else is visiting the website, you would need to use at least 80% of the bandwidth in order to cause a DoS, and 80% of 100 Mbps = 80 Mbps, which corresponds to about 5 GET requests per second.
Clearly, in order to (unintentionally) DoS a huge website with 1 Tbps of bandwidth, you would need to send at least (80% of 1 Tbps) / 16 Mb ≈ 50,000 GET requests per second. And so on.
In order to have a more accurate measurement, you would need to find the maximum size of a webpage in the target website and its bandwidth.
Warning: since you could potentially get in trouble for causing a denial of service, it is better to round down the number of requests per second you obtain from the previous formula.
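If it helps, here is the same back-of-the-envelope estimate as a small Python sketch, using the illustrative figures above (a ~2 MB / 16 Mb average page and purely hypothetical bandwidth values):

```python
import math

def max_requests_per_second(bandwidth_mbps: float,
                            avg_page_megabits: float,
                            safety_factor: float) -> int:
    """Bandwidth-saturation threshold: (factor of safety * bandwidth) / page size.

    Rounded down, as suggested above, so the estimate errs on the cautious side.
    """
    return math.floor(safety_factor * bandwidth_mbps / avg_page_megabits)

AVG_PAGE_MEGABITS = 16  # an average ~2 MB page expressed in megabits

# Idle 100 Mbps server: the link saturates at roughly 5 pages per second.
print(max_requests_per_second(100, AVG_PAGE_MEGABITS, 0.8))        # 5

# Same server already at 50% load: only ~3 pages per second of headroom remain.
print(max_requests_per_second(100, AVG_PAGE_MEGABITS, 0.5))        # 3

# A 1 Tbps site: on the order of 50,000 pages per second before saturation.
print(max_requests_per_second(1_000_000, AVG_PAGE_MEGABITS, 0.8))  # 50000
```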
Let's take a look at your question from both angles.
From the host
Something becomes a DoS when the amount of traffic, or what that traffic is doing, causes the server to be unavailable for others. A few examples:
- Running a long-running report 500 times
- Smashing refresh really fast on a website that can't handle it
- Using your larger bandwidth to fill their upload pipe so full that other users lose speed
- Scraping the website in a way that causes their host to be unresponsive to others
All these examples are possible, but not likely. When we talk about a DoS attack, we are talking about one person/client doing all this, and most web servers are set up to handle hundreds or thousands of requests at the same time. That's why DDoS is so popular: it takes more than one client to overload a normal server (under normal circumstances).
To add complication, many clients may start using your site for the first time after some marketing. Sometimes it's not even your marketing that triggers it. For example, a popular cell phone release may cause a spike in traffic on your how-to site. It can be very difficult to tell DDoS traffic from legitimate traffic.
There are a few ground rules, though. What you're basically looking for is abnormal usage.
- Are there users that are downloading way more than others?
- Are there users that are staying connected way longer than others?
- Are there users that are re-connecting way more than others?
These guidelines, and others like them, can help you figure out which traffic is part of a DDoS attack and apply some kind of filter.
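As a minimal sketch of the first check, assuming you already have an access log parsed into (client IP, bytes sent) pairs; the 10x-the-median cutoff is an illustrative choice, not a standard:

```python
from collections import defaultdict
from statistics import median

def heavy_downloaders(records, multiple_of_median=10):
    """Flag clients whose total download volume is far above the typical client.

    `records` is an iterable of (client_ip, bytes_sent) pairs taken from an
    access log; "far above" means more than `multiple_of_median` times the
    median per-client volume, which is an illustrative cutoff, not a standard.
    """
    totals = defaultdict(int)
    for client_ip, bytes_sent in records:
        totals[client_ip] += bytes_sent

    typical = median(totals.values())
    return [ip for ip, total in totals.items()
            if total > multiple_of_median * typical]

# Example: one client pulling far more data than everyone else gets flagged.
log = [("10.0.0.1", 5_000), ("10.0.0.2", 7_000),
       ("10.0.0.3", 6_000), ("10.0.0.4", 900_000_000)]
print(heavy_downloaders(log))  # ['10.0.0.4']
```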
From the User's POV
When deciding to scrape a website you should check first and see if they have a policy. Some sites do, and some do not. Some sites will consider it theft, and others not. If a site does not have a policy then you have to make your own call.
Your goal, if they do not have a stated policy, is to clearly state that you're scraping (don't mask the user agent or headers that your tool might be using), and to have as little impact as possible. Depending on your need for scraping, can you scrape just a few pages, or do you really need the entire site? Can you scrape at a "normal user" rate, maybe 1 page every 5 seconds or so (including media content)? If you want to capture the data fast, can you capture just the text and skip the images and other media? Can you exclude long-running queries and larger media files?
Your overall goal here is to be respectful of the host's hosting costs and of the other users of the site. Slower is usually better in this case. If possible, contact the website owner and ask them. And no matter what, follow the rules in the robots.txt file. It can contain rate limits and page limits that you should follow.
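Here is a sketch of what that can look like in practice; the site, user agent, page list, and the 5-second fallback delay are all illustrative, and robotparser only honours whatever limits the site itself publishes in robots.txt:

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin

SITE = "https://example.com"      # illustrative target site
USER_AGENT = "example-research-scraper (contact: you@example.com)"  # say who you are
PAGES = ["/", "/about", "/docs"]  # scrape only what you actually need

# Honour robots.txt: skip disallowed paths and respect any published Crawl-delay.
robots = urllib.robotparser.RobotFileParser(urljoin(SITE, "/robots.txt"))
robots.read()
delay = robots.crawl_delay(USER_AGENT) or 5  # fall back to ~1 page every 5 seconds

for path in PAGES:
    url = urljoin(SITE, path)
    if not robots.can_fetch(USER_AGENT, url):
        continue  # the site has asked crawlers to stay away from this path
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        page = response.read()
    # ... save `page` somewhere ...
    time.sleep(delay)  # spread the load instead of hammering the host
```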