Detecting Slashdot effect in nginx
I think this would be far better done with logtail and grep. Even if it's possible to do with lua inline, you don't want that overhead for every request and you especially don't want it when you have been Slashdotted.
Here's a 5-second version. Stick it in a script and put some more readable text around it and you're golden.
*/5 * * * * logtail -f /var/log/nginx/access_log -o /tmp/nginx-logtail.offset | grep -c -E "https?://[^ ]*slashdot\.org"
Of course, that completely ignores reddit.com and facebook.com and all of the million other sites that could send you lots of traffic. Not to mention 100 different sites sending you 20 visitors each. You should probably just have a plain old traffic threshold that causes an email to be sent to you, regardless of referrer.
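A plain threshold check could look roughly like the sketch below. It is untested against a real server; the function name check_traffic, the threshold of 5000, the offset file name and the mail address are all placeholders I made up, so adjust them to your setup.

```shell
#!/bin/sh
# Sketch only: check_traffic is a hypothetical helper, not part of logtail or nginx.
# Reads log lines on stdin, prints how many it saw, and returns 0 (meaning
# "alert") only when that count is above the threshold given as $1.
check_traffic() {
    threshold=$1
    count=$(wc -l | tr -d ' ')   # tr strips the padding some wc versions add
    echo "$count"
    [ "$count" -gt "$threshold" ]
}

# Possible cron usage (paths, threshold and address are assumptions):
#   count=$(logtail -f /var/log/nginx/access_log -o /tmp/nginx-traffic.offset | check_traffic 5000) \
#       && echo "$count requests since the last run" | mail -s "nginx traffic spike" you@example.com
```

Because logtail only prints lines added since the previous run, the count is per cron interval, so the threshold should be chosen relative to how often the job runs.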
The nginx limit_req_zone directive can base its zones on any variable, including $http_referer (nginx keeps the single-r spelling of the HTTP Referer header).
http {
    limit_req_zone $http_referer zone=one:10m rate=1r/s;
    ...
    server {
        ...
        location /search/ {
            limit_req zone=one burst=5;
        }
    }
}
You will also want to limit the amount of state required on the web server, though, as Referer headers can be quite long and varied, and you may see an effectively unlimited variety of them. You can use the nginx split_clients feature to set a variable for all requests that is based on the hash of the Referer header. The example below uses only 10 buckets, but you could do it with 1000 just as easily. So if you got slashdotted, people whose referrer happened to hash into the same bucket as the Slashdot URL would get blocked too, but you could limit that to 0.1% of visitors by using 1000 buckets in split_clients.
It would look something like this (totally untested, but directionally correct):
http {
    split_clients $http_referer $refhash {
        10% x01;
        10% x02;
        10% x03;
        10% x04;
        10% x05;
        10% x06;
        10% x07;
        10% x08;
        10% x09;
        *   x10;
    }
    limit_req_zone $refhash zone=one:10m rate=1r/s;
    ...
    server {
        ...
        location /search/ {
            limit_req zone=one burst=5;
        }
    }
}
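To get a feel for the bucket collisions described above, here is a small shell illustration. It hashes a referer into one of N buckets with the POSIX cksum CRC. That is only a stand-in: nginx uses its own hash inside split_clients, so the bucket numbers will not match what nginx picks. The name ref_bucket is made up for this sketch.

```shell
#!/bin/sh
# Illustration only: ref_bucket is a hypothetical helper. cksum (CRC) stands in
# for nginx's internal hash, so the results will differ from split_clients.
# $1 = referer string, $2 = number of buckets; prints a bucket index 0..N-1.
ref_bucket() {
    printf '%s' "$1" | cksum | awk -v n="$2" '{ print $1 % n }'
}

# Example:
#   ref_bucket "http://slashdot.org/" 1000
#   ref_bucket "http://example.com/some/page" 1000
# Any two referers that print the same index would share one rate limit.
```

With 10 buckets roughly one referer in ten shares a limit with the flooding one; with 1000 buckets it is roughly one in a thousand, which is where the 0.1% figure comes from.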
The most efficient solution might be to write a daemon that would tail -f the access.log, and keep track of the $http_referer field.
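A very rough version of such a watcher can be sketched in shell with awk, assuming the log uses nginx's default "combined" format, where the referer is the fourth double-quote-delimited field. The function name watch_referers and the limit of 500 are placeholders, and the counts here are cumulative; a real daemon would want to reset or decay them over time.

```shell
#!/bin/sh
# Sketch only: watch_referers is a hypothetical helper.
# Assumes the "combined" log format, so splitting on double quotes puts the
# request in field 2 and the referer in field 4.
# Prints "ALERT <referer> <count>" once, when a referer's count reaches $1.
watch_referers() {
    awk -F'"' -v limit="$1" '
        $4 != "" && $4 != "-" {
            if (++seen[$4] == limit) {
                print "ALERT", $4, seen[$4]
                fflush()   # push the line out immediately when piped
            }
        }'
}

# Possible usage (path and limit are assumptions):
#   tail -F /var/log/nginx/access.log | watch_referers 500
```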
However, a quick and dirty solution would be to add an extra access_log file, to log only the $http_referer variable with a custom log_format, and to automatically rotate the log every X minutes.
This can be accomplished with the help of standard logrotate scripts, which might need to do graceful restarts of nginx (or send it a USR1 signal) in order to have the files reopened (this is the standard procedure; take a look at /a/15183322 on SO for a simple time-based script)…
Or, by using variables within access_log, possibly by getting the minute specification out of $time_iso8601 with the help of the map or an if directive (depending on where you'd like to put your access_log).
So, with the above, you may have 6 log files, each covering a period of 10 minutes, http_referer.Txx{0,1,2,3,4,5}x.log, e.g., by getting the first digit of the minute to differentiate each file.
Now, all you have to do is have a simple shell script that could run every 10 minutes, cat all of the above files together, pipe it to sort, pipe it to uniq -c, to sort -rn, to head -16, and you have a list of the 16 most common Referer variations. You are then free to decide whether any combination of counts and fields exceeds your criteria, and to send a notification.
Subsequently, after a single successful notification, you could remove all of these 6 files, and, in subsequent runs, not issue any notification UNLESS all six of the files are present (and/or a certain other number as you see fit).
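Put together, the cron script might look something like the sketch below. The helper names, the MIN_HITS threshold, the file glob and the mail address are all assumptions for illustration. Each log file is expected to hold one Referer value per line.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical helpers for illustration.

# Print the 16 most common lines from the given files as "count referer".
top_referers() {
    cat "$@" | sort | uniq -c | sort -rn | head -16
}

# Succeed only if exactly six existing files were passed in, so that a new
# alert is not sent until a full set of windows has accumulated again.
all_windows_present() {
    [ "$#" -eq 6 ] || return 1
    for f in "$@"; do
        [ -f "$f" ] || return 1
    done
}

# Filter top_referers output on stdin, keeping rows whose count is >= $1.
over_limit() {
    awk -v min="$1" '$1 >= min'
}

# Possible usage (glob, threshold and address are assumptions):
#   set -- /var/log/nginx/http_referer.*.log
#   if all_windows_present "$@"; then
#       hits=$(top_referers "$@" | over_limit "${MIN_HITS:-1000}")
#       if [ -n "$hits" ]; then
#           printf '%s\n' "$hits" | mail -s "referer spike" you@example.com \
#               && rm -f "$@"
#       fi
#   fi
```

Removing the files only after the mail command succeeds matches the idea above: one notification per spike, then silence until all six windows have filled up again.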