Use AWK to read duplicates in a column
Using GNU awk for mktime():
$ cat tst.awk
BEGIN { FS = "|" }
(++count[$2]) ~ /^[15]$/ {
split($1,t,"[/:]")
monthNr = (index("JanFebMarAprMayJunJulAugSepOctNovDec",t[2])+2)/3
currSecs = mktime(t[3] " " monthNr " " t[1] " " t[4] " " t[5] " " t[6])
if ( count[$2] == 1 ) {
firstSecs[$2] = currSecs
}
else if ( (currSecs - firstSecs[$2]) < 15 ) {
print $2
}
}
$ awk -f tst.awk file
000.111.026.111
060.121.125.144
I think it's very clear what it's doing, so no need to add text explaining it, but if you have any questions please feel free to ask.
Oh, and you mentioned in a comment wishing you knew a way to convert your IP addresses to dummy values so you could post a more comprehensive example. Well, here's one way that'd be good enough for your specific problem:
$ awk '
BEGIN { FS=OFS="|" }
!($2 in map) { ip=sprintf("%012d",++cnt); gsub(/.../,"&.",ip); sub(/.$/,"",ip); map[$2]=ip }
{ $2=map[$2]; print }
' file
29/Oct/2020:07:41:42|000.000.000.001|200|/page-a/
29/Oct/2020:08:30:40|000.000.000.002|200|/page-a/
29/Oct/2020:08:30:44|000.000.000.002|200|/page-b/
29/Oct/2020:08:30:45|000.000.000.002|200|/page-c/
29/Oct/2020:08:30:47|000.000.000.002|200|/page-d/
29/Oct/2020:08:30:47|000.000.000.003|200|/page-h/
29/Oct/2020:08:30:48|000.000.000.002|200|/page-e/
29/Oct/2020:07:41:49|000.000.000.004|200|/page-a/
29/Oct/2020:08:41:52|000.000.000.005|200|/page-f/
29/Oct/2020:08:41:52|000.000.000.005|200|/page-g/
29/Oct/2020:08:41:54|000.000.000.002|200|/page-k/
29/Oct/2020:08:41:55|000.000.000.005|200|/page-l/
29/Oct/2020:08:41:57|000.000.000.005|200|/page-n/
29/Oct/2020:08:41:58|000.000.000.005|200|/page-s/
Edit: here's how you could have started to investigate the difference between the output my script produces and the output produced by the version of Dave's script you ran:
$ awk -f morton-botfilter.awk.txt output3test.csv > morton.out
$ awk -f dave-botfilter.awk.txt output3test.csv > dave.out
$ ip=$(comm -13 <(sort morton.out) <(sort dave.out) | head -1)
$ grep "$ip" output3test.csv | head -5
03/Nov/2020:07:52:55|000.000.000.007|200|/page-7/
03/Nov/2020:08:05:32|000.000.000.007|200|/page-11/
03/Nov/2020:11:28:56|000.000.000.007|200|/page-77/
03/Nov/2020:13:52:32|000.000.000.007|200|/page-143/
03/Nov/2020:13:52:33|000.000.000.007|200|/page-144/
Note that there is far more than 15 seconds between the first and last timestamps above, which tells you that the script in dave-botfilter.awk.txt is broken. See the comments below for more info.
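If you want to check that numerically rather than by eye, here's a quick sanity-check sketch (it reuses the same mktime() conversion as the script above, so it assumes GNU awk, and it relies on the ip shell variable set earlier):
$ awk -F'|' -v ip="$ip" '
    # convert the timestamps of the first 5 hits for this IP to epoch seconds
    $2 == ip && ++n <= 5 {
        split($1,t,"[/:]")
        m = (index("JanFebMarAprMayJunJulAugSepOctNovDec",t[2])+2)/3
        secs = mktime(t[3] " " m " " t[1] " " t[4] " " t[5] " " t[6])
        if (n == 1) first = secs
    }
    END { print secs - first, "seconds between 1st and 5th hit for", ip }
' output3test.csv
A result far above 15 confirms that a correct script should not have reported that IP.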
Since you want to learn awk, and apparently have GNU awk (gawk), you can run
awk -f script <logfile
where script contains
BEGIN{ split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",n2m);
for(i=1;i<=12;i++) m2n[n2m[i]]=i; FS="|"; }
function fixtime(str ,tmp){ split(str,tmp,"[:/]");
return mktime(tmp[3] OFS m2n[tmp[2]] OFS tmp[1] OFS tmp[4] OFS tmp[5] OFS tmp[6]) }
++count[$2]==1 { first[$2]=fixtime($1) }
count[$2]==5 && fixtime($1)-first[$2]<15 { print $2 }
The first two lines set up an array m2n (month to number) which maps Jan to 1, Feb to 2, etc., and also set the field delimiter to |. (It could instead do m2n["Jan"]=1; m2n["Feb"]=2; etc., but that's more tedious.)
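If you want to see that mapping on its own, here's a throwaway sketch (works in any awk):
$ awk 'BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", n2m)   # n2m: number -> month name
    for (i = 1; i <= 12; i++) m2n[n2m[i]] = i                       # m2n: month name -> number
    print m2n["Jan"], m2n["Oct"], m2n["Dec"]                        # prints: 1 10 12
}'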
The next two lines define a function which splits your time format using all / and : as delimiters (without needing to first translate them to space), converts the month name to a number, reorders as needed, and feeds the result to mktime() (gawk only). Instead of OFS (which defaults to one space and hasn't been changed) you can use a literal " ", but I find that uglier.
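To see what the function does with a single timestamp, here is a one-off sketch (gawk only, because of mktime(); the month number 10 is hardcoded here where the script would look it up in m2n). It prints the corresponding epoch seconds, whose exact value depends on your timezone:
$ awk 'BEGIN {
    split("29/Oct/2020:07:41:42", tmp, "[:/]")
    # tmp[1]="29" tmp[2]="Oct" tmp[3]="2020" tmp[4]="07" tmp[5]="41" tmp[6]="42"
    print mktime(tmp[3] OFS 10 OFS tmp[1] OFS tmp[4] OFS tmp[5] OFS tmp[6])
}'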
The fifth line finds the first occurrence of any IPaddr and remembers its timestamp; the sixth finds the fifth occurrence of the same IPaddr and compares its timestamp to the remembered one to see if the interval is less than 15 seconds. Some people would put a ;next in the action on the fifth line to make clear that the fifth and sixth script lines will not execute on the same record (i.e. data line), but I didn't bother.
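For reference, that variant of the last two lines would be:
++count[$2]==1 { first[$2]=fixtime($1); next }   # remember the first timestamp, skip straight to the next input line
count[$2]==5 && fixtime($1)-first[$2]<15 { print $2 }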
QEF.
If you prefer, you can put the whole script on the commandline in '...' instead of using a script file, but I don't like doing that for more than about 100 characters.
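For example, the same script inlined (functionally identical, just harder to read; logfile is whatever your log file is called):
$ awk 'BEGIN{ split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",n2m)
              for(i=1;i<=12;i++) m2n[n2m[i]]=i; FS="|" }
       function fixtime(str ,tmp){ split(str,tmp,"[:/]")
         return mktime(tmp[3] OFS m2n[tmp[2]] OFS tmp[1] OFS tmp[4] OFS tmp[5] OFS tmp[6]) }
       ++count[$2]==1 { first[$2]=fixtime($1) }
       count[$2]==5 && fixtime($1)-first[$2]<15 { print $2 }' <logfile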