Speed up the computation of this sum of small matrices
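You can write the summation of the outer products as a single matrix multiplication. With the vectors stored as the rows of mat, summing the outer products of row i with row i + p is the same as multiplying the transpose of the first n - p rows by the last n - p rows: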
f2[p_] := Transpose[mat[[;; -(p + 1)]]].mat[[p + 1 ;;]]
ff[p_] := Inner[Times, mat[[1 ;; n - p]], mat[[p + 1 ;; n]], Plus, 1]
This seems to be about 3x faster.
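As a sanity check, here is a minimal sketch that compares both definitions against an explicit sum of outer products on random data. The baseline `f1`, the sizes `n` and `d`, the lag `5`, and the random `mat` are assumptions for illustration and are not taken from the original question.

```mathematica
(* Assumed test data: n row vectors of length d; the sizes are arbitrary *)
n = 1000; d = 3;
SeedRandom[1];
mat = RandomReal[{-1, 1}, {n, d}];

(* Hypothetical baseline: explicit sum of outer products of rows i and i + p *)
f1[p_] := Sum[Outer[Times, mat[[i]], mat[[i + p]]], {i, 1, n - p}]

(* The two definitions from the thread *)
f2[p_] := Transpose[mat[[;; -(p + 1)]]].mat[[p + 1 ;;]]
ff[p_] := Inner[Times, mat[[1 ;; n - p]], mat[[p + 1 ;; n]], Plus, 1]

(* Both differences should be at rounding-error level *)
Max[Abs[f1[5] - f2[5]]]
Max[Abs[f1[5] - ff[5]]]

(* Timings in seconds; they depend on the machine and on n and d *)
First@RepeatedTiming[f1[5];]
First@RepeatedTiming[f2[5];]
First@RepeatedTiming[ff[5];]
```

Which version is fastest can change with the size of mat, so it is worth timing with dimensions close to the real problem.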