Speeding up generation of block diagonal matrix

You are right, it can be done in a fraction of second. One can explicitly construct an array of indexes

blockArray[mat_] := SparseArray[
  Tuples[Range@# - {1, 0, 0}].{Rest@#, {1, 0}, {0, 1}} &@Dimensions@mat -> 
   Flatten@mat]

Timings:

matrices = RandomReal[1, {48, 128, 128}];

s1 = 
   SparseArray@
    ArrayFlatten@ReleaseHold@DiagonalMatrix[Hold /@ matrices]; // RepeatedTiming
(* {7.56, Null} *)

s2 = SparseArray[Band@{1, 1} -> matrices]; // RepeatedTiming
(* {4.03, Null} *)

s3 = blockArray[matrices]; // RepeatedTiming
(* {0.097, Null} *)

TrueQ[s1 == s2 == s3]
(* True *)

For further acceleration you can create the internal structure of the SparseArray directly

c = Compile[{{b, _Integer}, {h, _Integer}, {w, _Integer}},
   Partition[Flatten@Table[k + i w, {i, 0, b - 1}, {j, h}, {k, w}], 1], 
   CompilationTarget -> "C", RuntimeOptions -> "Speed"];

blockArray2[mat_] := 
 SparseArray @@ {Automatic, # {##2}, 
     0, {1, {Range[0, 1 ##, #3], c@##}, Flatten@mat}} & @@ Dimensions@mat

s4 = blockArray2[matrices]; // RepeatedTiming
(* {0.031, Null} *)

s3 == s4
(* True *)

There is an undocumented built-in solution:

rules = {#, #} -> RandomReal[{}, {128, 128}] & /@ Range[48];
SparseArray`SparseBlockMatrix[rules]; // RepeatedTiming

(* {0.042, Null} *)