What problems could occur when chaining 40 shift registers?
Use Schmitt-trigger buffers at the inputs of each board. They will clean up the signals so that any noise won't give false pulses on the clock, for instance. The 74LVC3G17 is a triple non-inverting buffer.
Also, pass the buffered signals to the next board. Otherwise all inputs would be parallel and you may exceed the fan-out of the driving microcontroller (I'm especially thinking of the total capacitive load). The daisy chain of clock and latch signals will give a ripple delay throughout the chain, but the data will do so as well, and you plan to go for low speed anyway.
The problem that can occur is that one SR gets clocked before the next SR in the chain, so that the next SR clocks in the wrong (already shifted) data. A (standard?) solution for this is to wire the clock starting at the last SR, so that the clock edge travels against the direction of the data.
I would consider adding a (Schmitt-trigger?) buffer on each board for all 3 signal lines.
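To illustrate the race, here is a minimal Python sketch. It is an idealized worst case, not a timing-accurate model: it assumes the clock skew between stages is larger than the clock-to-output delay, so a stage that is clocked later sees the already updated output of its neighbour. The function name `shift_once` and the 4-stage chain are made up for the illustration.

```python
def shift_once(stages, data_in, clock_order):
    """Apply one clock edge to a chain of 1-bit stages.

    clock_order lists the stage indices in the order the clock edge
    reaches them. Assumption: skew exceeds the clock-to-output delay,
    so a later-clocked stage samples its neighbour's NEW value.
    """
    regs = list(stages)
    for i in clock_order:
        regs[i] = data_in if i == 0 else regs[i - 1]
    return regs


chain = [1, 0, 0, 0]  # a single 1 sitting in the first stage

# Clock wired from the first SR towards the last: the bit is lost.
first_to_last = shift_once(chain, 0, range(len(chain)))

# Clock wired from the last SR towards the first: a correct shift.
last_to_first = shift_once(chain, 0, reversed(range(len(chain))))

print(first_to_last)  # [0, 0, 0, 0]
print(last_to_first)  # [0, 1, 0, 0]
```

With the clock arriving at the last stage first, every stage samples its neighbour before that neighbour has changed, which is why the reversed wiring shifts correctly in this model.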
(edit) Lowering the clock frequency won't help (unless it was far too high to begin with). The problems you can have occur at the clock edges, which you will have anyway, no matter how low you choose your clock frequency.
The biggest issue when chaining shift registers is ensuring a predictable timing relationship between the clock each board uses for receiving data and the change in data from the previous board. The fact that the serial output of the 74HC595 changes on the same clock edge that is used to sample the data is a little annoying in that regard. I would suggest that the clock signal should be buffered as it goes through each board, and that the data signal coming out of one board's 74HC595 should be put through a buffer that delays it by slightly more than the clock buffer's delay.
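The delay requirement can be written as a simple hold-margin calculation. The Python sketch below is only an illustration: the function name, the parameter names and all the nanosecond values are made-up placeholders, not datasheet figures. For a real design you would plug in the minimum clock-to-output delay and the hold time from the datasheets of the parts you use.

```python
def hold_margin_ns(t_clk_to_q, t_data_buf, t_clk_buf, t_hold):
    """Hold margin at the next board's data input, in ns.

    Time zero is the clock edge at the sending board.
    t_clk_to_q : delay from that edge to the sender's serial output changing
    t_data_buf : extra delay added in the data path between boards
    t_clk_buf  : delay of the clock buffer feeding the next board
    t_hold     : hold time required by the receiving shift register
    A positive result means the old data stays valid long enough.
    """
    data_changes_at = t_clk_to_q + t_data_buf
    must_be_stable_until = t_clk_buf + t_hold
    return data_changes_at - must_be_stable_until


# Illustrative numbers only (assumptions, not from any datasheet).
print(hold_margin_ns(15, 0, 10, 3))   # 2   fast clock buffer: barely OK
print(hold_margin_ns(15, 0, 20, 3))   # -8  slow clock buffer: wrong bit
print(hold_margin_ns(15, 25, 20, 3))  # 17  data delayed more than clock
```

The margin depends only on delays, not on the clock period, so slowing the clock does not change it.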
Alternatively, you could use a shift register like the 74HC4094, which has a serial data output (QS2) that changes on the falling clock edge. Or you could add a flip flop between the output of the last 74HC595 on the board and the next board, and have that flip flop latch its output on the falling edge of the clock that drives the 74HC595s (perhaps pass the clock through two inverters to buffer it, and feed the inverted clock signal to the flip flop).
If you need at least one fewer output than your chips supply (e.g. on a board with two 74HC595s you actually only need 15 outputs), you could feed the last 74HC595 on a board with a clock inverted relative to the others. That would cost you the use of one 74HC595 output each time the signal passes between a non-inverted-clock 74HC595 and an inverted-clock 74HC595.
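As a rough illustration of why the falling-edge flip flop helps, here is a small Python sketch of the hand-off between two boards. It is a behavioural toy model under stated assumptions: the next board's clock is assumed to arrive late enough to see the sender's new output, but less than half a clock period late. The function and variable names are hypothetical.

```python
def send_bits(bits, retime):
    """Return what the next board samples on each rising clock edge.

    bits   : values shifted into the sender's last stage, one per clock
    retime : True if a falling-edge flip flop sits between the boards
    Assumption: the receiver's clock is skewed so that, without
    retiming, it samples the sender's output AFTER it has changed.
    """
    q_a = 0   # output of the sender's last stage
    ff = 0    # output of the falling-edge retiming flip flop
    received = []
    for b in bits:
        # Rising edge: the sender updates first, the receiver samples late.
        q_a = b
        received.append(ff if retime else q_a)
        # Falling edge: the flip flop copies the sender's settled output.
        ff = q_a
    return received


bits = [1, 0, 1, 1]
print(send_bits(bits, retime=False))  # [1, 0, 1, 1]  one stage skipped
print(send_bits(bits, retime=True))   # [0, 1, 0, 1]  correct one-clock delay
```

In this model the receiver should see each bit one clock after it entered the sender's last stage. Without retiming the bit falls straight through, while the flip flop holds the old value across the rising edge.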