DirectCompute optimal numthreads setup
It's pretty GPU-specific, but if you are on NVIDIA hardware you can try using the CUDA Occupancy Calculator.
I know you are using DirectCompute, but DirectCompute and CUDA both map to the same underlying hardware. If you look at the assembly listing that FXC outputs, you can see the shared memory size and the registers per thread. You can also deduce the compute capability from which card you have. Compute capability is the CUDA equivalent of shader profiles like cs_4_0, cs_4_1, cs_5_0, etc.
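The goal is to increase the "occupancy", i.e. the fraction of the thread slots on each multiprocessor that are actually filled with resident threads. In other words, occupancy == 100% - %idle-due-to-HW-overhead.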
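To make the occupancy idea concrete, here is a rough sketch of the kind of arithmetic the Occupancy Calculator does. It is not the calculator itself: the `SmLimits` defaults are illustrative assumptions (pick real values from the compute capability table for your card), the names `SmLimits` and `estimate_occupancy` are made up for this example, and register/shared-memory allocation granularity is ignored.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA hardware


@dataclass(frozen=True)
class SmLimits:
    """Per-multiprocessor limits.

    The defaults are illustrative assumptions, not data for a specific card.
    """
    max_warps: int = 48        # resident warps per multiprocessor
    max_groups: int = 8        # resident thread groups per multiprocessor
    registers: int = 32768     # 32-bit registers per multiprocessor
    shared_bytes: int = 49152  # shared memory per multiprocessor


def estimate_occupancy(threads_per_group, regs_per_thread,
                       shared_bytes_per_group, limits=SmLimits()):
    """Return resident warps / max warps for one multiprocessor (0.0 to 1.0)."""
    # Partial warps still occupy a whole warp slot, so round up.
    warps_per_group = -(-threads_per_group // WARP_SIZE)

    # How many groups fit under each individual resource limit.
    by_warps = limits.max_warps // warps_per_group
    regs_per_group = regs_per_thread * warps_per_group * WARP_SIZE
    by_regs = (limits.registers // regs_per_group
               if regs_per_group else limits.max_groups)
    by_shared = (limits.shared_bytes // shared_bytes_per_group
                 if shared_bytes_per_group else limits.max_groups)

    # The tightest limit decides how many groups are resident at once.
    groups = min(limits.max_groups, by_warps, by_regs, by_shared)
    return groups * warps_per_group / limits.max_warps


if __name__ == "__main__":
    print(estimate_occupancy(256, 16, 0))      # 256 threads, 16 regs, no shared memory
    print(estimate_occupancy(256, 16, 16384))  # same, but 16 KB shared memory per group
```

With these assumed limits, the first call is limited only by warp slots, while the second is limited by shared memory and drops to half the occupancy. Feed in the register and shared memory numbers you see in your own compiler output to find which resource is the bottleneck.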
Profiling is the only way to guarantee maximum performance on a particular piece of hardware. But as a general rule, as long as you keep your live register count low (16 or lower) and don't use a ton of shared memory, thread groups of exactly 256 threads should be able to saturate most compute hardware (assuming you're dispatching at least 8 or so groups).
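As a sketch of how that rule of thumb turns into a dispatch size, the snippet below works out how many 256-thread groups a 1D workload needs and whether that meets the "8 or so groups" guideline. The helper names `group_count` and `enough_work` are hypothetical, and the snippet assumes a shader declared with `[numthreads(256, 1, 1)]`; the 65535 cap is the Direct3D 11 limit on thread groups per dispatch dimension.

```python
GROUP_SIZE = 256            # assumed to match [numthreads(256, 1, 1)] in the shader
MAX_GROUPS_PER_DIM = 65535  # Direct3D 11 limit per Dispatch dimension


def group_count(element_count, group_size=GROUP_SIZE):
    """Number of thread groups needed to cover element_count items in 1D."""
    if element_count <= 0:
        raise ValueError("element_count must be positive")
    # Round up so the final, partially filled group is still dispatched.
    groups = (element_count + group_size - 1) // group_size
    if groups > MAX_GROUPS_PER_DIM:
        raise ValueError("too many groups for one dimension; split the dispatch")
    return groups


def enough_work(groups, min_groups=8):
    """True if the dispatch meets the rough 'at least 8 groups' guideline."""
    return groups >= min_groups


if __name__ == "__main__":
    for n in (1000, 2048, 1_000_000):
        g = group_count(n)
        print(n, g, enough_work(g))
```

The resulting count is what you would pass as the first argument to `Dispatch`. Because the count is rounded up, the last group can run past the end of the data, so the shader still needs a bounds check on the thread ID.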