How many CPUs should be utilised with Hyperthreading?
Solution 1:
CPU meters are very bad for telling you how much more performance you can squeeze out of your hyperthreaded CPUs. For that, you should run your own benchmarks at various physical-core over-subscription rates. There are some workloads that work best with HT completely turned off, so include that case in your testing as well. It could be a 1:2 (36 parallel workers), or 1:1.5, or even 1:2.5! It depends on your workload.
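As a starting point, here is a minimal Python sketch of such a benchmark harness. The worker function is a stand-in for your real job, and the 18-core count is an assumption; substitute a representative slice of your actual workload before trusting the numbers:

```python
import multiprocessing as mp
import time

def cpu_bound_worker(n):
    # Placeholder workload: replace with a representative slice of your real job.
    total = 0
    for i in range(n):
        total += i * i
    return total

def measure_throughput(workers, work_items=144, n=2_000_000):
    start = time.perf_counter()
    with mp.Pool(processes=workers) as pool:
        pool.map(cpu_bound_worker, [n] * work_items)
    elapsed = time.perf_counter() - start
    return work_items / elapsed  # jobs per second

if __name__ == "__main__":
    physical = 18  # assumed physical core count; adjust for your machine
    for ratio in (1.0, 1.5, 2.0, 2.5):
        workers = int(physical * ratio)
        print(f"1:{ratio} ({workers} workers): "
              f"{measure_throughput(workers):.2f} jobs/s")
```

Whichever ratio finishes the most jobs per second wins; don't assume the curve peaks at 1:2 just because that's what the core count suggests.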
In more detail, HT is implemented on the silicon in ways that reduce the time the processor spends idle when a context switch is needed or a branch prediction fails. This makes it easier to reach 100% execution-unit usage than with pure operating-system tricks. HT has evolved since its introduction, and there is more parallelism on modern chips than the ones we were using 10 years ago.
There are two execution profiles that will affect where your optimal over-subscription point is:
- Long execution duration. If your workers run for minutes or hours before recycling, such as large rendering jobs or environment modeling, you'll get more efficient single-core performance per worker. This will lower your ratio.
- Short execution duration. If your workers cycle in seconds or a few minutes, such as web-app threads, the overhead involved in spinning up a new process means your ratio will be higher.
Solution 2:
If the second virtual core is allowed to contribute when the first would otherwise be stuck, that's better than nothing, so you get (at least) a little extra work done.
The question becomes: when does having two different threads cause one to run worse? Branch prediction and dependencies between instructions won't change. Waiting on memory access is another matter: the two threads compete for memory, both in cache capacity and in bandwidth.
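A crude way to see that competition is to compare a cache-friendly loop against a buffer scan as you add workers. This is a rough Python sketch; the buffer size, job counts, and 18C/36T core counts are assumptions, and pure-Python overhead will mute the effect compared to native code:

```python
import multiprocessing as mp
import time
from array import array

def cache_friendly(_):
    # Tight arithmetic loop: tiny working set, little memory traffic.
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

def memory_bound(_):
    # Scan a buffer much larger than a per-core cache slice (~32 MB of
    # doubles): throughput becomes dominated by memory bandwidth.
    buf = array("d", range(4_000_000))
    return sum(buf)

def timed(func, workers, jobs=64):
    start = time.perf_counter()
    with mp.Pool(workers) as pool:
        pool.map(func, range(jobs))
    return time.perf_counter() - start

if __name__ == "__main__":
    for workers in (18, 36):  # physical cores vs all logical cores (assumed)
        print(f"{workers} workers: compute {timed(cache_friendly, workers):.1f}s, "
              f"memory {timed(memory_bound, workers):.1f}s")
```

If the compute case keeps improving while the memory case flattens or regresses at 36 workers, you are watching the two hyperthreads fight over the memory system.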
If you have some CPUs running with HT and others not, does that also mean you will assign specific threads to one type or the other? I think not: your programs will run their threads on arbitrary virtual cores. So how does splitting the configuration help? Since each CPU has its own cache, the only effect is due to memory bandwidth and the burden of cache coherency.
In general, you reach a point where having something more you could be doing is more expensive than letting some CPU execution units go idle. This does not depend on the number of threads directly, but on what the threads are doing, and the detailed memory architecture and performance nuances of the various components.
There is no simple answer. Even with a specific program in mind, the machine may differ from those of people relating their own experiences.
You have to try it yourself and measure what is fastest, with that specific work on that exact machine. And even then, it may change with software updates and shifting usage over time.
Take a look at volume 3 of Agner Fog's magnum opus. If you look carefully at some specific processor, you can find limiting resources among the deep pipeline of many steps needed to execute code. You need to find a case where over-commitment causes it to execute slower, as opposed to simply not taking in more work. In general that would mean some kind of caching, where the resource is shared among threads.
What does the CPU meter mean? It reports all time that's not spent running the idle thread. Both logical threads assigned to a core will show as busy even though the actual work done on one of them may be small. Time spent with the pipeline stalled for a few cycles until results are ready, memory is fetched, or atomic operations are fenced likewise doesn't cause the thread to be shelved as "not ready", so it still shows as in-use. Waiting on RAM will not show as idle. Only something like I/O will make the thread block and stop charging time towards it. An operating-system mutex will generally do so too, but with the rise of multicore systems that's no longer a sure thing, as a "spinlock" will not make the thread go back on the shelf.
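A toy illustration of that last point, using Python threads (hypothetical names, and the GIL makes this an approximation of what native spinlocks do): the spinning thread pegs a logical core in the CPU meter while accomplishing nothing, whereas the blocking thread charges no time at all.

```python
import threading
import time

stop_spinning = threading.Event()

def spin_waiter():
    # Busy-wait: the meter charges every cycle of this loop as "in use",
    # even though no useful work is being done.
    while not stop_spinning.is_set():
        pass

def blocking_waiter():
    # Blocking wait: the OS shelves this thread, so it charges no CPU time
    # and shows as idle.
    stop_spinning.wait()

if __name__ == "__main__":
    threads = [threading.Thread(target=spin_waiter),
               threading.Thread(target=blocking_waiter)]
    for t in threads:
        t.start()
    time.sleep(5)   # watch the CPU meter here: one core pegged, zero progress
    stop_spinning.set()
    for t in threads:
        t.join()
```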
So, a CPU meter of 100% doesn't mean all is smooth sailing, if the CPU is often stuck waiting for memory. A smaller number of logical cores showing 90% could very well be getting more work done, having finished the number crunching and now waiting on the disk.
So don't worry about the CPU meter. Look at actual progress made, only.
Solution 3:
You should see all 36 cores running at 100%, assuming the software can drive them all (which is not trivial: scheduling gets tricky with that many cores, so dips below 100% are acceptable).
Obviously, when you "split" a core with hyperthreading, those 200% do not mean 2x100% in work done. But this is invisible to any measurement taken (which comes from CPU utilization and has no concept of work done). How much work gets done depends on what the work is; somewhere above 1.5x the work without hyperthreading is to be expected most of the time.
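As a worked example with made-up numbers: if 18 workers on the physical cores complete 100 jobs per minute and 36 workers with HT complete 155, the meter jumps from 1800% to 3600% while the real gain is only 1.55x.

```python
# Hypothetical throughput measurements, in jobs per minute.
jobs_without_ht = 100.0   # 18 workers, one per physical core
jobs_with_ht = 155.0      # 36 workers, one per logical core

speedup = jobs_with_ht / jobs_without_ht   # 1.55x real gain
work_per_logical = speedup / 2             # each "100%" logical core ~ 78%
print(f"speedup: {speedup:.2f}x, work per logical core: {work_per_logical:.0%}")
```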
Solution 4:
The way hyperthreading is implemented varies with the specific CPU microarchitecture. From Nehalem to Skylake, Intel significantly reduced the fixed-ratio (i.e. 50/50) shared parts of the pipeline, moving toward dynamically shared structures.
Anyway, in general terms, enabling HT leads to slightly slower single-thread execution, but due to how the Linux scheduler works, this only happens when the number of running threads is higher than the number of physical cores. Since in such situations (when threads > cores) you typically value total throughput above all else, hyperthreading remains a net win.
How is this possible? The key point to understand is that the CPU does not present the physical cores and the virtual ones as equal: rather, it exposes the latter in a manner that lets the Linux scheduler avoid scheduling on them if any other physical core is available. In other words, it first uses all the physical cores, then it begins to use the virtual ones.
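On Linux you can inspect this core/sibling topology directly through sysfs. A short sketch follows; the `thread_siblings_list` files are standard on SMT-aware kernels, but the "0,18" pairing in the comment is just an example of what an 18-core/36-thread part might report:

```python
import glob

# Collect the hyperthread sibling groups the kernel reports for each
# logical CPU; duplicates collapse because siblings report the same list.
siblings = set()
for path in sorted(glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list")):
    with open(path) as f:
        siblings.add(f.read().strip())

for group in sorted(siblings):
    print(group)   # e.g. "0,18": cpu0 and cpu18 share one physical core
```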
This means that, generally, hyperthreading is a very valuable feature (other processors, such as POWER8, use even deeper SMT techniques) and that to maximize throughput you should enable it, loading the CPU with at least one thread per virtual or physical core. For a practical example, to extract full performance from an 18-core CPU you should use at least 36 threads.
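In Python terms, that is as simple as sizing the worker pool from the logical CPU count; `some_work` below is a hypothetical placeholder for your real task:

```python
import multiprocessing as mp
import os

def some_work(x):
    # Placeholder for your actual per-task computation.
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs (36 on an 18-core HT part), so a
    # pool of that size puts one worker on every physical and virtual core.
    with mp.Pool(processes=os.cpu_count()) as pool:
        print(sum(pool.map(some_work, [100_000] * 144)))
```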
Two exceptions exist:
- if all you want is to minimize latency for a limited set of threads (where threads < physical cores), you can disable HT
- very old CPUs (Pentium 4 and, to a much smaller degree, Nehalem) have inflexible partitioning rules which force the CPU to split many key resources 50/50, independently of the second thread's status/load. In this case, you have to benchmark your use case to be sure that the added throughput is worth the significantly lower single-thread performance.