How to change the length of time-slices used by the Linux CPU scheduler?
For most RHEL 7 servers, Red Hat suggests increasing sched_min_granularity_ns to 10 ms and sched_wakeup_granularity_ns to 15 ms. (Source. Technically that link says 10 μs, which would be 1000 times smaller; it is a mistake.)
We can try to understand this suggestion in more detail.
Increasing sched_min_granularity_ns
On current Linux kernels, CPU time slices are allocated to tasks by CFS, the Completely Fair Scheduler. CFS can be tuned using a few sysctl
settings.
kernel.sched_min_granularity_ns
kernel.sched_latency_ns
kernel.sched_wakeup_granularity_ns
You can set sysctls temporarily, until the next reboot, or permanently in a configuration file which is applied on each boot. To learn how to apply this type of setting, look up "sysctl" or read the short introduction here.
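As a rough sketch, the two methods might look like this (the file name `90-sched-tuning.conf` is just an example; the values are the 10 ms / 15 ms suggestion above, converted to ns; note that on newer kernels, around 5.13 and later, these knobs were moved out of sysctl into debugfs, so the commands below assume an older kernel such as RHEL 7's):

```shell
# Temporary: takes effect immediately, lost at reboot (requires root)
sysctl -w kernel.sched_min_granularity_ns=10000000
sysctl -w kernel.sched_wakeup_granularity_ns=15000000

# Permanent: drop a file under /etc/sysctl.d/, applied on each boot
cat <<'EOF' > /etc/sysctl.d/90-sched-tuning.conf
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
EOF
sysctl --system    # re-apply all sysctl configuration files now
```

You can check the current values at any time with `sysctl kernel.sched_min_granularity_ns` (no root needed for reading).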
sched_min_granularity_ns is the most prominent setting. In the original sched-design-CFS.txt, this was described as the only "tunable" setting, "to tune the scheduler from 'desktop' (low latencies) to 'server' (good batching) workloads."
In other words, we can change this setting to reduce overheads from context-switching, and therefore improve throughput at the cost of responsiveness ("latency").
I think of this CFS setting as mimicking the earlier build-time setting, CONFIG_HZ. In the first version of the CFS code, the default value was 1 ms, equivalent to 1000 Hz, for "desktop" usage. Other supported values of CONFIG_HZ were 250 Hz (the default) and 100 Hz for the "server" end. 100 Hz was also useful when running Linux on very slow CPUs; this was one of the reasons given when CONFIG_HZ was first added as a build setting on X86.
It sounds reasonable to try raising this value to 10 ms (i.e. 100 Hz) and measuring the results. Remember the sysctls are measured in ns: 1 ms = 1,000,000 ns.
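The ms-to-ns conversion is an easy place to slip up by a factor of 1000 (as the mistaken "10 μs" in the Red Hat link shows), so it is worth sanity-checking the arithmetic before writing the value:

```shell
# Convert milliseconds to nanoseconds: 1 ms = 1,000,000 ns
ms=10
ns=$((ms * 1000 * 1000))
echo "$ns"    # 10000000 -- the value to write into the sysctl
```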
We can see this old-school 'server' tuning was still very relevant in 2011, for throughput in some high-load benchmark tests: https://events.static.linuxfound.org/slides/2011/linuxcon/lcna2011_rajan.pdf
And perhaps a couple of other settings
The default values of the three settings above look relatively close to each other. That makes me want to keep things simple and multiply them all by the same factor :-). But when I looked into this, it seemed some more specific tuning might also be relevant, since you are tuning for throughput.
sched_wakeup_granularity_ns concerns "wake-up pre-emption", i.e. it controls when a task woken by an event is able to immediately pre-empt the currently running process. The 2011 slides showed performance differences for this setting as well.
See also "Disable WAKEUP_PREEMPT" in this 2010 reference by IBM, which suggests that "for some workloads" this default-on feature "can cost a few percent of CPU utilization".
SUSE Linux has a doc suggesting that setting this to more than half of sched_latency_ns will effectively disable wake-up pre-emption, at which point "short duty cycle tasks will be unable to compete with CPU hogs effectively".
The SUSE document also gives more detailed descriptions of the other settings. You should definitely check what the current default values are on your own systems, though; for example, the defaults on my system seem slightly different from what the SUSE doc says.
https://www.suse.com/documentation/opensuse121/book_tuning/data/sec_tuning_taskscheduler_cfs.html
If you experiment with any of these scheduling variables, I think you should also be aware that all three are scaled (multiplied) by 1 + log_2 of the number of CPUs. This scaling can be disabled using kernel.sched_tunable_scaling. I could be missing something, but this seems surprising, e.g. if you are considering the responsiveness of servers providing interactive apps and running at or near full load, and how that responsiveness will vary with the number of CPUs per server.
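To make the scaling concrete, here is a small sketch of the "1 + log_2 of the number of CPUs" factor (using integer log base 2, and assuming kernel.sched_tunable_scaling is at its logarithmic default; the exact rounding inside the kernel may differ slightly, so treat this as an approximation):

```shell
# Approximate the CFS tunable scaling factor: 1 + floor(log2(ncpus))
ncpus=8                  # e.g. an 8-CPU machine
base_ns=10000000         # e.g. a 10 ms sched_min_granularity_ns

factor=1
n=$ncpus
while [ "$n" -gt 1 ]; do
    n=$((n / 2))
    factor=$((factor + 1))
done

echo "scaling factor: $factor"                # 4 for 8 CPUs
echo "effective ns:   $((base_ns * factor))"  # 40000000, i.e. 40 ms
```

So on an 8-CPU box, a nominal 10 ms granularity would behave more like 40 ms, which is why the per-CPU-count variation mentioned above can matter.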
Suggestion if your workload has large numbers of threads / processes
I also came across a 2013 suggestion for a couple of other settings, which may gain significant throughput if your workload has large numbers of threads or processes. (Or perhaps more accurately, it re-gains the throughput that such workloads had on pre-CFS kernels.)
- "Two Necessary Kernel Tweaks" - discussion on PostgreSQL mailing list.
- "Please increase kernel.sched_migration_cost in virtual-host profile" - Red Hat Bug 969491.
Ignore CONFIG_HZ
I think you don't need to worry about what CONFIG_HZ is set to. My understanding is that it is not relevant on current kernels, assuming you have reasonable timer hardware. See also commit 8f4d37ec073c, "sched: high-res preemption tick", found via this comment in a thread about the change: https://lwn.net/Articles/549754/ .
(If you look at the commit, I wouldn't worry that SCHED_HRTICK depends on X86. That requirement seems to have been dropped in a more recent commit.)