How to flush the CPU cache for a region of address space in Linux?
This is for ARM.
GCC provides __builtin___clear_cache, which on ARM Linux should end up making the cacheflush syscall. However, it may have its caveats.
The important thing here is that Linux provides an ARM-specific system call to flush caches. You can check Android/Bionic flushcache to see how this system call is used. However, I'm not sure what guarantees Linux gives when you call it, or how it is implemented internally.
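A minimal sketch of both routes, under some assumptions: flush_code_range is a hypothetical helper name, and the raw syscall branch is only compiled on 32-bit ARM Linux where the headers define __ARM_NR_cacheflush. Everywhere else it falls back to __builtin___clear_cache. As far as I know, both are intended for keeping the instruction and data caches coherent after writing code, not for forcing data out to RAM.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for the syscall() declaration in glibc */
#endif
#include <assert.h>
#include <stddef.h>

#if defined(__linux__) && defined(__arm__)
#include <sys/syscall.h>       /* pulls in __ARM_NR_cacheflush on ARM */
#include <unistd.h>
#endif

/* Hypothetical helper: make [start, start + len) coherent between the
 * data and instruction caches. Returns 0 on success, -1 on failure. */
int flush_code_range(void *start, size_t len)
{
    char *begin = (char *)start;
    char *end = begin + len;

    assert(start != NULL);

#if defined(__linux__) && defined(__arm__) && defined(__ARM_NR_cacheflush)
    /* ARM-specific syscall; the third (flags) argument must be 0. */
    return (int)syscall(__ARM_NR_cacheflush, (long)begin, (long)end, 0L);
#else
    /* Compiler builtin; may expand to nothing on architectures with
     * coherent instruction caches, such as x86. */
    __builtin___clear_cache(begin, end);
    return 0;
#endif
}
```

Typical use would be calling flush_code_range(buf, len) right after writing generated machine code into buf and before jumping to it.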
This blog post Caches and Self-Modifying Code may help further.
Check this page for the list of flushing methods available in the Linux kernel: https://www.kernel.org/doc/Documentation/cachetlb.txt
Cache and TLB Flushing Under Linux. David S. Miller
There is a set of range-flushing functions:
2) flush_cache_range(vma, start, end);
change_range_of_page_tables(mm, start, end);
flush_tlb_range(vma, start, end);
3) void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
Here we are flushing a specific range of (user) virtual
addresses from the cache. After running, there will be no
entries in the cache for 'vma->vm_mm' for virtual addresses in
the range 'start' to 'end-1'.
You can also check the implementation of the function: http://lxr.free-electrons.com/ident?a=sh;i=flush_cache_range
For example, on ARM: http://lxr.free-electrons.com/source/arch/arm/mm/flush.c?a=sh&v=3.13#L67
    void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
    {
        if (cache_is_vivt()) {
            vivt_flush_cache_range(vma, start, end);
            return;
        }

        if (cache_is_vipt_aliasing()) {
            asm("mcr p15, 0, %0, c7, c14, 0\n"
                "    mcr p15, 0, %0, c7, c10, 4"
                :
                : "r" (0)
                : "cc");
        }

        if (vma->vm_flags & VM_EXEC)
            __flush_icache_all();
    }
In the x86 version of Linux you can also find the function void clflush_cache_range(void *vaddr, unsigned int size), which is used to flush a cache range. This function relies on the CLFLUSH or CLFLUSHOPT instructions. I would recommend checking that your processor actually supports them, because in theory they are optional.
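One way to do that check from user space is to query CPUID. The sketch below assumes GCC or Clang (it uses their <cpuid.h> helpers); the struct and function names are made up for illustration. The bit positions are the ones documented for CPUID: leaf 1 EDX bit 19 for CLFLUSH, leaf 7 (subleaf 0) EBX bit 23 for CLFLUSHOPT and bit 24 for CLWB. On non-x86 targets it simply reports everything as unavailable.

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>             /* GCC/Clang CPUID helpers */
#define ON_X86 1
#else
#define ON_X86 0
#endif

/* Hypothetical result type: which flush instructions the CPU reports. */
struct flush_features {
    int clflush;               /* CPUID.01H:EDX[19] */
    int clflushopt;            /* CPUID.(EAX=07H,ECX=0):EBX[23] */
    int clwb;                  /* CPUID.(EAX=07H,ECX=0):EBX[24] */
    unsigned line_size;        /* CLFLUSH line size in bytes, 0 if unknown */
};

struct flush_features query_flush_features(void)
{
    struct flush_features f = {0, 0, 0, 0};
#if ON_X86
    unsigned eax, ebx, ecx, edx;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.clflush = (int)((edx >> 19) & 1u);
        if (f.clflush)
            f.line_size = ((ebx >> 8) & 0xffu) * 8u; /* EBX[15:8] is in 8-byte units */
    }
    if (__get_cpuid_max(0, 0) >= 7) {
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        f.clflushopt = (int)((ebx >> 23) & 1u);
        f.clwb = (int)((ebx >> 24) & 1u);
    }
#endif
    /* If CLFLUSH is reported, a non-zero line size is expected. */
    assert(!f.clflush || f.line_size > 0);
    return f;
}
```

A caller would then pick CLFLUSHOPT when f.clflushopt is set, fall back to CLFLUSH when only f.clflush is set, and skip explicit flushing otherwise.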
CLFLUSHOPT is weakly ordered. CLFLUSH was originally specified as ordered only by MFENCE, but all CPUs that implement it do so with strong ordering with respect to writes and other CLFLUSH instructions. Intel decided to add a new instruction (CLFLUSHOPT) instead of changing the behaviour of CLFLUSH, and to update the manual to guarantee that future CPUs will implement CLFLUSH as strongly ordered. For this use, you should MFENCE after using either, to make sure that the flushing is done before any loads from your benchmark (not just stores).
Actually, x86 provides one more instruction that could be useful: CLWB. CLWB writes data back from cache to memory without (necessarily) evicting it, leaving it clean but still cached. On SKX, though, clwb does evict, like clflushopt.
Note also that these instructions are cache coherent. Their execution will affect all caches of all processors (processor cores) in the system.
All three of these instructions are available in user mode. Thus, you can use assembler (or intrinsics like _mm_clflushopt) to create your own void clflush_cache_range(void *vaddr, unsigned int size) in your user-space application (but do not forget to check their availability before actual use).
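A rough user-space sketch of such a function follows. Several things here are assumptions, not taken from the kernel source: the name user_clflush_range is hypothetical, the 64-byte line size is hard-coded (ideally it would come from CPUID), and it uses _mm_clflush rather than _mm_clflushopt so that it builds without extra compiler flags (_mm_clflushopt needs -mclflushopt with GCC). It fences before and after the loop so the flushes are ordered against surrounding loads and stores. On non-x86 targets it just returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE2__))
#include <emmintrin.h>         /* _mm_clflush, _mm_mfence (SSE2) */
#define HAVE_X86_CLFLUSH 1
#else
#define HAVE_X86_CLFLUSH 0
#endif

/* Assumption: 64-byte cache lines, true for most current x86 parts. */
#define ASSUMED_LINE_SIZE 64u

/* Hypothetical helper: flush every cache line overlapping
 * [vaddr, vaddr + size). Returns 0 on success, -1 if unsupported. */
int user_clflush_range(const void *vaddr, size_t size)
{
    assert(vaddr != NULL || size == 0);
#if HAVE_X86_CLFLUSH
    if (size == 0)
        return 0;

    /* Round the start down to a line boundary so partial lines are covered. */
    uintptr_t p = (uintptr_t)vaddr & ~(uintptr_t)(ASSUMED_LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)vaddr + size;

    _mm_mfence();              /* order earlier stores before the flushes */
    for (; p < end; p += ASSUMED_LINE_SIZE)
        _mm_clflush((const void *)p);
    _mm_mfence();              /* make sure flushes finish before later loads */
    return 0;
#else
    (void)vaddr;
    (void)size;
    return -1;
#endif
}
```

Flushing writes dirty lines back to memory before invalidating them, so the buffer contents are unchanged afterwards; only the next access gets slower.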
If I understand correctly, it is much more difficult to reason about ARM in this regard. The ARM processor family is much less consistent than the IA-32 family. You can have one ARM with full-featured caches, and another one completely without caches. Furthermore, manufacturers can use customized MMUs and MPUs. So it is better to reason about one particular ARM processor model.
Unfortunately, it looks almost impossible to make any reasonable estimate of the time required to flush some data. This time is affected by too many factors: the number of cache lines flushed, out-of-order execution of instructions, the state of the TLB (because the instruction takes a virtual address as an argument, but caches use physical addresses), the number of CPUs in the system, the actual memory load on the other processors in the system, how many lines from the range are actually cached by processors, and finally the performance of the CPU, memory, memory controller and memory bus. As a result, I think execution time will vary significantly across environments and loads. The only reasonable way is to measure the flush time on a system, and under a load, similar to the target.
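A small timing harness along these lines might look as follows. It is only a sketch: all names are hypothetical, example_flush is a placeholder built on __builtin___clear_cache that you would replace with the flush routine you actually care about, and it assumes a POSIX system with clock_gettime and CLOCK_MONOTONIC. It dirties the buffer before each run and records both the fastest and slowest run, since the spread is usually the interesting part.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*flush_fn)(void *buf, size_t len);

struct flush_timing {
    long long min_ns;             /* fastest run, -1 if no runs were made */
    long long max_ns;             /* slowest run, -1 if no runs were made */
};

/* Placeholder flush; substitute the routine being measured. */
static void example_flush(void *buf, size_t len)
{
    __builtin___clear_cache((char *)buf, (char *)buf + len);
}

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Time `runs` invocations of fn over buf, dirtying the buffer first
 * each time so there is something in the cache to flush. */
struct flush_timing time_flush(flush_fn fn, void *buf, size_t len, int runs)
{
    struct flush_timing t = { -1, -1 };
    volatile unsigned char *p = (volatile unsigned char *)buf;

    assert(fn != NULL && buf != NULL);

    for (int r = 0; r < runs; r++) {
        for (size_t i = 0; i < len; i++)
            p[i] = (unsigned char)(i + (size_t)r);   /* touch every byte */

        long long t0 = now_ns();
        fn(buf, len);
        long long dt = now_ns() - t0;

        if (t.min_ns < 0 || dt < t.min_ns) t.min_ns = dt;
        if (dt > t.max_ns) t.max_ns = dt;
    }
    return t;
}
```

Run it on the target machine with representative background load; numbers taken on an idle development box are unlikely to carry over.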
As a final note, do not confuse memory caches with the TLB. They are both caches, but they are organized in different ways and serve different purposes. The TLB caches only the most recently used translations between virtual and physical addresses, not the data those addresses point to.
Also, the TLB is not coherent, in contrast to memory caches. Be careful: flushing TLB entries does not flush the corresponding data from the memory cache.