Can compilers generate self modifying code?
There's nothing preventing a compiler from implementing what you suggest but it's a rather heavyweight solution to a very minor performance problem.
To implement the self-modifying code, the compiler, for a typical C++ implementation running on Windows or Linux, would have to insert code that changes the permissions on the code page(s), modifies the code, and then restores the permissions. These operations could easily cost far more cycles than the implied "if" operation would take over the lifetime of the program.
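To make that cost concrete, here's a minimal sketch of the permission dance such compiler-inserted code would need on a POSIX system; patch_one_byte and its arguments are invented for illustration, and error handling is minimal:

#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: overwrite one byte of machine code at patch_site.
bool patch_one_byte(std::uint8_t *patch_site, std::uint8_t new_byte) {
    long pagesz = sysconf(_SC_PAGESIZE);
    void *page = reinterpret_cast<void *>(
        reinterpret_cast<std::uintptr_t>(patch_site) & ~std::uintptr_t(pagesz - 1));

    // 1. Make the code page writable (one system call per patch).
    if (mprotect(page, pagesz, PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        return false;
    // 2. Modify the code.
    *patch_site = new_byte;
    // 3. Restore the original read+execute permissions (another system call).
    return mprotect(page, pagesz, PROT_READ | PROT_EXEC) == 0;
}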
This would also have the consequence of preventing the modified code pages from being shared between processes. That may seem inconsequential, but compilers often pessimize their code (pretty badly in the case of i386) in order to implement position-independent code that can be loaded at different addresses at runtime without modifying the code and without preventing the sharing of code pages.
As Remy Lebeau and Nathan Oliver mention in comments, there are also thread-safety issues to consider, but they can probably be dealt with, as there are various solutions for hot patching executables like this.
Back in the olden days, the 8086 processor didn’t know anything about floating-point math. You could add a math coprocessor, the 8087, and write code that used it. Floating-point code consisted of “trap” instructions that transferred control to the 8087 to execute a floating-point operation.
Borland’s compiler could be set to generate floating-point code that detected at runtime whether there was a coprocessor installed. The first time each fp instruction was executed, it would jump to an internal routine that would backpatch the instruction, with an 8087 trap instruction (followed by a couple of NOPs) if there was a coprocessor, and a call to an appropriate library routine if there wasn’t. Then the internal routine would jump back to the patched instruction.
So, yes, it can be done. Sort of. As various comments have pointed out, modern architectures make this kind of thing hard or impossible.
Earlier versions of Windows had a system call that re-mapped memory segment selectors between data and code. If you called PrestoChangoSelector (yes, that was its name) with a data segment selector it would give you back a code segment selector that pointed at the same physical memory, and vice versa.
Yes, that would be legal. ISO C++ makes zero guarantees about being able to access data (machine code) through function pointers cast to unsigned char*. On most real implementations it's well defined, except on pure-Harvard machines where code and data have separate address spaces.
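As a sketch of what that looks like in practice (not sanctioned by ISO C++, and the cast from a function pointer to an object pointer is only conditionally supported, but it works on typical Linux/Windows implementations):

#include <cstdio>

int add(int a, int b) { return a + b; }

int main() {
    // Inspect the machine code of add() through a data pointer.
    const unsigned char *code = reinterpret_cast<const unsigned char *>(&add);
    for (int i = 0; i < 8; ++i)
        std::printf("%02x ", code[i]);   // dump the first few instruction bytes
    std::printf("\n");
}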
Hot-patching (usually by external tools) is a thing, and is very doable if compilers generate code to make that easy, i.e. the function starts with a long-enough instruction that can be atomically replaced.
As Ross points out, a major obstacle to self-modification on most C++ implementations is that they make programs for OSes that normally map executable pages read-only. W^X is an important security feature to avoid code injection. Only for very long-running programs with very hot code paths would it be worth it overall to make the necessary system calls to make the page read+write+exec temporarily, atomically modify an instruction, then flip it back.
And it's impossible on systems like OpenBSD that truly enforce W^X, not letting a process mprotect a page with both PROT_WRITE and PROT_EXEC. Making a page temporarily non-executable doesn't work if other threads can call the function at any moment.
It is commonly said that a static variable initialization is wrapped in an if to prevent it from being initialized multiple times.
Only for non-constant initializers, and of course only for static locals. A local like static int foo = 1; will compile the same as at global scope, to a .long 1 (GCC for x86, GAS syntax) with a label on it.
But yes, with a non-constant initializer, compilers will invent a guard variable they can test. They arrange things so the guard variable is read-only, not like a readers/writers lock, but that does still cost a couple extra instructions on the fast path.
e.g.
int init();
int foo() {
    static int counter = init();
    return ++counter;
}
compiled with GCC10.2 -O3 for x86-64
foo():             # with demangled symbol names
        movzx   eax, BYTE PTR guard variable for foo()::counter[rip]
        test    al, al
        je      .L16
        mov     eax, DWORD PTR foo()::counter[rip]
        add     eax, 1
        mov     DWORD PTR foo()::counter[rip], eax
        ret
.L16:              # slow path
        acquire lock, one thread does the init while the others wait
So the fast path check costs 2 uops on mainstream CPUs: one zero-extending byte load, and one macro-fused test-and-branch (test + je) that's not-taken. But yes, it has non-zero code size for both the L1i cache and the decoded-uop cache, and non-zero cost to issue through the front-end. And an extra byte of static data that has to stay hot in cache for good performance.
Normally inlining makes this negligible. If you're actually calling a function with this at the start often enough to matter, the rest of the call/ret overhead is a larger problem.
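For reference, a rough source-level picture of what the compiler emits for the guarded static looks something like the following. It's a sketch, not the exact library internals: the real guard goes through the Itanium C++ ABI's __cxa_guard_acquire / __cxa_guard_release, and the signatures here are only approximate.

int init();

extern "C" int __cxa_guard_acquire(long long *guard);    // approximate signatures
extern "C" void __cxa_guard_release(long long *guard);

static long long guard_for_counter;   // first byte acts as the "initialized" flag
static int counter;

int foo() {
    // Fast path: load one byte of the guard and branch, as in the asm above.
    // (This is an acquire load in general; on x86 a plain load is enough.)
    if (*reinterpret_cast<volatile unsigned char *>(&guard_for_counter) == 0) {
        // Slow path: take the guard lock; exactly one thread runs the initializer.
        if (__cxa_guard_acquire(&guard_for_counter)) {
            counter = init();
            __cxa_guard_release(&guard_for_counter);
        }
    }
    return ++counter;
}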
But things aren't so nice on ISAs without cheap acquire loads (e.g. ARM before ARMv8). Instead of somehow arranging to barrier() all threads once after initializing the static variable, every check of the guard variable is an acquire load. But on ARMv7 and earlier, that's done with a full memory barrier, dmb ish (data memory barrier: inner shareable), that includes draining the store buffer, exactly the same as for atomic_thread_fence(mo_seq_cst). (ARMv8 has ldar (word) / ldarb (byte) to do acquire loads, making them nice and cheap.)
Godbolt with ARMv7 clang
# ARM 32-bit clang 10.0 -O3 -mcpu=cortex-a15
# GCC output is even more verbose because of Cortex-A15 tuning choices.
foo():
        push    {r4, r5, r11, lr}
        add     r11, sp, #8
        ldr     r5, .LCPI0_0        @ load a PC-relative offset to the guard var
.LPC0_0:
        add     r5, pc, r5
        ldrb    r0, [r5, #4]        @ load the guard var
        dmb     ish                 @ full barrier, making it an acquire load
        tst     r0, #1
        beq     .LBB0_2             @ go to slow path if low bit of guard var == 0
.LBB0_1:
        ldr     r0, .LCPI0_1        @ PC-relative load of a PC-relative offset
.LPC0_1:
        ldr     r0, [pc, r0]        @ load counter
        add     r0, r0, #1          @ ++counter leaving value in return value reg
        str     r0, [r5]            @ store back to memory, IDK why a different addressing mode than the load. Probably a missed optimization.
        pop     {r4, r5, r11, pc}   @ return by popping saved LR into PC
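The expensive part is just the acquire load of the guard byte; in portable C++ terms it's roughly this (guard_byte is a stand-in name for the compiler's guard variable):

#include <atomic>

extern std::atomic<unsigned char> guard_byte;

bool already_initialized() {
    // Compiles to ldrb + dmb ish on ARMv7, but to a single acquire-load
    // instruction (lda/ldab on AArch32, ldar/ldarb on AArch64) on ARMv8.
    return guard_byte.load(std::memory_order_acquire) != 0;
}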
But just for fun, let's look at exactly how your idea could be implemented.
Assuming you can PROT_WRITE|PROT_EXEC (to use POSIX terminology) a page containing the code, it's not a hard problem to solve for most ISAs, such as x86.
Start the function with jmp rel32 or whatever to a "cold" section of code that does mutual exclusion to run the non-constant static initializer in one thread. (So if you do have multiple threads start to run it before one finishes and modifies the code, it all works the way it does now.)
Once construction is fully done, use an 8-byte atomic CAS or store to replace that 5-byte instruction with different instruction bytes. Possibly just a NOP, or possibly something useful that was done at the top of the "cold" code.
Or on a non-x86 ISA with fixed-width instructions, of a width the machine can store atomically, a single word store can replace one jump instruction.
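Putting those pieces together, a very hand-wavy x86-64 sketch of the patch step might look like the following, assuming the function entry is 8-byte aligned, the OS will grant a writable+executable mapping, and ignoring the cross-modifying-code rules that ask other cores to execute a serializing instruction before running the new bytes:

#include <cstdint>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Replace the 5-byte "jmp rel32" at an 8-byte-aligned function entry with a
// 5-byte NOP, using one aligned 8-byte store so other threads see either the
// old instruction or the new one, never a torn mix.
void patch_out_entry_jmp(void *func_entry) {
    long pagesz = sysconf(_SC_PAGESIZE);
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(func_entry);
    void *page = reinterpret_cast<void *>(addr & ~std::uintptr_t(pagesz - 1));

    mprotect(page, pagesz, PROT_READ | PROT_WRITE | PROT_EXEC);   // error checks omitted

    std::uint8_t bytes[8];
    std::memcpy(bytes, func_entry, sizeof(bytes));                // keep the 3 trailing bytes
    const std::uint8_t nop5[] = {0x0F, 0x1F, 0x44, 0x00, 0x00};   // 5-byte NOP encoding
    std::memcpy(bytes, nop5, sizeof(nop5));

    std::uint64_t word;
    std::memcpy(&word, bytes, sizeof(word));
    __atomic_store_n(reinterpret_cast<std::uint64_t *>(func_entry), word, __ATOMIC_RELEASE);

    mprotect(page, pagesz, PROT_READ | PROT_EXEC);                // flip back, if the OS allows it
}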