Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record
There does not seem to be a straightforward runtime method to patch feature detection. This detection happens rather early in the dynamic linker (ld.so).
Binary patching the linker seems the easiest method at the moment. @osgx described one method where a jump is overwritten. Another approach is just to fake the cpuid result. Normally cpuid(eax=0) returns the highest supported function in eax while the manufacturer IDs are returned in registers ebx, ecx and edx. We have this snippet in glibc 2.25 sysdeps/x86/cpu-features.c:
__cpuid (0, cpu_features->max_cpuid, ebx, ecx, edx);
/* This spells out "GenuineIntel". */
if (ebx == 0x756e6547 && ecx == 0x6c65746e && edx == 0x49656e69)
{
/* feature detection for various Intel CPUs */
}
/* another case for AMD */
else
{
kind = arch_kind_other;
get_common_indeces (cpu_features, NULL, NULL, NULL, NULL);
}
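To see why those three constants spell out "GenuineIntel", here is a minimal sketch that reassembles the vendor string. The helper name cpuid_vendor is my own and not part of glibc; the only facts it relies on are that CPUID leaf 0 stores the string in EBX, EDX, ECX (in that order) and that each register holds four little-endian ASCII bytes.

```python
import struct

def cpuid_vendor(ebx, ecx, edx):
    """Reassemble the 12-byte vendor ID returned by CPUID leaf 0.

    The string is laid out across EBX, EDX, ECX (in that order), each
    register holding four little-endian ASCII bytes.
    """
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")

# Constants taken from the glibc comparison above.
print(cpuid_vendor(0x756e6547, 0x6c65746e, 0x49656e69))  # GenuineIntel
```

If the cpuid instruction never runs, the registers hold whatever was there before, so neither vendor comparison is expected to match.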
The __cpuid line translates to these instructions in /lib/ld-linux-x86-64.so.2 (/lib/ld-2.25.so):
172a8: 31 c0 xor eax,eax
172aa: c7 44 24 38 00 00 00 mov DWORD PTR [rsp+0x38],0x0
172b1: 00
172b2: c7 44 24 3c 00 00 00 mov DWORD PTR [rsp+0x3c],0x0
172b9: 00
172ba: 0f a2 cpuid
So rather than patching branches, we could just as well change the cpuid into a nop instruction, which would result in invocation of the last else branch (as the registers will not contain "GenuineIntel"). Since initially eax=0, cpu_features->max_cpuid will also be 0 and the if (cpu_features->max_cpuid >= 7) will also be bypassed.
Replacing cpuid(eax=0) with a nop can be done with this utility (works for both x86 and x86-64):
#!/usr/bin/env python
import re
import sys
infile, outfile = sys.argv[1:]
d = open(infile, 'rb').read()
# Match CPUID(eax=0), "xor eax,eax" followed closely by "cpuid"
o = re.sub(b'(\x31\xc0.{0,32}?)\x0f\xa2', b'\\1\x66\x90', d)
assert d != o
open(outfile, 'wb').write(o)
An equivalent Perl variant; -0777 ensures that the file is read at once instead of separating records at line feeds:
perl -0777 -pe 's/\x31\xc0.{0,32}?\K\x0f\xa2/\x66\x90/' < /lib/ld-linux-x86-64.so.2 > ld-linux-x86-64-patched.so.2
# Verify result, should display "Success"
cmp -s /lib/ld-linux-x86-64.so.2 ld-linux-x86-64-patched.so.2 && echo 'Not patched' || echo Success
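Before trusting a patched linker, it may help to see which offsets the pattern actually touches. The following is a small dry-run sketch using the same regular expression as the utility above. The function names find_patch_sites and patch are mine, and the sample bytes are synthetic, not taken from a real ld.so. One caveat: without re.DOTALL, "." does not match a 0x0a byte, so a site with 0x0a between the xor and the cpuid would be skipped by this pattern.

```python
import re

# Same pattern as the patch script: "xor eax,eax" (31 c0), at most 32
# other bytes, then "cpuid" (0f a2).
CPUID0 = re.compile(b'(\x31\xc0.{0,32}?)\x0f\xa2')

def find_patch_sites(data):
    """Return the offset of every cpuid opcode the pattern would replace."""
    return [m.end() - 2 for m in CPUID0.finditer(data)]

def patch(data):
    """Replace each matched cpuid with the two-byte nop 66 90."""
    return CPUID0.sub(b'\\1\x66\x90', data)

# Synthetic stand-in for linker code: xor eax,eax; nop; cpuid; ret
sample = b'\x31\xc0\x90\x0f\xa2\xc3'
print(find_patch_sites(sample))  # [3]
print(patch(sample).hex())       # 31c0906690c3
```

Running find_patch_sites on the contents of /lib/ld-linux-x86-64.so.2 and comparing the offsets against a disassembly is one way to check that only the intended cpuid is replaced. The replacement has the same length as the original opcode, so the file size and all other offsets stay unchanged.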
That was the easy part. Now, I did not want to replace the system-wide dynamic linker, but execute only one particular program with this linker. Sure, that can be done with ./ld-linux-x86-64-patched.so.2 ./a, but the naive gdb invocations failed to set breakpoints:
$ gdb -q -ex "set exec-wrapper ./ld-linux-x86-64-patched.so.2" -ex start ./a
Reading symbols from ./a...done.
Temporary breakpoint 1 at 0x400502: file a.c, line 5.
Starting program: /tmp/a
During startup program exited normally.
(gdb) quit
$ gdb -q -ex start --args ./ld-linux-x86-64-patched.so.2 ./a
Reading symbols from ./ld-linux-x86-64-patched.so.2...(no debugging symbols found)...done.
Function "main" not defined.
Temporary breakpoint 1 (main) pending.
Starting program: /tmp/ld-linux-x86-64-patched.so.2 ./a
[Inferior 1 (process 27418) exited normally]
(gdb) quit
A manual workaround is described in How to debug program with custom elf interpreter? It works, but it is unfortunately a manual action using add-symbol-file. It should be possible to automate it a bit using GDB Catchpoints though.
An alternative approach that does not require binary patching is LD_PRELOADing a library that defines custom routines for memcpy, memmove, etc. These will then take precedence over the glibc routines. The full list of functions is available in sysdeps/x86_64/multiarch/ifunc-impl-list.c. Current HEAD has more symbols compared to the glibc 2.25 release, in total (grep -Po 'IFUNC_IMPL \(i, name, \K[^,]+' sysdeps/x86_64/multiarch/ifunc-impl-list.c):
memchr, memcmp, __memmove_chk, memmove, memrchr, __memset_chk, memset, rawmemchr, strlen, strnlen, stpncpy, stpcpy, strcasecmp, strcasecmp_l, strcat, strchr, strchrnul, strrchr, strcmp, strcpy, strcspn, strncasecmp, strncasecmp_l, strncat, strncpy, strpbrk, strspn, strstr, wcschr, wcsrchr, wcscpy, wcslen, wcsnlen, wmemchr, wmemcmp, wmemset, __memcpy_chk, memcpy, __mempcpy_chk, mempcpy, strncmp, __wmemset_chk
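The grep above can also be reproduced in Python, which is handy if the list is going to feed a script that generates the preload stubs. This is only a sketch: ifunc_names is a name I made up, and the sample text is an abbreviated, illustrative excerpt rather than the real file contents.

```python
import re

def ifunc_names(source):
    """Return the function names declared as IFUNC_IMPL (i, name, <fn>, ..."""
    return re.findall(r'IFUNC_IMPL \(i, name, ([^,]+)', source)

# Abbreviated, illustrative excerpt in the style of ifunc-impl-list.c.
sample = """
  IFUNC_IMPL (i, name, strcmp,
              IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2))
  IFUNC_IMPL (i, name, memmove,
              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_sse2_unaligned))
"""
print(ifunc_names(sample))  # ['strcmp', 'memmove']
```

The IFUNC_IMPL_ADD lines are not picked up because the pattern requires a space and an opening parenthesis directly after IFUNC_IMPL.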
It looks like there is a nice workaround for this implemented in recent versions of glibc: a "tunables" feature that guides selection of optimized string functions. You can find a general overview of this feature here and the relevant code inside glibc in ifunc-impl-list.c.
Here's how I figured it out. First, I took the address being complained about by gdb:
Process record does not support instruction 0xc5 at address 0x7ffff75c65d4.
I then looked it up in the table of shared libraries:
(gdb) info shared
From To Syms Read Shared Object Library
0x00007ffff7fd3090 0x00007ffff7ff3130 Yes /lib64/ld-linux-x86-64.so.2
0x00007ffff76366b0 0x00007ffff766b52e Yes /usr/lib/x86_64-linux-gnu/libubsan.so.1
0x00007ffff746a320 0x00007ffff75d9cab Yes /lib/x86_64-linux-gnu/libc.so.6
...
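The same range check can be done mechanically. Below is a minimal sketch with the ranges copied from the table above; find_library is a hypothetical helper of mine, not a gdb API.

```python
def find_library(address, mappings):
    """Return the path whose [start, end] range contains address, else None."""
    for start, end, path in mappings:
        if start <= address <= end:
            return path
    return None

# (From, To, Shared Object Library) rows copied from "info shared".
mappings = [
    (0x00007ffff7fd3090, 0x00007ffff7ff3130, "/lib64/ld-linux-x86-64.so.2"),
    (0x00007ffff76366b0, 0x00007ffff766b52e, "/usr/lib/x86_64-linux-gnu/libubsan.so.1"),
    (0x00007ffff746a320, 0x00007ffff75d9cab, "/lib/x86_64-linux-gnu/libc.so.6"),
]
print(find_library(0x7ffff75c65d4, mappings))  # /lib/x86_64-linux-gnu/libc.so.6
```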
You can see that this address is within glibc. But what function, specifically?
(gdb) disassemble 0x7ffff75c65d4
Dump of assembler code for function __strcmp_avx2:
0x00007ffff75c65d0 <+0>: mov %edi,%eax
0x00007ffff75c65d2 <+2>: xor %edx,%edx
=> 0x00007ffff75c65d4 <+4>: vpxor %ymm7,%ymm7,%ymm7
I can look in ifunc-impl-list.c to find the code that controls selecting the avx2 version:
IFUNC_IMPL (i, name, strcmp,
IFUNC_IMPL_ADD (array, i, strcmp,
HAS_ARCH_FEATURE (AVX2_Usable),
__strcmp_avx2)
IFUNC_IMPL_ADD (array, i, strcmp, HAS_CPU_FEATURE (SSE4_2),
__strcmp_sse42)
IFUNC_IMPL_ADD (array, i, strcmp, HAS_CPU_FEATURE (SSSE3),
__strcmp_ssse3)
IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2_unaligned)
IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2))
It looks like AVX2_Usable is the feature to disable. Let's rerun gdb accordingly:
GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable gdb...
On this iteration it complained about __memmove_avx_unaligned_erms, which appeared to be enabled by AVX_Usable, but I found another path in ifunc-memmove.h enabled by AVX_Fast_Unaligned_Load. Back to the drawing board:
GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable,-AVX_Fast_Unaligned_Load gdb ...
On this final round I discovered a rdtscp instruction in the ASAN shared library, so I recompiled without the address sanitizer and at last, it worked.
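Since the list of masked features tends to grow one round at a time, a small helper to build the variable can save some typing. This is a sketch under my own naming (hwcaps_env is hypothetical); it only reproduces the glibc.cpu.hwcaps=-Feature,-Feature syntax used above and overwrites any GLIBC_TUNABLES value already present in the environment.

```python
import os

def hwcaps_env(disabled, base_env=None):
    """Return a copy of the environment with the given CPU features masked.

    Note: replaces any existing GLIBC_TUNABLES value instead of merging.
    """
    env = dict(os.environ if base_env is None else base_env)
    mask = ",".join("-" + name for name in disabled)
    env["GLIBC_TUNABLES"] = "glibc.cpu.hwcaps=" + mask
    return env

env = hwcaps_env(["AVX2_Usable", "AVX_Fast_Unaligned_Load"], base_env={})
print(env["GLIBC_TUNABLES"])
# glibc.cpu.hwcaps=-AVX2_Usable,-AVX_Fast_Unaligned_Load
```

The resulting dictionary can be passed as the env argument of subprocess.run when launching gdb. The feature names are the ones I ran into here; which names a given glibc build accepts depends on its version.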
In summary: with some work it's possible to disable these instructions from the command line and use gdb's record feature without severe hacks.
I encountered this problem recently as well, and ended up solving it using dynamic CPUID faulting to interrupt execution of the CPUID instruction and override its result, which avoids touching glibc or the dynamic linker. This requires processor support for CPUID faulting (Ivy Bridge+) as well as Linux kernel support (4.12+) for exposing it to userspace through the ARCH_GET_CPUID and ARCH_SET_CPUID subfunctions of arch_prctl(). When this feature is enabled, a SIGSEGV signal will be delivered on each execution of CPUID, allowing a signal handler to emulate execution of the instruction and override the result.
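As a quick check of whether the kernel exposes this interface at all, the current state can be queried without changing anything. This is a read-only sketch and not how the full solution works. It assumes 64-bit x86-64 Linux, where arch_prctl is syscall number 158 and ARCH_GET_CPUID is 0x1011 in the kernel's asm/prctl.h; the function name cpuid_faulting_enabled is mine.

```python
import ctypes
import platform
import sys

SYS_arch_prctl = 158     # x86-64 syscall number (assumed: 64-bit process)
ARCH_GET_CPUID = 0x1011  # from the kernel's asm/prctl.h

def cpuid_faulting_enabled():
    """Return True if CPUID faults for this thread, False if it executes
    normally, or None if the state cannot be queried on this system."""
    if (not sys.platform.startswith("linux")
            or platform.machine() != "x86_64"
            or ctypes.sizeof(ctypes.c_void_p) != 8):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.syscall(ctypes.c_long(SYS_arch_prctl),
                       ctypes.c_long(ARCH_GET_CPUID),
                       ctypes.c_long(0))
    if ret < 0:  # e.g. EINVAL on kernels older than 4.12
        return None
    # ARCH_GET_CPUID returns 1 while the cpuid instruction is enabled.
    return ret == 0

print(cpuid_faulting_enabled())
```

Turning faulting on is done with ARCH_SET_CPUID and an argument of 0. That is better left to native code that installs the SIGSEGV handler first, since any CPUID executed afterwards would otherwise kill the process.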
The full solution is a bit involved since I also needed to interpose the dynamic linker, because hardware capability detection was moved there starting with glibc 2.26. I've uploaded the full solution online at https://github.com/ddcc/libcpuidoverride .