Which CPUs support the MOVBE instruction?
This instruction was originally unique to the Intel® Atom™ processor.
From Intel's documentation:
The Intel® Compilers 11.0 allow you to target the Intel® Atom™ processor using the /QxSSE3_ATOM or -xSSE3_ATOM compiler options. These options enable the generation of the movbe instruction which is unique to the Intel® Atom™ processor.
In other microarchitectures (http://instlatx64.atw.hu/ with uop info from https://agner.org/optimize/):
- Mainstream Intel: Haswell and later. Including Haswell Xeon (Ex-xxxx v3).
Decodes as 2 or 3 uops, about the same as bswap + load or store.
- Mainstream AMD: Excavator, and Ryzen-family. Steamroller and earlier don't have it.
Decodes efficiently to a single uop.
Non-mainstream CPUs:
- Legacy in-order Intel Atom: all
- Intel Silvermont-family out-of-order Atom: all. Decodes efficiently to a single uop.
- AMD Jaguar. Decodes efficiently to a single uop.
- Intel Xeon Phi: Knights Landing (based on Silvermont) and later. (Maybe not on Knights Corner.)
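To make the "bswap + load or store" comparison in the list above concrete, here is a minimal C sketch of what MOVBE does: a big-endian load and a big-endian store. The names `load_be32` and `store_be32` are hypothetical helpers, not from any library. The code is written with shifts so it is correct on any host. Whether a compiler turns it into a single `movbe` depends on the compiler and target flags (for example GCC's `-mmovbe`), so treat that as a possibility rather than a guarantee.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit big-endian value from memory.
   On x86 this is semantically a load followed by a byte swap,
   which is the operation the load form of MOVBE performs. */
static uint32_t load_be32(const void *p)
{
    const unsigned char *b = (const unsigned char *)p;
    return ((uint32_t)b[0] << 24) |
           ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |
            (uint32_t)b[3];
}

/* Hypothetical helper: write a 32-bit value to memory in big-endian order.
   On x86 this is semantically a byte swap followed by a store,
   which is the operation the store form of MOVBE performs. */
static void store_be32(void *p, uint32_t v)
{
    unsigned char *b = (unsigned char *)p;
    b[0] = (unsigned char)(v >> 24);
    b[1] = (unsigned char)(v >> 16);
    b[2] = (unsigned char)(v >> 8);
    b[3] = (unsigned char)v;
}
```

Calling `load_be32` on the bytes `12 34 56 78` yields `0x12345678` regardless of host endianness.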
It appears that all Atom processors support MOVBE; at any rate, the first and least capable (the Atom 230) does. (See e.g. http://www.linuxquestions.org/questions/linux-hardware-18/proc-cpuinfo-output-816192/ for evidence.) I don't believe any non-Atom Intel processors support MOVBE; at any rate, recent Core i7 processors appear not to (see e.g. http://www.techsupportforum.com/forums/f108/i7-running-on-3-of-8-threads-522063.html and search for "movbe" for evidence).
You can detect MOVBE support at runtime using CPUID: the feature flag is bit 22 of ECX for leaf 1 (EAX = 1).
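A minimal sketch of that runtime check, assuming GCC or Clang and their `<cpuid.h>` helper `__get_cpuid` (MSVC would use `__cpuid` from `<intrin.h>` instead). The function names `ecx_has_movbe` and `cpu_has_movbe` are hypothetical. MOVBE is reported in CPUID leaf 1, ECX bit 22. The bit test is kept in a separate pure function so it can be checked without depending on the machine it runs on.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang only: provides __get_cpuid */
#endif

/* CPUID leaf 1 reports MOVBE support in ECX bit 22. */
#define MOVBE_ECX_BIT 22u

/* Hypothetical helper: decode the MOVBE flag from a leaf-1 ECX value. */
static bool ecx_has_movbe(uint32_t ecx)
{
    return ((ecx >> MOVBE_ECX_BIT) & 1u) != 0;
}

/* Hypothetical helper: query the CPU this code is running on.
   Returns false on non-x86 builds or if CPUID leaf 1 is unavailable. */
static bool cpu_has_movbe(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ecx_has_movbe(ecx);
#else
    return false;
#endif
}
```

A program would typically call `cpu_has_movbe()` once at startup and select a MOVBE or a plain bswap code path accordingly. On Linux the same flag appears as `movbe` in the `flags` line of `/proc/cpuinfo`.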