How CPUs implement Instructions like MUL/MULT?
This page shows the logic gates for a 4 * 4 combinational multipler. You can work up from there.
Here is somebody's lab where they describe building a 16 bit multiplier from 4 4 bit multipliers, each built with AND gates and full adders. Full design, chip layout, and simulation waveforms.
http://en.wikipedia.org/wiki/Multiplication_ALU on Wikipedia lists different methods for doing multiplication in a digital circuit.
When I worked on a project to add SIMD instructions to an DEC Alpha-like processor in Verilog back in college, we implemented a Wallace tree multiplier, the primary reason being it ran in a fixed number of cycles and was easy to pipeline.
Apparently, Dadda multipliers are (nearly?) universally used in real life CPU ALUs, including modern x86. Like Wallace multipliers, it can also be pipelined with fixed latency.
EDIT: You mentioned using the other bit fiddling instructions, on modern processors multiplication would not be microcoded like this; it'd be way to slow and the processor would get slaughtered in benchmarks.