Instruction decoding when instructions are variable-length
I won't be able to answer how exactly they are decoded, but I can answer why they are of variable length.
The reasons for variable length are both the desire to keep code size small and the need to accommodate unforeseen instruction set extensions.
Reducing Instruction Size
Some instructions, by their nature, need a lot more space to encode. If all instructions were set at a fixed length large enough to accommodate these, there would be a lot of wasted space in the instruction stream. Variable-length instructions allow the common cases to be "compressed" down to a smaller size.
(Unforeseen) Instruction Set Extensions
The other reason is instruction set extensions. Originally, x86 had only 256 one-byte opcodes. When more instructions were needed, one instruction was thrown out and its opcode (0x0F) was reused as an escape byte that introduces a second opcode map. The result is that the newer instructions are longer, but it was the only way to extend the instruction set while maintaining backward compatibility.
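As a small illustration (assumed 32-bit mode encodings, shown as raw bytes), here is an instruction from the original one-byte opcode space next to a newer one reached through the 0x0F escape byte:

    /* Assumed 32-bit mode encodings, for illustration only. */
    unsigned char inc_eax[]   = { 0x40 };              /* inc eax: one-byte opcode from the original map */
    unsigned char movzx_eax[] = { 0x0F, 0xB6, 0xC3 };  /* movzx eax, bl: 0x0F escape + second opcode byte + ModRM */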
As for how the processor decodes these, it's a complicated process. For each instruction, the processor needs to determine the length before it knows where the next instruction starts. This makes decoding inherently sequential, which is a common performance bottleneck.
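A minimal sketch of that serial dependency, with a hypothetical insn_length() helper standing in for the real length-decoding logic: you cannot know where instruction N+1 starts until you have at least length-decoded instruction N.

    #include <stddef.h>

    size_t insn_length(const unsigned char *p);   /* hypothetical, ISA-specific helper */

    void decode_stream(const unsigned char *code, size_t size)
    {
        size_t pc = 0;
        while (pc < size) {
            size_t len = insn_length(code + pc);  /* must complete before ... */
            pc += len;                            /* ... the next start is known */
        }
    }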
Modern x86 processors have what is called a uop (micro-op) cache that caches the decoded instructions into something that's more manageable (and RISC-like) by the processor.
You have reinvented RISC
Hmm, the objection you are making to the classic x86 (see CISC) is exactly what motivated the designers of the RISC CPU architectures to create simple, aligned, fixed-size instruction set architectures.
It turns out that x86 these days does in fact translate the user-visible ISA to a more RISC-like micro-op stream that lives in an internal cache.
Good observation.
Notes.
1. Micro-ops are just one available technique. In the general case, as long as the decoding and alignment of instructions takes place in one or more pipeline stages, the actual time taken will not be added to the average instruction execution time. If branch prediction is working and the pipeline is kept full, the extra time it takes to decode and align the instructions is handled by logic executing in parallel with the actual instruction operations. With millions of gates available to today's designers they can dedicate a lot of logic to decoding the complex x86 ISA.
2. You mentioned the memory bus width; it turns out that the memory path is typically wider than 32 or 64 bits as well. The architectural word size simply refers to the ALU and pointer size; the actual width of memory and cache interfaces is often 2x or 4x the architectural word size.
EDIT: hoping to make it more readable.
The hardware does not look at memory as a long list of unorganized bytes. All processors, fixed or variable word length, have a specific boot method: usually a known address in the processor's memory/address space that holds either the address of the first instruction of the boot code or the first instruction itself. From there, and for every instruction after that, the address of the current instruction is where decoding starts.
For an x86, for example, it has to look at the first byte. Depending on the decoding of that byte, it may need to read more opcode bytes. If the instruction requires an address, offset, or some other form of immediate value, those bytes are there as well. Very quickly the processor knows exactly how many bytes are in this instruction. If the decoding shows the instruction contains 5 bytes and it started at address 0x10, then the next instruction is at 0x10+5, or 0x15. This continues forever. With unconditional branches, which depending on the processor can come in various flavors, you don't assume the bytes that follow the instruction are another instruction. Branches, conditional or unconditional, give you a clue where another instruction or series of instructions starts in memory.
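Here is a toy sketch of that walk, using a tiny hand-picked subset of 32-bit x86 opcodes (real x86 decoding also has to handle prefixes, ModRM/SIB bytes, and many more cases); it reproduces the 0x10 + 5 = 0x15 arithmetic above.

    #include <stdint.h>
    #include <stdio.h>

    /* Length of a few simple 32-bit x86 instructions, keyed by the first byte. */
    static int toy_x86_length(uint8_t op)
    {
        switch (op) {
        case 0x90: return 1;   /* nop */
        case 0xC3: return 1;   /* ret */
        case 0xE9: return 5;   /* jmp rel32: opcode + 4-byte displacement */
        case 0xB8: return 5;   /* mov eax, imm32: opcode + 4-byte immediate */
        default:   return -1;  /* not in this toy subset */
        }
    }

    int main(void)
    {
        /* mov eax, 1; nop; ret - laid out as the bytes would sit in memory */
        uint8_t code[] = { 0xB8, 0x01, 0x00, 0x00, 0x00, 0x90, 0xC3 };
        unsigned addr = 0x10;

        for (size_t i = 0; i < sizeof code; ) {
            int len = toy_x86_length(code[i]);
            if (len < 0)
                break;                        /* unknown byte: stop the walk */
            printf("instruction at 0x%02X is %d byte(s), next at 0x%02X\n",
                   addr, len, addr + len);
            addr += len;
            i += (size_t)len;
        }
        return 0;
    }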
Note that the x86 today definitely does not fetch one byte at a time when it decodes an instruction; sensibly sized reads occur, probably 64 bits at a time, and the processor pulls the bytes out of that as needed. When reading a single byte on a modern processor, the memory bus still performs a full-sized read and either presents all of those bits on the bus, where the memory controller pulls out only the bits it was after, or it may go so far as to keep that data. You will see some processors where you may have two 32-bit read instructions at back-to-back addresses but only one 64-bit read happens on the memory interface.
I highly recommend you write a disassembler and/or an emulator. For fixed-length instructions it is pretty easy: you just start at the beginning and decode as you go through memory. A fixed-word-length disassembler may help with learning how to decode instructions, which is part of this process, but it won't help your understanding of following variable-word-length instructions and how to separate them without getting out of alignment.
The MSP430 is a good choice for a first disassembler. There are GNU tools for assembly and C, etc. (and llvm for that matter). Start with assembler, then C, or take some pre-made binaries. The key is that you have to walk the code like the processor does: start with the reset vector and walk your way through. When you decode one instruction you know its length, and so you know where the next instruction is, until you hit an unconditional branch. Unless the programmer has intentionally left a trap to fool the disassembler, assume that all branches, conditional or unconditional, point at valid instructions. An afternoon or evening is all that is required to bang out the whole thing, or at least get the concept. You don't necessarily need to completely decode the instruction or make this a full-blown disassembler; you only need to decode enough to determine the length of the instruction, and to determine whether it is a branch and, if so, where it goes (see the sketch below). Since it is a 16-bit instruction set you can, if you choose, build a one-time table of all possible instruction bit combinations and their lengths; that may save some time. You still have to decode your way through branches.
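For example, a rough length-only decoder for the classic 16-bit MSP430 might look something like this. This is an assumption-laden sketch (classic MSP430 only, no 430X extension words; check the bit fields against the family user's guide before trusting it), but it shows how little decoding is needed just to find the length.

    #include <stdint.h>

    /* Rough sketch: length in bytes of the classic MSP430 instruction whose
       first (and possibly only) word is w. Not verified against every case. */
    static int msp430_length(uint16_t w)
    {
        int words = 1;

        if ((w & 0xE000) == 0x2000)      /* jump format (001x ....): always one word */
            return 2;

        if ((w & 0xFC00) == 0x1000) {    /* single-operand format: RRC, SWPB, RRA, SXT, PUSH, CALL, RETI */
            int mode = (w >> 4) & 3;
            int reg  = w & 0xF;
            if ((mode == 1 && reg != 3) ||   /* indexed/symbolic/absolute */
                (mode == 3 && reg == 0))     /* #immediate */
                words++;
            return words * 2;
        }

        /* double-operand format (MOV, ADD, ..., top nibble 0x4 - 0xF) */
        int as  = (w >> 4) & 3;          /* source addressing mode */
        int src = (w >> 8) & 0xF;        /* source register */
        int ad  = (w >> 7) & 1;          /* destination addressing mode */
        if ((as == 1 && src != 3) || (as == 3 && src == 0))
            words++;                     /* source needs an extra word */
        if (ad == 1)
            words++;                     /* destination needs an extra word */
        return words * 2;
    }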
Some folks might use recursion; instead I use a memory map showing me which bytes are the start of an instruction, which bytes/words are part of an instruction but not the first byte/word, and which bytes I have not decoded yet. I start by taking the interrupt and reset vectors and using those to mark starting points for instructions, and then go into a loop that decodes the instructions looking for more starting points. If a pass happens with no new starting points, then I have finished that phase. If at any point I find an instruction starting point that lands in the middle of another instruction, there is a problem that will require human intervention to solve. Disassembling old video game ROMs, for example, you are likely to see this: hand-written assembler. Compiler-generated instructions tend to be very clean and predictable. If I get through this with a clean memory map of instructions and what is left over (assume data), I can make one pass, knowing where the instructions are, and decode and print those out. What a disassembler for variable-word-length instruction sets can never do is find every instruction. If the code has, for example, a jump table or some other form of runtime-computed address for execution, you won't find all of those without actually executing the code.
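A minimal sketch of that memory-map loop, with hypothetical per-ISA hooks (decode_length(), branch_target(), is_unconditional()) standing in for the actual instruction decoding:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical per-ISA hooks; the real decoding work lives here.
       decode_length() is assumed to always return a positive byte count. */
    size_t decode_length(const uint8_t *p);                        /* bytes in the instruction at p */
    int    branch_target(const uint8_t *p, size_t pc, size_t *t);  /* 1 if it branches, target in *t */
    int    is_unconditional(const uint8_t *p);                     /* 1 if execution never falls through */

    enum { UNKNOWN = 0, START = 1, MIDDLE = 2 };

    /* Mark every byte as instruction start, instruction middle, or not yet
       decoded, starting from the reset/interrupt vectors and repeating until
       a pass discovers no new starting points. */
    void map_instructions(const uint8_t *rom, size_t size,
                          const size_t *vectors, size_t nvec, uint8_t *map)
    {
        memset(map, UNKNOWN, size);
        for (size_t i = 0; i < nvec; i++)
            map[vectors[i]] = START;

        int changed = 1;
        while (changed) {
            changed = 0;
            for (size_t pc = 0; pc < size; ) {
                if (map[pc] != START) { pc++; continue; }

                size_t len = decode_length(rom + pc);
                for (size_t k = 1; k < len && pc + k < size; k++)
                    map[pc + k] = MIDDLE;      /* a START landing here later means trouble */

                size_t target;
                if (branch_target(rom + pc, pc, &target) &&
                    target < size && map[target] == UNKNOWN) {
                    map[target] = START;       /* branch destination is a new starting point */
                    changed = 1;
                }
                if (!is_unconditional(rom + pc) &&
                    pc + len < size && map[pc + len] == UNKNOWN) {
                    map[pc + len] = START;     /* fall-through is a new starting point */
                    changed = 1;
                }
                pc += len;
            }
        }
    }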
There are a number of existing emulators and disassemblers out there if you want to try to follow along instead of writing your own; I have a few myself at http://github.com/dwelch67.
There are pros and cons to both variable and fixed word length. Fixed length certainly has advantages: easy to read, easy to decode, everything is nice and proper. But think about RAM, and the cache in particular: you can cram a lot more x86 instructions into the same cache than ARM instructions. On the other hand, ARM can decode much more easily, with far less logic and power; more bang for your buck. Historically both memory and logic were expensive, and fetching a byte at a time was how it worked. A single-byte opcode limited you to 256 instructions, so that expanded into some opcodes needing more bytes, not to mention the immediates and addresses that made it variable word length anyway. Keep backward compatibility for decades and you end up where you are now.
To add to all of this confusion, ARM, for example, now has a variable-word-length instruction set. Thumb had a single variable-length instruction, the branch (BL), but you could easily decode it as fixed length. Then they created Thumb-2, which really does resemble a variable-word-length instruction set. Also, many/most of the processors that support the 32-bit ARM instructions also support the 16-bit Thumb instructions, so even with an ARM processor you cannot simply align the data by words and decode as you go; you have to handle variable word length. What is worse, the ARM-to/from-Thumb transitions are decided by executing the code; you normally cannot simply disassemble and figure out ARM from Thumb. A mixed-mode, compiler-generated branch often involves loading a register with the address to branch to and then using a bx instruction to branch to it, so the disassembler would need to look at the bx, look backwards in execution order for the register used in the branch, hope it finds a load there, and hope that it is the .text segment it is loading from.
There are very good reasons to have a fixed instruction length, implementation simplicity being the most important. That's why many processors do indeed have a fixed instruction length, like RISC processors and many early computers.
CISC instruction sets like x86 are designed to be decoded sequentially (step by step) by microcode (you can think of microcode as a kind of interpreter for CISC instructions). That was the state of the art in the late 1970s and early 1980s, when x86 was designed.
Nowadays this is a problem, because microcode is dead. x86 instructions are now broken into smaller µ-ops, not unlike RISC instructions. But to do so, the x86 instructions have to be decoded first, and current CPUs decode up to four instructions each cycle. Because there is no time to decode instructions sequentially, one after another, this works simply by brute force: when a line is brought in from the instruction cache, many decoders decode it in parallel, one instruction decoder at each possible byte offset. After the decoding, the length of each instruction is known, and the processor decides which decoders actually provided valid instructions. This is wasteful, but very fast.
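A conceptual software model of that brute-force scheme (not how the hardware is actually built, and insn_length() here is a hypothetical helper assumed to return a positive length): decode speculatively at every byte offset of the fetched line, then keep only the decodes that chain from the known starting offset.

    enum { LINE_BYTES = 16 };

    int insn_length(const unsigned char *p);   /* hypothetical length decoder */

    void pick_valid(const unsigned char line[LINE_BYTES], int start,
                    int length[LINE_BYTES], int valid[LINE_BYTES])
    {
        /* In hardware, all of these "decoders" run at the same time. */
        for (int off = 0; off < LINE_BYTES; off++) {
            length[off] = insn_length(line + off);
            valid[off] = 0;
        }

        /* Only the offsets reachable from the known start were real instructions. */
        for (int off = start; off < LINE_BYTES; off += length[off])
            valid[off] = 1;
    }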
Variable instruction sizes introduce more headaches; e.g., an instruction can span two cache lines or even two pages in memory. So your observation is spot on: today nobody would design a CISC instruction set like x86. However, some RISCs have since introduced a second instruction size to get more compact code: MIPS16, ARM Thumb, etc.