Why does every address in a micro-controller refer to only 8 bits of data?
There are a few DSPs (e.g., TI C54x) that cannot address values smaller than 16 bits, and some audio DSPs use 24 bits. However, 8-bit values are used in pretty much all general-purpose code, so all general-purpose CPUs support it.
And just because the smallest unit used for memory addresses is the 8-bit byte does not mean that this is the largest unit actually used on the bus. Most CPUs address memory in their native word size (16/32 bits) or even larger, and when performing a byte access they automatically extract the byte from the larger word.
For example, the PCI bus always uses 32-bit transactions, but has byte enable signals for accesses that must be smaller.
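As a hypothetical sketch of the extraction step described above (not any specific CPU or bus), here is how a byte read from a byte address can be implemented as an aligned 32-bit word fetch followed by a shift and mask, on a little-endian system; the `read_byte` helper and its parameters are illustrative names:

```c
#include <stdint.h>

/* Sketch: extracting one byte from a 32-bit bus word (little-endian).
 * The full aligned word is fetched, then the byte lane selected by
 * the low two address bits is shifted down and truncated. */
uint8_t read_byte(const uint32_t *memory, uint32_t byte_addr)
{
    uint32_t word = memory[byte_addr >> 2];   /* aligned 32-bit fetch   */
    uint32_t lane = byte_addr & 0x3u;         /* which of the 4 bytes   */
    return (uint8_t)(word >> (lane * 8));     /* shift down, keep 8 bits */
}
```

This mirrors what the hardware does transparently: software sees a byte address, the bus sees a word transaction plus a lane select.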
A 16-bit or 32-bit microcontroller often needs to manipulate data that is only 8 bits wide (a byte). For example, text strings are usually stored with a single character per byte. A memory addressing scheme that allows each individual byte to be addressed lets the microcontroller process 8-bit data efficiently. In practice, 32-bit data usually resides at addresses that are multiples of 4 bytes, e.g. 0x04, 0x08, 0x0C, etc., and if the memory is 32 bits wide the microcontroller can read all 32 bits in one read cycle. Micros often have machine instructions that operate on different data widths, so you will find that a move instruction (MOV) can have three forms — MOV.B, MOV.W and MOV.L — to move 8, 16 and 32 bits of data in one instruction.
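A minimal sketch of the mixed-width access pattern described above, in portable C; the helper names are illustrative, and `memcpy` is used so the compiler can lower the aligned 32-bit store to a single word move (the MOV.L-style form) while the byte load becomes a MOV.B-style access:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: a byte-addressed buffer accessed at two different widths.
 * On an aligned offset, the memcpy typically compiles to one 32-bit
 * store; the byte read compiles to one 8-bit load. */
void store32(uint8_t *buf, size_t byte_offset, uint32_t value)
{
    memcpy(buf + byte_offset, &value, sizeof value);  /* 32-bit store */
}

uint8_t load8(const uint8_t *buf, size_t byte_offset)
{
    return buf[byte_offset];                          /* 8-bit load */
}
```

Using `memcpy` rather than pointer casts avoids alignment and aliasing pitfalls while still expressing "one wide access" to the compiler.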
This is effectively a design choice; there is no hard reason why it has to be so. Back in the old days, when high-volume commodity processors operated on 8-bit values, the mapping was more consistently 1:1. As designs evolved to modern 32- and 64-bit processors, it made sense for consistency to keep the older byte-addressing scheme even though the data buses grew wider (with a changing implementation cost trade-off). Some 32-bit MCUs may still implement only 16-bit data buses to some memory, while high-end processors will have 256-bit or wider buses and can load multiple core registers in a single memory transaction. Wide interfaces are good for burst or streaming operations.
Byte addressability is useful not only for handling byte values in code, but also for working with structures in memory, like Ethernet packets, where specific bytes need to be read or modified. Frequently this sort of code needs to perform many small operations very efficiently.
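To make the Ethernet example concrete, here is a small sketch of reading the EtherType field of a frame held in a byte-addressed buffer (the field sits at byte offsets 12–13, after the two 6-byte MAC addresses); the function name is illustrative:

```c
#include <stdint.h>

/* Sketch: reading the EtherType field of an Ethernet frame.
 * Byte addressing lets us index straight to offsets 12 and 13
 * instead of loading wider words and shifting by hand. */
#define ETHERTYPE_OFFSET 12u

uint16_t get_ethertype(const uint8_t *frame)
{
    /* Network byte order is big-endian: high byte comes first. */
    return (uint16_t)((frame[ETHERTYPE_OFFSET] << 8)
                     | frame[ETHERTYPE_OFFSET + 1]);
}
```

The same indexing style works for patching checksums, flags, or address bytes in place.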
There are also scenarios where it is necessary to operate on big-endian, little-endian or mixed-endian data. There is often dedicated hardware support for this now, but again, byte addressing of memory makes this type of operation more efficient in some scenarios.
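A byte swap is the canonical example: converting a 32-bit value between big- and little-endian by picking its bytes apart. Many CPUs provide a single instruction for this (e.g. REV on ARM), but the byte-level view is what makes the operation natural to express; this is a generic sketch, not tied to any particular architecture:

```c
#include <stdint.h>

/* Sketch: reverse the byte order of a 32-bit value.
 * Each mask isolates one byte, which is then moved to
 * its mirrored position. */
uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}
```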
It is only fairly recently that the number of address bits in a register has been a limiting factor for the address space, so "wasting" 2 bits to address bytes rather than 32-bit words would not have been much of a concern 10–15 years ago (and now, with 64-bit pointers, it's common to implement only 48- or 56-bit-wide byte addresses). Introductory computer science teaching is still a little stuck in the just-post-mainframe age, and doesn't always address the evolution aspects very clearly. A lot of terminology came into use (and definition) around the time that low-volume, high-cost architectures (in the most general sense) started to be complemented by more resource-constrained, commodity-focused processor designs.
I've not answered specifically for MCUs because the architectural boundaries are not as clear as you might assume. Even a modern ground-up MCU design has a good chance of being integrated alongside a many-core server processor, or of existing as only one point in a scalable set of products; either way, a consistent approach to accessing memory benefits the end user who needs to write or port code.
I asked a question on the retrocomputing SE about register sizes to follow up on the historical aspects of this question.