Why can't I MitM a Diffie-Hellman key exchange?
Diffie-Hellman is a key exchange protocol but does nothing about authentication.
There is a high-level, conceptual way to see that. In the world of computer networks and cryptography, all you can see, really, are zeros and ones sent over some wires. Entities can be distinguished from each other only by the zeros and ones that they can or cannot send. Thus, user "Bob" is really defined only by his ability to compute things that non-Bobs cannot compute. Since everybody can buy the same computers, Bob can be Bob only by his knowledge of some value that only Bob knows.
In the raw Diffie-Hellman exchange that you present, you talk to some entity that is supposed to generate a random secret value on-the-fly, and use that. Everybody can do such random generation. At no place in the protocol is there any operation that only a specific Bob can do. Thus, the protocol cannot achieve any kind of authentication -- you don't know who you are talking to. Without authentication, impersonation is feasible, and that includes simultaneous double impersonation, better known as Man-in-the-Middle. At best, raw Diffie-Hellman provides a weaker feature: though you do not know who you are talking to, you still know that you are talking to the same entity throughout the session.
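The double impersonation can be sketched in a few lines of Python. This is only a toy illustration: the modulus `P`, the base `G`, the helper names `dh_keypair` and `dh_shared`, and the attacker "Mallory" are all assumptions made for the example, and the group is far too small for real use.

```python
import secrets

# Toy parameters (assumption): P is the Mersenne prime 2**127 - 1, G an arbitrary base.
P = 2**127 - 1
G = 3

def dh_keypair():
    """Return (private exponent x, public value G^x mod P)."""
    x = secrets.randbelow(P - 2) + 1
    return x, pow(G, x, P)

def dh_shared(private, peer_public):
    """Compute the shared secret from our exponent and the peer's public value."""
    return pow(peer_public, private, P)

# Honest parties generate their random values on the fly.
a, A = dh_keypair()   # Alice
b, B = dh_keypair()   # Bob

# Mallory can do exactly the same random generation, twice.
m1, M1 = dh_keypair()  # delivered to Bob in place of A
m2, M2 = dh_keypair()  # delivered to Alice in place of B

alice_key = dh_shared(a, M2)  # Alice believes she shares this with Bob
bob_key = dh_shared(b, M1)    # Bob believes he shares this with Alice

# Mallory derives both keys and can decrypt, read, and re-encrypt all traffic.
mallory_key_with_alice = dh_shared(m2, A)
mallory_key_with_bob = dh_shared(m1, B)

print(alice_key == mallory_key_with_alice)
print(bob_key == mallory_key_with_bob)
```

Nothing in the exchange requires a value that only Bob knows, which is why Mallory's substitution goes unnoticed.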
A single cryptographic algorithm won't get you far; any significant communication protocol will assemble several algorithms so that some definite security characteristics are achieved. A prime example is SSL/TLS; another is SSH. In SSH, a Diffie-Hellman key exchange is used, but the server's public part (its g^b mod p) is signed by the server. The client knows that it talks to the right server because the client remembers (from a previous initialization step) the server's public key (usually of type RSA or DSA); in the model explained above, the rightful server is defined and distinguished from imitators by its knowledge of the signature private key corresponding to the public key remembered by the client. That signature provides the authentication; the Diffie-Hellman exchange then produces a shared secret that will be used to encrypt and protect all the data exchanges for that connection (using some symmetric encryption and MAC algorithms).
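The idea of a signed Diffie-Hellman value can be sketched as follows. This is a simplified model, not the actual SSH wire protocol (as far as I know, real SSH signs a hash covering more of the exchange than just g^b mod p). The "host key" here is textbook RSA over two small Mersenne primes, chosen purely so the example runs with the standard library; the names `server_sign` and `client_verify` and all parameters are assumptions, and none of this is secure.

```python
import hashlib
import secrets

# Toy DH group (assumption, not a real SSH group).
P = 2**127 - 1
G = 3

# Toy textbook-RSA host key (assumption, deliberately insecure).
RSA_P, RSA_Q = 2**61 - 1, 2**89 - 1
N = RSA_P * RSA_Q
E = 65537                                   # public exponent, remembered by the client
D = pow(E, -1, (RSA_P - 1) * (RSA_Q - 1))   # private exponent, known only to the server

def digest(dh_public):
    """Hash a DH public value and reduce it into the RSA modulus range."""
    data = dh_public.to_bytes(16, "big")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N

def server_sign(dh_public):
    """Only the holder of D can compute this."""
    return pow(digest(dh_public), D, N)

def client_verify(dh_public, signature):
    """Anyone who remembers (N, E) can check the signature."""
    return pow(signature, E, N) == digest(dh_public)

# The genuine server signs its ephemeral public value.
b = secrets.randbelow(P - 2) + 1
B = pow(G, b, P)
sig = server_sign(B)
genuine_ok = client_verify(B, sig)

# Mallory substitutes her own value but can only replay the old signature.
m = secrets.randbelow(P - 2) + 1
M = pow(G, m, P)
forged_ok = client_verify(M, sig)

print(genuine_ok, forged_ok)
```

The client rejects Mallory's value because producing a matching signature requires the private exponent, which is exactly the "value that only Bob knows" from the model above.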
Thus, while Diffie-Hellman does not do everything you need by itself, it still provides a useful feature, namely a key exchange, that you would not obtain from digital signatures, and that provides the temporary shared secret needed to encrypt the actually exchanged data.
Tom has provided a good explanation as to why Diffie-Hellman cannot be safe against man-in-the-middling. Now this answers the OP's original question but probably leaves some readers with the (reasonable) follow-up question: Why don't we just use public-key (asymmetric) cryptography to ensure the confidentiality of our messages, and drop D-H altogether? There are some reasons not to do this:
- There are algorithms that support only signing, but not encrypting messages (ECDSA, for example)
- Symmetric encryption and decryption are a lot faster than their asymmetric counterparts
- Probably most important is that we want to ensure forward secrecy. After all, it's not impossible that the private key of one of your communication partners is compromised at some point. Now if you only relied on asymmetric encryption, all messages you ever sent to that partner could be decrypted by the attacker in retrospect. In contrast, if we use Diffie-Hellman (to be precise, ephemeral Diffie-Hellman), we generate a new D-H key pair for each communication session and throw it away (that is, do not store it) afterwards, so a later compromise of a long-term private key does not let the attacker decrypt previously recorded sessions.
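The ephemeral pattern from the last point can be sketched like this. The group parameters and the function name `run_session` are assumptions for illustration, and letting Python variables go out of scope is only a stand-in for securely erasing key material.

```python
import hashlib
import secrets

# Toy parameters (assumption), far too small for real use.
P = 2**127 - 1
G = 3

def run_session():
    """Run one ephemeral DH exchange; return (public transcript, session key)."""
    a = secrets.randbelow(P - 2) + 1   # Alice's one-time exponent
    b = secrets.randbelow(P - 2) + 1   # Bob's one-time exponent
    A, B = pow(G, a, P), pow(G, b, P)  # the only values sent over the wire
    shared = pow(B, a, P)
    assert shared == pow(A, b, P)      # both sides compute the same secret
    session_key = hashlib.sha256(shared.to_bytes(16, "big")).digest()
    # a and b are discarded when the function returns; nothing long-term is stored.
    return (A, B), session_key

transcript1, key1 = run_session()
transcript2, key2 = run_session()
```

An attacker who recorded `transcript1` and later steals a long-term signing key still lacks the discarded exponents needed to recompute `key1`.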
After a DH key exchange, both parties know what key they've computed. If no man-in-the-middle has infiltrated the connection, both parties will have the same key. If the connection has been breached, they will have different keys. If there is a means by which one party can ask the other what key it is using, the man in the middle will only be able to remain undetected if it is able to respond in the same fashion as the legitimate party would have done. While the question is often answered using a digital signature, to make impersonation difficult, the question may also be asked and answered via things like voice communication. If a voice application shows the participants the current encryption key and a participant arbitrarily selects a range and a popular movie star (e.g. Marilyn Monroe), and asks the other to read the fifteenth through twenty-fifth digits in their best Marilyn Monroe voice, a real participant who has the numbers in front of him would be able to do so quickly and fluently and, in the absence of an MITM attack, the digits would match those seen by the first party. A man-in-the-middle attacker would have no problem detecting the question, and--given time--might be able to forge a voice file of the legitimate communicant doing a bad imitation of Marilyn Monroe saying the appropriate digits, but would have a hard time doing that as quickly as the real participant.
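The digit comparison can be sketched as below. How the key is rendered as digits is an assumption made for this example (a SHA-256 hash shown as a 78-digit decimal string), as are the helper names `key_digits` and `read_aloud` and the sample key values.

```python
import hashlib

def key_digits(shared_key: int) -> str:
    """Render a shared key (below 2**128) as a fixed-length decimal string."""
    digest = hashlib.sha256(shared_key.to_bytes(16, "big")).digest()
    return str(int.from_bytes(digest, "big")).zfill(78)

def read_aloud(shared_key: int, first: int, last: int) -> str:
    """Return digits number `first` through `last` (1-based, inclusive)."""
    return key_digits(shared_key)[first - 1:last]

# No attacker: both ends computed the same key, so the requested digits match.
alice_key = bob_key = 0x1234567890ABCDEF
match_without_mitm = read_aloud(alice_key, 15, 25) == read_aloud(bob_key, 15, 25)

# With a man in the middle, Bob holds a different key (hypothetical value),
# so the digits he reads will almost certainly differ from what Alice sees.
bob_key_under_mitm = 0xFEDCBA0987654321
match_with_mitm = read_aloud(alice_key, 15, 25) == read_aloud(bob_key_under_mitm, 15, 25)

print(match_without_mitm, match_with_mitm)
```

The code only covers the comparison itself; the security of the scheme rests on the attacker being unable to fake the spoken answer quickly enough.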
In short, DH by itself can be robust against MITM attacks if each participant knows something the other participant will be able to do with a number more efficiently than an attacker. Other protocols are generally used in conjunction with DH, however, because it is useful for the authentication process to be automated, and most forms of authentication that wouldn't rely upon encryption (things like voice, phrasing, etc.) require human validation. Further, it's often necessary for entities to solicit communication from strangers. If I want to talk to an Acme Bank representative, a man-in-the-middle impostor could set up a fake "Acme Bank" office and take my call, and have someone else in a phony living room relay everything I say to the real Acme Bank, and nobody would be the wiser. If I have no idea how well or poorly a real Acme Bank employee would be able to imitate Marilyn Monroe, I'd have no way of knowing that an impostor's imitation wasn't the same.