How does the USB protocol work?
I'll try to answer your questions in the order you asked them (numbering them may help).
There are indeed four wires (ignoring USB3.x for the moment). Two indeed for power (+5V and GND), and two for signalling (D+ and D-).
The key thing to note about the signal wires is their names: note the + and - (also sometimes P and N, or P and M). In electronics these typically indicate that a signal is differential. What this means is that the 1's and 0's are indicated by the polarity of the voltage between the two wires. This is as opposed to single ended, where the 1's and 0's are carried as a voltage relative to GND.
What do I mean by polarity? Well imagine the D+ cable is driven to \$+3.3\mathrm{V}\$, and the D- is driven to \$0\mathrm{V}\$. The difference between the two is \$V_{D+} - V_{D-} = 3.3 - 0 = 3.3\mathrm{V}\$. Now if instead the D+ cable was driven to \$0\mathrm{V}\$ and the D- driven to \$+3.3\mathrm{V}\$, the difference becomes \$V_{D+} - V_{D-} = 0 - 3.3 = -3.3\mathrm{V}\$. Notice the minus sign, indicating the opposite polarity.
For this to work, the two data wires must be the complement of each other (when one is high, the other is low) to transfer data, and thus must operate at the same frequency. You may think: why bother, just use one wire? The thing is, the world is a rather noisy place, and a single-ended wire is very prone to noise, which at high speeds (or even at low speeds in harsh environments) can corrupt the data (turn a 1 into a 0). In differential signalling, both wires are exposed to nearly the same noise, so it cancels out when the receiver takes the difference!
A quick example. Say the signal you are sending is \$2\mathrm{V}\$ or \$0\mathrm{V}\$. Let's also say on each wire you get \$1\mathrm{V}\$ of noise (unrealistic, but an example). For single ended, your signals at the receiver would be either \$2+1=3\mathrm{V}\$, which is clearly a logic 1, or \$0+1=1\mathrm{V}\$, at which point you have no idea what it was. For differential, however, your signals at the receiver would be either \$(2+1)-(0+1)=2\mathrm{V}\$ or \$(0+1)-(2+1)=-2\mathrm{V}\$, which are both the same as if there was no noise at all!
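To make the arithmetic above concrete, here is a minimal Python sketch. The function names are made up for illustration, and it assumes the idealised case from the example where exactly the same noise voltage lands on both wires.

```python
def single_ended_levels(high_v, low_v, noise_v):
    """Voltages seen at the receiver (relative to GND) for a logic 1 and a logic 0."""
    return high_v + noise_v, low_v + noise_v


def differential_levels(high_v, low_v, noise_v):
    """Difference (D+ minus D-) seen at the receiver for a logic 1 and a logic 0.

    Assumes the same noise is added to both wires, so it drops out of the subtraction.
    """
    one = (high_v + noise_v) - (low_v + noise_v)
    zero = (low_v + noise_v) - (high_v + noise_v)
    return one, zero


# The numbers from the example: 2 V / 0 V signal with 1 V of noise on each wire.
print(single_ended_levels(2.0, 0.0, 1.0))   # (3.0, 1.0)  -> the "0" now sits at 1 V
print(differential_levels(2.0, 0.0, 1.0))   # (2.0, -2.0) -> same as with no noise
```

Real noise is never perfectly identical on both wires, so in practice the cancellation is very good rather than perfect.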
There are other advantages to having two wires. The USB spec also uses the two wires independently for control signalling, e.g. a reset command or an end-of-packet marker (both wires driven low together). These can be clearly distinguished from data by various tricks which are probably too in depth to cover here (it will all be in the USB spec mentioned in the comments).
In USB, both devices need to run at the same frequency. This is usually some multiple of 12MHz, which is why you will see USB devices operating at clock frequencies like 48MHz, or 12MHz, etc. These frequencies are usually generated from a crystal reference, which is typically accurate to around 20ppm (240Hz for a 12MHz crystal). This is generally close enough that, by virtue of synchronisation pulses on the data lines and the fact that data is sent in bursts (meaning drift in frequency doesn't add up over a long time), the two devices can stay in sync. If you were to accidentally use, say, a 16MHz crystal, the device will probably fail to enumerate.
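As a rough sanity check on those numbers, the sketch below converts a ppm tolerance into Hz and estimates how far two clocks could slip apart over one burst. The helper names and the 1000-bit burst length are assumptions picked for illustration, and the slip figure is a pessimistic bound that ignores any resynchronisation during the burst.

```python
def ppm_to_hz(nominal_hz, ppm):
    """Absolute frequency error in Hz for a given tolerance in parts per million."""
    return nominal_hz * ppm / 1_000_000


def drift_in_bit_times(bits, tx_ppm, rx_ppm):
    """Worst-case slip, in bit periods, over a burst of `bits` bits.

    Assumes the two clocks are off in opposite directions and that the
    receiver never resynchronises during the burst.
    """
    return bits * (tx_ppm + rx_ppm) / 1_000_000


print(ppm_to_hz(12_000_000, 20))          # 240.0 Hz, as quoted above
print(drift_in_bit_times(1000, 20, 20))   # a small fraction of one bit period
```

Because the slip over a short burst is only a small fraction of a bit, a pair of ordinary crystals is good enough, whereas a crystal at the wrong nominal frequency is off by far more than any resynchronisation can absorb.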
How does the master know the device needs power? Simple: all USB devices are allowed to draw an amount of current without requesting it - up to \$100\mathrm{mA}\$ as far as I recall. This gives the device enough power to turn on and assert its presence (with a pull up resistor on the D+ line - again probably too in depth). Once the master is aware of the device, it allocates a power allowance to that device and asks if it will need more (e.g. for a high power device, if it wants the \$500\mathrm{mA}\$ high current allowance).
The USB protocol is all in the spec. Trying to go over it all here would probably still be too in depth. Having said that, you might want to check out this answer to another question, which gives some details.
USB 1.x (1.5 Mbit/s and 12 Mbit/s) and 2.0 (480 Mbit/s) use four wires: V\$_{BUS}\$ (+5 V), D+, D- and GND. The D+ and D- form a differential pair. There is also an extension called OTG (On-The-Go) which allows a device to act as either a USB host or device, and this makes use of a fifth ID lead we can ignore.
If the maximum rate of each of the host and device is different, a negotiation takes place and the highest speed common to both is used.
Differential signalling provides for a high degree of noise immunity. The data signals are sent going positive on the D+ line, and negative on the D- line. At the receiving side, the plus lead is sent into the + lead of a comparator (acting as a subtractor), and the minus lead is sent to the - lead of the comparator.
For valid pulses, these two add, since the negative-going pulse was sent into the - lead of the comparator. But if noise is induced on the line, it will in general affect both the D+ and D- lines the same, and the comparator will subtract it out.
The diagram shows data going in only one direction. That is because USB signals are half-duplex: the data only goes in one direction at a time.
USB 3.0 (5 Gbit/s, raised to 10 Gbit/s in USB 3.1) uses nine wires instead of four and provides full-duplex operation, since there are separate differential pairs for transmit and receive.
Devices can draw up to 100 mA without having to "ask". For USB 1.x and 2.0, the device can ask the host to supply up to 500 mA. For USB 3.0, this has been increased to 900 mA. There is a separate Battery Charging specification (no data exchange) that allows up to 1.5 A, and USB Power Delivery can go as high as 5 A in some cases.
USB uses differential signaling to reduce interference and allow high-speed transmission over a long distance. A differential bus is built with two wires, one of them representing the transmitted data and the other its complement. The idea is that the 'average' voltage on the wires does not carry any information, resulting in less interference. Most modern serial buses (USB, Ethernet, PCIe, SATA, etc.) are differential, as the cost of the extra wires is far outweighed by the advantages in noise reduction.
USB 2 is also what is called 'half duplex.' This means that the link only works in one direction at any given moment. Say, your computer can send a message to your mouse. Or your mouse can send a message to your computer. But not both at the same time. The bus has to get 'turned around' each time data has to be sent in the other direction.
As far as clocking is concerned, USB leverages devices called serializers and deserializers. The serializer is responsible for generating the serial data on the transmit end, and the deserializer is responsible for recovering the bits on the other end. Part of the job of the serializer is encoding the data in a way that it will be easy to decode by including framing and synchronization information. Part of the job of the deserializer is to recover the clock signal. USB uses an encoding called NRZI, or non-return-to-zero inverted. In NRZI, logical 0 is represented by a transition and logical 1 is represented by no transition. USB uses bit stuffing to limit the run length of successive 1s sent over the wire: after six consecutive 1s, a 0 is inserted. The result is that there are transitions often enough to recover all of the bits, even if the clocks on the transmitter and receiver do not tick at exactly the same rate. They do have to be relatively close, though. There are also explicit synchronization sequences that are sent along with the data.
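The NRZI and bit-stuffing rules described above can be sketched in a few lines of Python. This is only an illustration: the function names are invented, the line state is modelled as an abstract 1 or 0 rather than real D+/D- voltages, and the idle starting level of 1 is an assumption.

```python
def bit_stuff(bits, max_ones=6):
    """Insert a 0 after every run of `max_ones` consecutive 1s."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == max_ones:
            out.append(0)  # stuffed bit, forces a transition after NRZI encoding
            run = 0
    return out


def bit_unstuff(bits, max_ones=6):
    """Drop the bit that follows every run of `max_ones` consecutive 1s."""
    out, run, skip = [], 0, False
    for b in bits:
        if skip:  # this is the stuffed bit, discard it
            skip, run = False, 0
            continue
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == max_ones:
            skip = True
    return out


def nrzi_encode(bits, start_level=1):
    """Logical 0 toggles the line level, logical 1 leaves it unchanged."""
    level, out = start_level, []
    for b in bits:
        if b == 0:
            level ^= 1
        out.append(level)
    return out


def nrzi_decode(levels, start_level=1):
    """A change of level decodes as 0, no change decodes as 1."""
    prev, out = start_level, []
    for lvl in levels:
        out.append(1 if lvl == prev else 0)
        prev = lvl
    return out


data = [1] * 8                      # eight 1s would give no transitions at all
line = nrzi_encode(bit_stuff(data))
print(line)                         # [1, 1, 1, 1, 1, 1, 0, 0, 0]
print(bit_unstuff(nrzi_decode(line)) == data)  # True
```

Without the stuffed bit the line would sit at one level for all eight bit periods, giving the receiver nothing to resynchronise on.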
As far as power is concerned, the device is allowed to draw a certain amount of power without asking the host. To draw more than that, it is required to negotiate to make sure that the host does not get overloaded. The device must wait until it gets permission before it can draw more current.
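A very simplified model of that negotiation is sketched below. The function and constant names are hypothetical, the current limits are the figures quoted in the answers above (100 mA without asking, 500 mA for USB 2.0, 900 mA for USB 3.0), and a real host makes this decision during enumeration rather than through a single function call.

```python
UNCONFIGURED_LIMIT_MA = 100  # may be drawn without asking
USB2_MAX_MA = 500            # most a USB 1.x/2.0 device can request
USB3_MAX_MA = 900            # most a USB 3.0 device can request


def grant_current(requested_ma, port_budget_ma, port_max_ma=USB2_MAX_MA):
    """Return the current (in mA) the device is allowed to draw.

    Simplified assumption: if the request exceeds either the per-port maximum
    or what the host has left in its budget, the device gets no extra
    permission and must stay within the unconfigured limit.
    """
    if requested_ma <= port_max_ma and requested_ma <= port_budget_ma:
        return requested_ma
    return UNCONFIGURED_LIMIT_MA


print(grant_current(500, port_budget_ma=500))  # 500: request fits, permission given
print(grant_current(500, port_budget_ma=300))  # 100: host cannot spare that much
```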