Intuition of information theory
I believe that focusing on this first part of your question could be a good starting point.
If we are dealing with a given process $X$, then we would like to better comprehend and characterize it. The mutual information is a measure of uncertainty reduction of our knowledge of the process $X$ when a second process, let us say $Y$, is available. If $X$ and $Y$ are independent, then knowing $Y$ would give us no extra information on $X$, and no uncertainty reduction would occur. On the contrary, if $X$ and $Y$ are somehow related, then information from $Y$ is useful to "better define" the original process $X$.
The mutual information formalizes the above statements. Writing it as $I(X,Y)=H(X)-H(X|Y)$, we get $I(X,Y)=0$ if $X$ and $Y$ are independent since, in this case, $H(X|Y)=H(X)$: we have no improvement on the knowledge of $X$.
If $X$ and $Y$ are not independent, then $I(X,Y)>0$ (Jensen's inequality gives $I(X,Y)\ge 0$, with equality only under independence): we have an uncertainty reduction, and the knowledge of $Y$ is useful to better understand $X$.
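To make the two cases concrete, here is a minimal sketch that computes the mutual information from a joint probability table. The function names and the two example tables are my own illustrative choices, and the identity $I(X,Y)=H(X)+H(Y)-H(X,Y)$ used in the code is just a rearrangement of $H(X)-H(X|Y)$.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a list of probabilities (zeros are skipped)."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def mutual_information(joint):
    """Mutual information in bits from a joint pmf given as joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    pxy = [q for row in joint for q in row]   # flattened joint distribution
    # I(X,Y) = H(X) + H(Y) - H(X,Y), which equals H(X) - H(X|Y)
    return entropy(px) + entropy(py) - entropy(pxy)


# Hypothetical example tables (not taken from any data set).
independent = [[0.25, 0.25], [0.25, 0.25]]   # p(x, y) = p(x) p(y)
dependent = [[0.4, 0.1], [0.1, 0.4]]         # Y tends to agree with X

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

In the first table, knowing $Y$ tells us nothing about $X$ and the result is zero; in the second, the result is strictly positive.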
In this framework the "absolute" uncertainty of a given process, let us say $X$, is measured by its entropy $H(X)$. If this concept is not entirely clear, I would suggest reading the Wikipedia page on self-information and "surprisal":
http://en.wikipedia.org/wiki/Surprisal
Note that mutual information is not a distance in the pure mathematical sense: it is a measure of "distance", or a distance-like function (for instance, $I(X,X)=H(X)$ rather than $0$). If you want to define a distance between processes $X$ and $Y$, you need to introduce the variation of information.
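As a rough illustration of the difference, the sketch below computes the variation of information, using its standard form $VI(X,Y)=H(X|Y)+H(Y|X)=2H(X,Y)-H(X)-H(Y)$. The function names and example tables are assumptions of mine, chosen only to show the behavior at the two extremes.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a list of probabilities (zeros are skipped)."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def variation_of_information(joint):
    """VI(X,Y) = 2 H(X,Y) - H(X) - H(Y) in bits, from a joint pmf joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    hxy = entropy([q for row in joint for q in row])
    return 2 * hxy - entropy(px) - entropy(py)


# Hypothetical example tables for two fair binary variables.
identical = [[0.5, 0.0], [0.0, 0.5]]         # Y is an exact copy of X
independent = [[0.25, 0.25], [0.25, 0.25]]   # Y carries no information on X

vi_identical = variation_of_information(identical)
vi_independent = variation_of_information(independent)
print(vi_identical, vi_independent)
```

For two identical variables the variation of information is zero, as a distance should be, whereas the mutual information of the same pair equals the full entropy $H(X)$.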
This last fact can be a bit "disturbing/annoying": why am I supposed to talk about distances when I use functions that are not distances themselves? A related topic is the use of divergences (which are related to mutual information) vs. distances in information geometry.
The Cover and Thomas book is a very good textbook. If you are interested in the geometry behind information theory you can read "Methods of Information Geometry" by Amari and Nagaoka.
If you are interested in applications of entropies and reduction of uncertainty, why not consult the book "Introduction to Clustering Large and High-Dimensional Data" by Kogan? Chapters 6-7-8 provide useful applications.
Since you have an intuitive understanding of entropy based on the compression theorem, you should look into the operational meaning of mutual information, which is given by the channel coding theorem. It says that if you have a noisy channel with a joint distribution $p(X,Y)$, then information encoded in $X$ can be transmitted reliably to a receiving party with access to $Y$ at rates up to $I(X;Y)$ bits per symbol.
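As a small worked example, consider a binary symmetric channel, which is my choice of illustration and not something from the question: each transmitted bit is flipped with probability `flip_prob`. The sketch below evaluates $I(X;Y)=H(Y)-H(Y|X)$ for this channel, assuming a uniform input by default (the input distribution that maximizes $I(X;Y)$ for this particular channel).

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_mutual_information(flip_prob, p_one=0.5):
    """I(X;Y) in bits for a binary symmetric channel with P(X=1) = p_one."""
    # Probability that the received bit is 1.
    p_y_one = p_one * (1 - flip_prob) + (1 - p_one) * flip_prob
    # H(Y|X) is the entropy of the flip noise, whatever the input is.
    return binary_entropy(p_y_one) - binary_entropy(flip_prob)


# Hypothetical flip probabilities: noiseless, moderately noisy, pure noise.
rates = {eps: bsc_mutual_information(eps) for eps in (0.0, 0.11, 0.5)}
for eps, rate in rates.items():
    print(f"flip probability {eps}: I(X;Y) = {rate:.3f} bits per symbol")
```

A noiseless channel carries one full bit per symbol, a channel that flips bits half of the time carries nothing, and a flip probability of about 0.11 roughly halves the usable rate.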
Christopher Olah wrote an excellent intuitive explanation of information theory called "Visual Information Theory". It provides thoughtful visualizations for understanding these concepts.
In addition, there is a paper that introduces a tool for visualizing mutual information, "The Mutual Information Diagram for Uncertainty Visualization", which may be useful.