What does $-p \ln p$ mean if $p$ is a probability?
Let's say you wanted to compress the results of a sequence of independent trials into a sequence of bits.
Then the "ideal" encoding of the result of the trials would have $-\log_2 p_i$ bits for event $i$. This is in the limit, as the number of trials approaches infinity.
Now, what is the expected number of bits per trial? Since a trial takes $-\log_2 p_i$ bits with probability $p_i$, the expected number is $-\sum p_i\log_2 p_i$. That is, if you want to encode $N$ occurrences of this event, you will need, on average, $-N\sum p_i\log_2 p_i$ bits with the best possible encoding.
You can see this most clearly when the $p_i$ are all of the form $\frac{1}{2^{k_i}}$.
For example, if $p_1=1/2, p_2=1/4, p_3=1/4$, then an "ideal" encoding uses $0$ for event $1$, $10$ for event $2$, and $11$ for event $3$. The expected bits per trial is then $\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2+\frac{1}{4}\cdot 2 = -\sum p_i\log_2 p_i=\frac{3}{2}$. This means that, with $N$ trials of this sort, the expected number of bits needed to store the results is $\frac{3}{2}N$.
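If you want to check this numerically, here is a minimal sketch in Python (not part of the original argument; the names `probs` and `code` are just for illustration). It simulates $N$ trials, encodes them with the prefix code above, and compares the average bits per trial to the entropy:

```python
import math
import random

# The prefix code from the example: event 1 -> '0', event 2 -> '10', event 3 -> '11'.
probs = [1/2, 1/4, 1/4]
code = {0: "0", 1: "10", 2: "11"}

# Entropy: the expected bits per trial under the ideal encoding.
entropy = -sum(p * math.log2(p) for p in probs)

# Simulate N independent trials and encode them all into one bit string.
N = 100_000
trials = random.choices(range(3), weights=probs, k=N)
encoded = "".join(code[t] for t in trials)

print(f"entropy        = {entropy}")           # 1.5
print(f"bits per trial = {len(encoded) / N}")  # ~1.5 for large N
```

For large $N$ the measured bits per trial converges to $\frac{3}{2}$, matching the computation above.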
So entropy is also part of what mathematicians call "information theory": the entropy of a system tells you how much information, on average, is needed to describe its results.
Now, if your probabilities are not so nice, then you have to encode more cleverly. For example, if $p_1=p_2=p_3=\frac{1}{3}$, then you don't get "ideal" storage by encoding the trials one at a time. But if you take, say, five bits at a time, you can store three results: $5$ bits give $2^5=32$ values, enough to distinguish the $3^3=27$ possible outcomes of three trials. In $8$ bits, you can store the result of $5$ trials, since $3^5=243\leq 256=2^8$. In general, $m$ bits can store $\lfloor\log_3(2^m)\rfloor$ results. So to store $n$ results, you need $m$ bits with $\log_3(2^m)\geq n$, which is $$m\geq \frac{n}{\log_3 2} = n\log_2 3 = -n\sum p_i\log_2 p_i$$
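Here is a small sketch of that block idea (the helper `bits_for_block` is hypothetical, just for illustration): pack $n$ base-$3$ results into a single integer and count the bits it needs, which matches $\lceil n\log_2 3\rceil$.

```python
import math

# Pack a sequence of base-3 trial results into one integer, then count
# how many bits that integer needs. This is the "smarter" block encoding:
# m bits suffice for n trials whenever 2^m >= 3^n.
def bits_for_block(results):
    value = 0
    for r in results:  # each r is 0, 1, or 2
        value = value * 3 + r
    return max(value.bit_length(), 1)

n = 100
worst_case = [2] * n                # the largest possible packed value, 3^n - 1
print(bits_for_block(worst_case))   # 159 bits
print(math.ceil(n * math.log2(3)))  # 159 = ceil(100 * log2(3)) as well
```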
So $-p_i\log_2 p_i$ is not really the significant thing. The significant thing is storing the result $i$ in $-\log_2 p_i$ bits. In general, if you store event $i$ in (an average of) $b_i$ bits, then the expected number of bits for a single trial is:
$$\sum p_i b_i$$
It's just that the ideal storage, which minimizes the expected number of bits over a huge number of trials, is $b_i=-\log_2 p_i$.
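To see that $b_i=-\log_2 p_i$ really does win, here is a toy comparison (a sketch, not a proof; any integer lengths satisfying the Kraft inequality $\sum 2^{-b_i}\leq 1$ correspond to some prefix code) for the $(1/2,1/4,1/4)$ distribution above:

```python
probs = [1/2, 1/4, 1/4]

# Expected bits per trial if event i is stored in b_i bits.
def expected_bits(b):
    return sum(p * bi for p, bi in zip(probs, b))

# Candidate length assignments; each satisfies the Kraft inequality
# sum(2^-b_i) <= 1, so a prefix code with these lengths exists.
candidates = {
    "ideal b_i = -log2 p_i": [1, 2, 2],
    "uniform 2-bit code":    [2, 2, 2],
    "lopsided code":         [1, 2, 3],
}
for name, b in candidates.items():
    print(f"{name}: {expected_bits(b)} bits/trial")
# ideal: 1.5, uniform: 2.0, lopsided: 1.75 -- the ideal lengths do best.
```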