What is information-theoretic entropy, and what is its physical significance?
The entropy of a message is a measurement of how much information it carries.
One way of saying this (per your textbook) is to say that a message has high entropy if each word (message sequence) carries a lot of information. Another way of putting it is saying that if we don't get the message, we lose a lot of information; i.e., entropy is a measure of the number of different things that message could have said. All of these definitions are consistent, and in a sense, the same.
To your first question: the entropy of each letter of the English language is about $2$ bits, whereas a Hindi letter apparently carries about $3$ bits.
The question this measurement answers is essentially the following: take a random sentence in English or Hindi and delete a random letter. On average, how many possible letters might we expect to fit in that blank? In English, roughly $2^2 = 4$ equally likely candidates; in Hindi, roughly $2^3 = 8$.
EDIT: the simplest way to explain these measurements is that it would take, on average, $2$ yes/no questions to deduce a missing English letter and $3$ yes/no questions to deduce a missing Hindi letter. Equivalently, there are on average twice as many Hindi letters (on "average", $2^3=8$ letters) that can fill in a randomly deleted letter in a Hindi passage as there are English letters (on "average", $2^2=4$ letters) that can fill one in an English passage. See also Chris's comment below for another perspective.
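If you want to see roughly where a number like that comes from, here is a minimal sketch in Python. The frequency table is approximate, and a single-letter model ignores context, so it overestimates the per-letter entropy; it is conditioning on the surrounding letters that brings the figure down toward the $\approx 2$ bits quoted above.

```python
import math

# Approximate relative frequencies of English letters (percentages; a
# unigram model only -- context is ignored, so this overestimates entropy).
freq = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
    's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
    'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
    'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
    'q': 0.10, 'z': 0.07,
}

total = sum(freq.values())
probs = [f / total for f in freq.values()]

# Shannon entropy in bits: H = -sum(p * log2(p))
H = -sum(p * math.log2(p) for p in probs)

print(f"Entropy per letter (single-letter model): {H:.2f} bits")
print(f"Effective number of equally likely letters: 2^H = {2 ** H:.1f}")
```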
For a good discussion of this stuff in the context of language, I recommend taking a look at this page.
As for (2), I don't think I can answer that satisfactorily.
As for (3), there's a lot to be done along the same lines as language. Just as we measure the entropy per word, we could measure the entropy per musical phrase or per base pair. This could give us a way of measuring the importance of damaged or missing DNA, or the number of musically appealing ways to end a symphony. An interesting question to ask about music is: will we ever run out? (video).
Password strength comes down to the following question: how many passwords does a hacker have to guess before he can expect to break in? This is very much answerable via entropy.
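As a rough sketch of that calculation (assuming, unrealistically, that passwords are drawn uniformly at random from a fixed alphabet; the function name and the example parameters are just for illustration):

```python
import math

def password_entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy of a uniformly random password: each character
    contributes log2(alphabet_size) bits."""
    return length * math.log2(alphabet_size)

# Hypothetical example: 10 characters from lowercase letters plus digits.
bits = password_entropy_bits(alphabet_size=36, length=10)

# With a uniform distribution, an attacker needs about half the keyspace
# on average, i.e. roughly 2^(bits - 1) guesses.
expected_guesses = 2 ** (bits - 1)

print(f"Entropy: {bits:.1f} bits")
print(f"Expected number of guesses: about {expected_guesses:.2e}")
```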
I hope that helps.
Regarding number 2:
If you can compress a message, it means it can be conveyed in a shorter way: some of the bits are not needed. In the compressed form, the message contains the same amount of information in fewer bits, so its entropy per bit is higher (each remaining bit is now more likely to be important to conveying the message).
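Here is a quick way to see that effect empirically, using Python's zlib. The per-byte figure below is only a zero-order estimate (it looks at the byte distribution and ignores correlations between bytes), but it shows the direction of the change.

```python
import math
import zlib
from collections import Counter

def entropy_per_byte(data: bytes) -> float:
    """Zero-order empirical entropy of the byte distribution, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A deliberately redundant message; any repetitive English text behaves similarly.
message = b"the quick brown fox jumps over the lazy dog " * 200
compressed = zlib.compress(message)

print(f"Original:   {len(message):5d} bytes, {entropy_per_byte(message):.2f} bits/byte")
print(f"Compressed: {len(compressed):5d} bytes, {entropy_per_byte(compressed):.2f} bits/byte")
# Same information, fewer bytes: each compressed byte carries more entropy
# and the compressed data look much more like random noise.
```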
The "disorder" you refer to isn't disorder in a physical sense: it's a fluffy way of talking about randomness. Chemists and physicists talk about entropy a lot, meaning how spread-out or randomly distributed is the energy in a system. It's related mathematically to the information-theoretic sense of entropy, but of course you need to think in terms of different analogies.
So, now think about randomness instead of disorder. A random sequence has high entropy because, unlike English, it's very difficult to guess the next symbol/number/letter in the sequence. When you compress data, you try to reduce the redundancies. This raises the entropy per symbol, because it then becomes harder to guess the next symbol. It also makes the data look more like random data. The more compressed the data are, the more random they look.
Similarly, a randomly chosen password is hard to guess. There are many equally likely possibilities for the password: it has high entropy. But if the password is much more likely to be a dictionary word, it has lower entropy, because some possibilities are much more likely than others.
To make a simpler example, let's take a "password" that consists of a single digit 0-9. If the password is equally likely to be any digit, then the Shannon entropy $-\sum_i p_i \log_2 p_i$ comes out as $-10\times(0.1 \log_2 0.1) \approx 3.3$ bits.
Now, let's say people choose the prettiest digit. Half of the time, they choose 0, and the other half of the time, they choose one of the other digits at random. That is, one outcome has probability $\tfrac{1}{2}$, and nine outcomes have probability $\tfrac{1}{9\times2}$. This password is much easier to guess: if you guess 0, you'll be right half of the time. And this time, the Shannon entropy comes out as about 2.58 bits. This is lower, reflecting how much easier the password is to guess.
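If you want to check those two figures yourself, a few lines of Python will do it (the helper name here is just for illustration):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform single-digit password: each of the 10 digits with probability 0.1.
print(f"Uniform digits: {shannon_entropy([0.1] * 10):.2f} bits")         # ~3.32

# Skewed choice: 0 with probability 1/2, the other nine digits with 1/18 each.
print(f"Skewed digits:  {shannon_entropy([0.5] + [1/18] * 9):.2f} bits")  # ~2.58
```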
Of course, just like randomness, entropy depends on how you're modelling the input: that is, what probability you think each symbol has. If an attacker didn't know that the password was 0 half the time, he'd still find it just as hard to guess as a random password.