Is Shannon's "A Mathematical Theory of Communication" worth reading for a beginner in information theory?
Nothing in Shannon's paper is incompatible with modern treatments, though much has since been cleaned up and streamlined* - C&T is perhaps one of the best at this. Keep in mind, though, that Shannon wrote a paper, and papers are never as easy to read as a book if one is not used to them. That said, the paper is wonderful. Each time I read it I feel as if I've understood things better, and see them a little differently. Definitely pay close attention to the exposition whenever you do read it.
Be warned that the following is necessarily speculative, and a little unrelated to your direct question.
The reason that C&T don't go into why entropy is defined the way it is in Ch. 2 is philosophical. Usually (and this is the 'incentive' of Shannon that you mention), the justification for entropy is that there are a few natural properties one wants from a measure of information - key among them are continuity, and that the 'information' of two independent sources should be the sum of their individual 'information's - and once one posits these axioms (together with a mild grouping or monotonicity condition), it is a simple theorem that entropy is the unique functional, up to scalar multiplication, satisfying them.
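For reference, here is roughly the statement being gestured at (a sketch only; the exact axiom set varies by author - compare Shannon's Theorem 2 with C&T's Problem 2.46): any $H$ that is continuous in the $p_i$, increasing in $n$ on uniform distributions, and consistent under grouping of outcomes must be of the form
$$ H(p_1, \dots, p_n) = -K \sum_{i=1}^{n} p_i \log p_i, \qquad K > 0, $$
and additivity over independent sources, $H(X, Y) = H(X) + H(Y)$, then falls out as a consequence.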
However, there's a (large**) school in information theory that rejects the centrality of the above. It argues that the utility of any information measure lies in its operational consequences. (This, I think, arises from the fact that the origins - and practice! - of information theory are very much in engineering, not mathematics, and we engineers, even fairly mathematical ones, are ultimately interested in what one can do with wonderful maths, not merely in how wonderful it is.***) According to this view, the basic reason we define entropy, and the reason it is such a natural object, is the Asymptotic Equipartition Property (along with other nice operational facts; see the sketch below). So you'll find that much of Chapter 2 is a (relatively) dry development of facts about various information measures (except perhaps the material on Fano's inequality, which is more directly applicable), and that the subject really comes alive in Chapter 3. I'd suggest reading that before you give up on the book - maybe even skip ahead and then come back to Ch. 2.
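To make the operational point concrete, here is a minimal numerical sketch of the AEP (plain NumPy, with a Bernoulli(0.2) source as an arbitrary choice of mine, nothing specific to C&T): for long i.i.d. sequences, the per-symbol "surprise" $-\frac{1}{n}\log_2 p(X^n)$ concentrates around the entropy $H(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.2                                              # Bernoulli source: P(X=1) = p (arbitrary choice)
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))     # entropy in bits per symbol

for n in (10, 100, 10_000):
    x = rng.random((2_000, n)) < p                   # 2000 i.i.d. length-n sequences
    k = x.sum(axis=1)                                # number of ones in each sequence
    # per-symbol surprise -(1/n) log2 P(x^n) for each sequence
    surprise = -(k * np.log2(p) + (n - k) * np.log2(1 - p)) / n
    print(f"n={n:6d}  H={H:.3f}  mean={surprise.mean():.3f}  std={surprise.std():.3f}")
```

As $n$ grows, the normalised log-probabilities pile up near $H$, which is why roughly $2^{nH}$ 'typical' sequences is the right count for compression - the operational meaning of entropy that Chapter 3 develops.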
I'd argue that Cover and Thomas subscribe to the above view. See, for instance, the concluding sentences of the introduction to Ch. 2:
In later chapters we show how these quantities arise as natural answers to a number of questions in communication, statistics, complexity, and gambling. That will be the ultimate test of the value of these definitions.
and the following from the bottom of page 14 (in the 2nd edition), a little after entropy is defined (following (2.3)):
It is possible to derive the definition of entropy axiomatically by defining certain properties that the entropy of a random variable must satisfy. This approach is illustrated in Problem 2.46. We do not use the axiomatic approach to justify the definition of entropy; instead, we show that it arises as the answer to a number of natural questions, such as “What is the average length of the shortest description of the random variable?”
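To see the "shortest description" question in action, here is a toy sketch of my own (a made-up dyadic distribution, not an example from the book): build a Huffman code and compare its expected length with $H$. For dyadic probabilities the two coincide exactly; in general $H \le E[\text{length}] < H + 1$.

```python
import heapq, math

# Hypothetical source distribution (my choice, not the book's example).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

H = -sum(p * math.log2(p) for p in probs.values())    # entropy in bits/symbol

# Build a Huffman code: repeatedly merge the two least likely nodes.
heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    p1, _, c1 = heapq.heappop(heap)
    p2, _, c2 = heapq.heappop(heap)
    merged = {s: "0" + w for s, w in c1.items()}       # prefix 0 to one subtree,
    merged.update({s: "1" + w for s, w in c2.items()}) # 1 to the other
    heapq.heappush(heap, (p1 + p2, counter, merged))
    counter += 1
code = heap[0][2]

L = sum(probs[s] * len(w) for s, w in code.items())    # expected code length
print(f"H = {H:.3f} bits,  E[length] = {L:.3f} bits,  code = {code}")
```

The point of the comparison is only to illustrate the operational reading: entropy is what the best prefix code achieves, per symbol, in expectation.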
*: and some of the proofs have been called into question from time to time - unfairly, I think.
**: I have little ability to make estimates about this, but in my experience this has been the dominant school.
***: Shannon himself exemplified this view in practice. His 1948 paper, for instance, sets out to study communication, not to establish what a notion of information should look like - that's 'just' something he had to come up with along the way.
Shannon's paper is certainly brilliant, relevant today and readable. It's worth reading, but I would not recommend it as an introduction to information theory. For learning, I would stick with Cover & Thomas or any other modern textbook.