Pattern Recognition and Machine Learning (Bishop) - Exercise 1.28
After some hours of research, I've found a few sites that together answer these questions.
Regarding items 1 and 2, it looks like there is indeed a severe abuse of notation every time the author refers to the function $h$. This function seems to be the so-called self-information, which is usually defined on events, and on random variables as well. I found this article very clarifying in this respect.
Regarding item 4, from what I have seen, it seems that under certain conditions that the self-information function must satisfy, the logarithm is the only possible choice. The selected answer in this post was particularly useful, as were the comments on the question. This topic is also discussed here, but I prefer the previous link.
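As a rough sketch of how that argument goes, assume $h$ is treated as a function of a probability value $p \in (0, 1)$, that it is additive over independent events, $h(pq) = h(p) + h(q)$, and that it is continuous. These assumptions are my reading of the exercise, and the reference value $p_0 \in (0, 1)$ below is introduced only for illustration:

$$
\begin{aligned}
h(p^2) &= h(p \cdot p) = 2\,h(p), \qquad h(p^n) = h(p^{n-1}) + h(p) = n\,h(p) \\
h(p) &= h\big((p^{1/m})^m\big) = m\,h(p^{1/m}) \;\Rightarrow\; h(p^{n/m}) = n\,h(p^{1/m}) = \tfrac{n}{m}\,h(p) \\
h(p^x) &= x\,h(p) \quad \text{for real } x > 0 \text{ (by continuity)} \\
h(p) &= h\big(p_0^{\ln p / \ln p_0}\big) = \frac{h(p_0)}{\ln p_0}\,\ln p \;\propto\; \ln p
\end{aligned}
$$

The constant $h(p_0)/\ln p_0$ is negative whenever $h(p_0) > 0$, which gives the usual form $h(p) = -k \ln p$ with $k > 0$.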
Finally, I have not found an answer for item 3. Actually, I really think that this step is incorrectly formulated because of the imprecise definition of the function $h$. Nevertheless, the links I have provided as an answer to item 4 lead to the desired result.