Intuitive explanation of a definition of the Fisher information

From the way you write the information, it seems that you assume you have only one parameter to estimate ($\theta$) and that you consider a single random variable (the observation $X$ from the sample). This makes the argument much simpler, so I will proceed that way.

You use the information when you want to conduct inference by maximizing the log-likelihood. That log-likelihood is a function of $\theta$ that is random because it depends on $X$. You would like to find a unique maximum by locating the $\theta$ that attains it. Typically, you solve the first-order conditions by setting the score $\frac{\partial\ell \left( \theta ; x \right)}{\partial \theta} = \frac{\partial\log p \left( x ; \theta \right)}{\partial \theta}$ equal to 0. Now you would like to know how accurate that estimate is. The curvature of the log-likelihood around its maximum gives you that information (if the function is sharply peaked around the maximum, you are fairly certain about the estimate; if it is flat, you are quite uncertain). Probabilistically, you would like to know the variance of the score "around there" (this is a heuristic, non-rigorous argument; one can actually show the equivalence between the geometric and the probabilistic/statistical concepts).
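As a concrete illustration (a standard textbook example, not part of the original argument): take a single observation $X \sim N \left( \theta, \sigma^2 \right)$ with $\sigma^2$ known. Then \begin{eqnarray*} \ell \left( \theta ; x \right) & = & - \frac{1}{2} \log \left( 2 \pi \sigma^2 \right) - \frac{\left( x - \theta \right)^2}{2 \sigma^2}, \qquad \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} = \frac{x - \theta}{\sigma^2}, \qquad \frac{\partial^2 \ell \left( \theta ; x \right)}{\partial \theta^2} = - \frac{1}{\sigma^2} \end{eqnarray*} The curvature is $-1/\sigma^2$ everywhere, and the variance of the score is $V \left[ \left( X - \theta \right) / \sigma^2 \right] = 1 / \sigma^2$: both point to the same quantity. A small $\sigma^2$ gives a sharply peaked log-likelihood and a precisely determined maximum, i.e. a high Fisher information $1/\sigma^2$.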

Now, we know that on average the score is zero (see the proof of that point at the end of this answer). Thus \begin{eqnarray*} E \left[ \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \right] & = & 0\\ \int \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} p \left( x ; \theta \right) d x & = & 0 \end{eqnarray*} Take derivatives on both sides and apply the product rule inside the integral (we can interchange the integral and the derivative, but I will not give rigorous conditions for that here): \begin{eqnarray*} \frac{\partial}{\partial \theta} \int \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} p \left( x ; \theta \right) d x & = & 0\\ \int \frac{\partial^2 \ell \left( \theta ; x \right)}{\partial \theta^2} p \left( x ; \theta \right) d x + \int \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \frac{\partial p \left( x ; \theta \right)}{\partial \theta} d x & = & 0 \end{eqnarray*}

The second term on the left-hand side is \begin{eqnarray*} \int \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \frac{\partial p \left( x ; \theta \right)}{\partial \theta} d x & = & \int \frac{\partial \log p \left( x ; \theta \right)}{\partial \theta} \frac{\partial p \left( x ; \theta \right)}{\partial \theta} d x\\ & = & \int \frac{\partial \log p \left( x ; \theta \right)}{\partial \theta} \frac{\frac{\partial p \left( x ; \theta \right)}{\partial \theta}}{p \left( x ; \theta \right)} p \left( x ; \theta \right) d x\\ & = & \int \left( \frac{\partial \log p \left( x ; \theta \right)}{\partial \theta} \right)^2 p \left( x ; \theta \right) d x\\ & = & V \left[ \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \right] \end{eqnarray*}

(Here the second line follows from multiplying and dividing by $p(x;\theta)$. The third line follows from the chain rule applied to the derivative of the log, i.e. $\frac{\partial \log p \left( x ; \theta \right)}{\partial \theta} = \frac{\partial p \left( x ; \theta \right) / \partial \theta}{p \left( x ; \theta \right)}$. The final line follows from the expectation of the score being zero: the variance is then equal to the expectation of the square, with no need to subtract the square of the expectation.)

From this you can see that

\begin{eqnarray*} V \left[ \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \right] & = & - \int \frac{\partial^2 \ell \left( \theta ; x \right)}{\partial \theta^2} p \left( x ; \theta \right) dx\\ & = & - E \left[ \frac{\partial^2 \ell \left( \theta ; x \right)}{\partial \theta^2} \right] \end{eqnarray*}
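If it helps, here is a small simulation sketch (my addition, not part of the original answer) that checks this identity numerically for a single Bernoulli($\theta$) observation, whose Fisher information is $1/\left(\theta(1-\theta)\right)$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                 # true parameter, chosen arbitrarily for this check
x = rng.binomial(1, theta, size=1_000_000)  # i.i.d. Bernoulli(theta) draws

# Score and second derivative of the log-likelihood of a single observation,
# ell(theta; x) = x*log(theta) + (1 - x)*log(1 - theta)
score = x / theta - (1 - x) / (1 - theta)
second_deriv = -x / theta**2 - (1 - x) / (1 - theta) ** 2

print("mean of the score         :", score.mean())          # ~ 0
print("variance of the score     :", score.var())           # ~ 1 / (theta * (1 - theta))
print("-E[second derivative]     :", -second_deriv.mean())  # ~ 1 / (theta * (1 - theta))
print("closed form               :", 1 / (theta * (1 - theta)))
```

The last three printed quantities should agree up to Monte Carlo error.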

Now you can see why summarizing the uncertainty (the curvature) of the log-likelihood around its maximum leads to the particular formula of the Fisher information.

We can even go further and show that the best possible efficiency of the maximum likelihood estimator is governed by the inverse of the information (this is the Cramér–Rao lower bound).
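For completeness (a standard statement, not derived in the original answer): for any unbiased estimator $\hat{\theta}$ of $\theta$, \begin{eqnarray*} V \left[ \hat{\theta} \right] & \geq & \frac{1}{I \left( \theta \right)}, \qquad I \left( \theta \right) = E \left[ \left( \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \right)^2 \right] = - E \left[ \frac{\partial^2 \ell \left( \theta ; x \right)}{\partial \theta^2} \right] \end{eqnarray*} and, under regularity conditions, the maximum likelihood estimator attains this bound asymptotically.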


To answer an additional question by the OP, I will show that the expectation of the score is zero. Since $p \left( x ; \theta \right)$ is a density, \begin{eqnarray*} \int p \left( x ; \theta \right) \mathrm{d} x & = & 1 \end{eqnarray*} Take derivatives on both sides: \begin{eqnarray*} \frac{\partial}{\partial \theta} \int p \left( x ; \theta \right) \mathrm{d} x & = & 0 \end{eqnarray*} Looking at the left-hand side (again interchanging the derivative and the integral), \begin{eqnarray*} \frac{\partial}{\partial \theta} \int p \left( x ; \theta \right) \mathrm{d} x & = & \int \frac{\partial p \left( x ; \theta \right)}{\partial \theta} \mathrm{d} x\\ & = & \int \frac{\frac{\partial p \left( x ; \theta \right)}{\partial \theta}}{p \left( x ; \theta \right)} p \left( x ; \theta \right) \mathrm{d} x\\ & = & \int \frac{\partial \log p \left( x ; \theta \right)}{\partial \theta} p \left( x ; \theta \right) \mathrm{d} x\\ & = & E \left[ \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \right] \end{eqnarray*} Thus the expectation of the score is zero.
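As a quick sanity check (again a standard example, not part of the original proof): for a Bernoulli($\theta$) observation, $\ell \left( \theta ; x \right) = x \log \theta + \left( 1 - x \right) \log \left( 1 - \theta \right)$, and \begin{eqnarray*} E \left[ \frac{\partial \ell \left( \theta ; X \right)}{\partial \theta} \right] & = & E \left[ \frac{X}{\theta} - \frac{1 - X}{1 - \theta} \right] = \frac{\theta}{\theta} - \frac{1 - \theta}{1 - \theta} = 0 \end{eqnarray*}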


This was a non-rigorous exposition. I recommend you follow up on these arguments in a good textbook on statistical inference. (I personally recommend the book by Casella and Berger, but there are many other excellent books.)


From Wikipedia:

[Fisher] Information may be seen to be a measure of the "curvature" of the support curve near the maximum likelihood estimate of θ. A "blunt" support curve (one with a shallow maximum) would have a low negative expected second derivative, and thus low information; while a sharp one would have a high negative expected second derivative and thus high information.

P(θ;X) is the probability mass function of the random observable X conditional on the value of θ. The Fisher information is a way of measuring the amount of information X carries about the unknown parameter θ. Thus, in light of the above quote, a strong, sharp support curve has a high negative expected second derivative, and hence, intuitively, a larger Fisher information than a blunt, shallow support curve, which expresses less information through X about θ.
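A standard example (my addition, not from the quoted text) that makes the "sharpness" quantitative: if $X \sim \mathrm{Binomial} \left( n, \theta \right)$, the Fisher information is \begin{eqnarray*} I \left( \theta \right) & = & \frac{n}{\theta \left( 1 - \theta \right)} \end{eqnarray*} so the more trials you observe, the larger the expected curvature of the support curve at its maximum, the sharper the peak, and the more information X carries about θ.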


All these are correct, but they do not explain why we need to look at the curvature (Hessian) of the log-likelihood instead of the likelihood.

Put very informally: asymptotic normality says that the distribution of the MLE around the mode gets close to the (normalized) likelihood, or mimics the curvature of the likelihood, as the number of samples approaches infinity. The shape of that distribution approaches a normal distribution centered on the mode with the same curvature as the likelihood (NOT the log-likelihood) at the mode.
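For reference (my notation, not from the original answer), the standard statement of asymptotic normality of the MLE $\hat{\theta}_n$ based on $n$ i.i.d. observations is \begin{eqnarray*} \sqrt{n} \left( \hat{\theta}_n - \theta \right) & \xrightarrow{d} & N \left( 0, \frac{1}{I \left( \theta \right)} \right) \end{eqnarray*} so the asymptotic variance of the MLE is the inverse of the Fisher information.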

I think the ideas behind asymptotic normality and the Laplace approximation are intimately related: the result is essentially a Laplace approximation around the mode of the likelihood.
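To sketch that connection (my addition): a second-order Taylor expansion of the log-likelihood $\ell \left( \theta \right)$ around the MLE $\hat{\theta}$, where the first derivative vanishes, gives \begin{eqnarray*} \ell \left( \theta \right) & \approx & \ell \left( \hat{\theta} \right) - \frac{1}{2} \left[ - \ell'' \left( \hat{\theta} \right) \right] \left( \theta - \hat{\theta} \right)^2, \qquad \text{so} \qquad L \left( \theta \right) \approx L \left( \hat{\theta} \right) \exp \left( - \frac{\left[ - \ell'' \left( \hat{\theta} \right) \right] \left( \theta - \hat{\theta} \right)^2}{2} \right) \end{eqnarray*} that is, near its mode the likelihood looks like a Gaussian whose precision is the negative Hessian of the log-likelihood at $\hat{\theta}$, which in expectation is the Fisher information.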