Why are maximum likelihood estimators used?
The principle of maximum likelihood provides a unified approach to estimating the parameters of a distribution from sample data. Although ML estimators $\hat{\theta}_n$ are not in general unbiased, they possess a number of desirable asymptotic properties (a small simulation illustrating them is sketched below):
- consistency: $\hat{\theta}_n \stackrel{p}{\to} \theta$ as $n \to \infty$,
- asymptotic normality: $\sqrt{n}\,(\hat{\theta}_n - \theta) \stackrel{d}{\to} \mathcal{N}(0, \Sigma)$, where $\Sigma^{-1}$ is the Fisher information matrix for a single observation,
- efficiency: $\operatorname{Var}(\hat{\theta}_n)$ approaches the Cramér-Rao lower bound.
Also see Michael Hardy's article "An Illuminating Counterexample" in the American Mathematical Monthly (AMM) for examples in which biased estimators prove superior to unbiased ones.
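To make these properties concrete, here is a minimal simulation sketch (my own illustrative code, not part of the original answer), assuming an exponential model with rate $\theta$, for which the MLE is $\hat{\theta}_n = 1/\bar{X}_n$ and the per-observation Fisher information is $I(\theta) = 1/\theta^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0  # true rate of the exponential model (illustrative choice)

for n in (100, 1_000, 10_000):
    # 1000 Monte Carlo replications: draw a sample of size n, compute the MLE 1/mean
    samples = rng.exponential(scale=1.0 / theta, size=(1_000, n))
    mle = 1.0 / samples.mean(axis=1)

    # Cramer-Rao lower bound 1 / (n I(theta)) with I(theta) = 1/theta^2
    crlb = theta**2 / n
    print(f"n={n:6d}  mean(MLE)={mle.mean():.4f}  var(MLE)={mle.var():.6f}  CRLB={crlb:.6f}")
```

As $n$ grows, the replicated MLEs concentrate around $\theta = 2$ (consistency), their small finite-sample bias disappears, and their variance approaches the Cramér-Rao bound $\theta^2/n$ (efficiency).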
Added:
The above asymptotic properties hold under certain regularity conditions. Consistency holds if:
- the parameters identify the model (this ensures the existence of a unique global maximum of the expected log-likelihood function; see the toy sketch after this list),
- the parameter space of the model is compact,
- the log-likelihood function is continuous in the parameters for almost all $x$,
- the log-likelihood is dominated by an integrable function for all values of the parameters.
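To see why identification matters, here is a toy sketch (an illustrative example of mine, not from the original answer): if two parameters enter the likelihood only through their sum, the log-likelihood has a flat ridge of maximizers, and a numerical MLE lands on a different $(a, b)$ depending on the starting point, even though the identified quantity $a + b$ is always recovered.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=500)  # data with true mean a + b = 3

def neg_loglik(params):
    # N(a + b, 1) model: only the sum a + b enters the likelihood,
    # so a and b are not separately identified
    a, b = params
    return 0.5 * np.sum((x - (a + b)) ** 2)

for start in ([0.0, 0.0], [10.0, -5.0], [-2.0, 4.0]):
    res = minimize(neg_loglik, start, method="Nelder-Mead")
    a_hat, b_hat = res.x
    print(f"start={start}  a_hat={a_hat:.3f}  b_hat={b_hat:.3f}  sum={a_hat + b_hat:.3f}")
# each start converges to a different point on the ridge a + b = sample mean,
# while the identified quantity a + b is recovered every time
```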
Asymptotic normality holds if:
- the true parameter lies in the interior of the parameter space, away from its boundary,
- the support of the distribution does not depend on the parameters $\theta$ (a classic counterexample is sketched after this list),
- the number of nuisance parameters does not depend on the sample size.
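A standard counterexample for the second condition: if $X_1, \dots, X_n$ are i.i.d. $\mathrm{Uniform}(0, \theta)$, the support $(0, \theta)$ depends on the parameter and the MLE is $\hat{\theta}_n = \max_i X_i$. For fixed $t > 0$,
$$\Pr\big(n(\theta - \hat{\theta}_n) > t\big) = \left(1 - \frac{t}{n\theta}\right)^n \longrightarrow e^{-t/\theta},$$
so the estimation error is of order $1/n$ and the limiting distribution is exponential rather than normal.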
Unbiasedness is overrated by non-statisticians. Sometimes unbiasedness is a very bad thing. Here's a paper I wrote showing an example in which use of an unbiased estimator is disastrous, whereas the MLE is merely bad, and a Bayesian estimator that's more biased than the MLE is good.
Direct link to the pdf file: http://arxiv.org/pdf/math/0206006.pdf
(Now I see Sasha already cited this paper.)