Why does maximum likelihood estimation work the way that it does?
Suppose you are trying to maximize the likelihood of i.i.d. data $x_1, x_2, \ldots, x_n$ drawn from a distribution with p.d.f. $f(x | \theta)$ and parameter vector $\theta$.
The joint density of the data given $\theta$ is $$ f(x_1, x_2, \ldots, x_n | \theta) = \prod_i f(x_i | \theta) $$.
The goal is to find the $\theta$ which maximizes the joint density of $x_1, x_2, \ldots, x_n$. Remember, we don't know $\theta$ yet, but we can begin the process of estimating it by defining the likelihood function $$ l(\theta|x_1, x_2, \ldots, x_n) = \prod_i f(x_i | \theta)$$.
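To make the likelihood function concrete, here is a minimal sketch (not from the discussion above) that evaluates $l(\theta|x_1, \ldots, x_n)$ for a made-up sample, assuming a normal density with unknown mean $\theta$ and known unit variance; the sample values and the grid of candidate $\theta$ values are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical i.i.d. sample, assumed to come from N(theta, 1) with theta unknown.
x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])

def likelihood(theta, data, sigma=1.0):
    """Joint likelihood: the product of the individual densities f(x_i | theta)."""
    return np.prod(norm.pdf(data, loc=theta, scale=sigma))

# Evaluate the likelihood on a grid of candidate theta values;
# it peaks near the sample mean (1.28 for this made-up sample).
for theta in (0.5, 1.0, 1.28, 1.5, 2.0):
    print(f"theta = {theta:.2f}  likelihood = {likelihood(theta, x):.6g}")
```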
If you imagine the likelihood function as a unimodal p.d.f., the point which maximizes the likelihood of the data is at the very peak of the hump. We can use calculus to find this point because an interior maximum of a smooth function has two properties: (1) its first derivative is zero and (2) its second derivative is negative, or, $$\frac{\partial}{\partial \theta} l(\theta|x_1, x_2, \ldots, x_n) = 0 $$ and $$ \frac{\partial^2}{\partial \theta^2} l(\theta|x_1, x_2, \ldots, x_n) < 0 $$.
Solving the first-order condition for $\theta$ yields a functional form for an estimator that maximizes the joint likelihood of the data $x_1, x_2, \ldots, x_n$.
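As a standard worked example (not part of the original discussion), take i.i.d. data from an exponential density $f(x|\lambda) = \lambda e^{-\lambda x}$. The likelihood is $$ l(\lambda|x_1, \ldots, x_n) = \prod_i \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_i x_i} $$, and because the logarithm is monotone we may equivalently maximize $\log l(\lambda) = n \log \lambda - \lambda \sum_i x_i$. The first-order condition $\frac{n}{\lambda} - \sum_i x_i = 0$ gives the estimator $\hat{\lambda} = n / \sum_i x_i = 1/\bar{x}$, and the second derivative $-n/\lambda^2$ is negative there, confirming a maximum.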
It is worth noting that these are sufficient conditions only for a local maximum. If the likelihood function is not unimodal as we imagined, then we cannot guarantee that our estimate is the maximum likelihood estimator. In general, numerical methods can be used to explore the likelihood function for a global maximum when one is not obvious analytically.
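As a rough sketch of that numerical approach (the distribution, simulated data, and starting points below are assumptions chosen for illustration, not part of the original discussion), one common recipe is to minimize the negative log-likelihood with a general-purpose optimizer from several starting points.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

# Hypothetical i.i.d. sample, simulated here from a Gamma(shape=2, scale=1.5) distribution.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=200)

def neg_log_likelihood(params, data):
    """Negative log-likelihood; minimizing this maximizes the likelihood."""
    shape, scale = params
    if shape <= 0 or scale <= 0:  # keep the optimizer inside the valid parameter region
        return np.inf
    return -np.sum(gamma.logpdf(data, a=shape, scale=scale))

# Restart from several starting points as a simple guard against local optima.
best = None
for start in ([1.0, 1.0], [3.0, 0.5], [0.5, 3.0]):
    result = minimize(neg_log_likelihood, start, args=(x,), method="Nelder-Mead")
    if best is None or result.fun < best.fun:
        best = result

print("estimated (shape, scale):", best.x)
```

Restarting from multiple points is a cheap guard against stopping at a local optimum, though it still does not guarantee that the global maximum has been found.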