The Lagrangian in Scalar Field Theory
I have a slightly different perspective from the other two answers which provides a more elementary motivation. Suppose you know nothing about renormalizability or energy-momentum relations and all you know is that a Lagrangian density is a function of fields and their derivatives that transforms as a scalar under Poincaré transformations.
You can motivate the Klein-Gordon equation by asking what is the simplest Lagrangian you can write down for a scalar field that transforms as a scalar and provides a positive-definite Hamiltonian.
Since we're dealing with scalar fields any polynomial function of the fields $\phi$ will satisfy the correct Lorentz transformation property. So you could write down a term like $a\phi+b\phi^2$ with real constants $a$ and $b$. Now we also want to include derivatives $\partial_\mu\phi$. In order to satisfy the correct Lorentz transformation properties we need to contract this with a term $\partial^\mu\phi$.
So the simplest Lagrangian we can write down is $\mathcal{L}=c\partial_\mu\phi\partial^\mu\phi+a\phi+b\phi^2$ from which we obtain a Hamiltonian
$\mathcal{H}=\frac{\pi^2}{4c}+c\partial_i\phi\partial_i\phi-a\phi-b\phi^2$
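As a quick check of that Hamiltonian, here is the Legendre transform written out. It assumes the mostly-minus metric, so that $\partial_\mu\phi\partial^\mu\phi=\dot\phi^2-\partial_i\phi\partial_i\phi$, and uses $\pi$ for the momentum conjugate to $\phi$:

$$
\begin{aligned}
\pi &= \frac{\partial\mathcal{L}}{\partial\dot\phi} = 2c\,\dot\phi, \\
\mathcal{H} &= \pi\dot\phi-\mathcal{L} = \frac{\pi^2}{2c}-\frac{\pi^2}{4c}+c\,\partial_i\phi\partial_i\phi-a\phi-b\phi^2 \\
&= \frac{\pi^2}{4c}+c\,\partial_i\phi\partial_i\phi-a\phi-b\phi^2 .
\end{aligned}
$$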
The $a\phi$ term is not nice since it can take either sign and so ruins the positive-definiteness of the Hamiltonian, so set $a=0$. In four dimensions (with $\hbar=c=1$) a scalar field has dimension $[mass]$ and each derivative adds one more power, so $\partial_\mu\phi$ has dimension $[mass]^2$. The Lagrangian density has dimension $[mass]^4$, so $c$ should be dimensionless and $b$ should have dimension $[mass]^2$. Write $b=-m^2$, where $m$ has units of mass and the minus sign is there to make the Hamiltonian positive-definite.
So we've reduced our Hamiltonian to
$\mathcal{H}=\frac{\pi^2}{4c}+c\partial_i\phi\partial_i\phi+m^2\phi^2$
Setting $c=1/2$ and rescaling $m^2\rightarrow m^2/2$ means the coefficients of all terms are the same.
Hence the Lagrangian density is $\mathcal{L}=\frac{1}{2}\partial_\mu\phi\partial^\mu\phi-\frac{1}{2}m^2\phi^2$
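For completeness, here is a short sketch of how this Lagrangian gives the Klein-Gordon equation. It uses the standard Euler-Lagrange equation for a field and the mostly-minus metric; the plane-wave ansatz with frequency $\omega$ and wave vector $\mathbf{k}$ is introduced here only for illustration:

$$
\begin{aligned}
\partial_\mu\frac{\partial\mathcal{L}}{\partial(\partial_\mu\phi)}-\frac{\partial\mathcal{L}}{\partial\phi} &= \partial_\mu\partial^\mu\phi+m^2\phi = 0, \\
\phi\propto e^{-i(\omega t-\mathbf{k}\cdot\mathbf{x})} \quad &\Rightarrow \quad \omega^2=|\mathbf{k}|^2+m^2 .
\end{aligned}
$$

So even without assuming the energy-momentum relation at the start, it comes out of the field equation once you look at plane-wave solutions.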
Things you could try to argue against this being the simplest scalar field Lagrangian:
- If the point is simplicity, why not just ignore the derivative terms and write a Lagrangian for a scalar field as $\mathcal{L}=-m^2\phi^2$? Because if you ignore the derivative terms the field equation is $\phi=0$, and who cares about that? Ignoring the derivatives results in a non-dynamical field. So the Klein-Gordon Lagrangian is the simplest you can write down where something actually happens.
Of course, you get a simpler valid Lagrangian by setting $m=0$, but this isn't done as books want to show the energy-momentum relation in a general setting when you quantize the field. However, you can start with the massless case in 5 dimensions and perform dimensional reduction to obtain the massive case in 4 dimensions.
- Why ignore the possibility of field-derivative interaction terms? You can do this, but the goal is simplicity, and the simplest term coupling the field to its derivatives, transforming correctly and yielding a positive definite Hamiltonian is $\phi\phi\partial_\mu\phi\partial^\mu\phi$, which is much more complicated than our other terms.
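The dimensional reduction mentioned in the first point can be sketched as follows. This assumes the fifth coordinate $y$ is spacelike and that the five-dimensional field $\Phi$ depends on it through a single mode; the names $\Phi$, $y$ and the $\cos(my)$ ansatz are illustrative choices, not something fixed by the argument above:

$$
\begin{aligned}
\partial_\mu\partial^\mu\Phi-\partial_y^2\Phi &= 0, \qquad \Phi(x,y)=\phi(x)\cos(my), \\
-\partial_y^2\Phi = m^2\Phi \quad &\Rightarrow \quad \left(\partial_\mu\partial^\mu+m^2\right)\phi(x)=0 .
\end{aligned}
$$

The momentum along the extra direction plays the role of the four-dimensional mass.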
A reasonable motivation for that Lagrangian can be found by making a classical analogy. The kinetic energy in classical physics is proportional to the square of the rate of change of the position with time, so: $$ T=\frac{1}{2}(\partial_0 \phi)^2 $$ But in order to get a Lorentz invariant Lagrangian we must add: $$ -\frac{1}{2}\sum_i{(\partial_i \phi)^2} $$ Adding and using the Minkowski metric: $$ \frac{1}{2}\eta^{\mu\nu}\partial_{\mu} \phi\partial_{\nu} \phi $$
Now, suppose the equilibrium value of the field is $\phi=0$. For a simple harmonic oscillator with equilibrium position $x=0$ the potential energy goes like $\sim x^2$. If we want the field to prefer its equilibrium state, then this must be encoded in the potential and the simplest one which does so is harmonic: $$ V=\frac{1}{2}m^2\phi^2 $$
Combining both terms, we have: $$ \mathcal{L}=\frac{1}{2}\eta^{\mu\nu}\partial_{\mu} \phi\partial_{\nu} \phi-\frac{1}{2}m^2\phi^2 $$
Source: B. Zwiebach, "A First Course in String Theory", chapter 10.2
The standard motivation, as QuantumDot explained, is to reproduce the energy-momentum relation of relativity from the scalar field. But there is an independent argument for this which comes from statistical mechanics.
Consider a statistical mechanical partition function for a field defined on a very fine lattice. In general, the equilibrium will only allow local fluctuations, so that the field will have a probability distribution at any lattice point which is locally independent of the fluctuations at any other distant lattice point. In this case, you can look at the lattice on coarse distance scales, define an average field $\phi$ over many lattice spacings, and write the partition function as a product of independent partition functions at each point of the coarse lattice:
$$ Z = \prod \int e^{-V(\phi)} d\phi$$
And since the coarse lattice field is the average of the fine lattice field, you know from the central limit theorem that the distribution will be Gaussian:
$$ Z = \int e^{-\int {1\over 2}m^2 \phi^2} D\phi$$
The last line is a path integral, a partition-function-like sum over all configurations, and the identity that guarantees that it reproduces independent local fluctuations is just the same thing that tells you that two independent systems have a partition function that multiplies (or a free energy which adds up).
This sort of thing is very boring, and it is the typical situation in statistical or quantum fields: totally local independent fluctuations, what is often called an ultra-local field. This is not something which we would observe as a dynamical thing, since we live at scales much bigger than any graininess scale.
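To make the factorization concrete, here is a sketch on a coarse lattice with sites $x$ and spacing $a$ in $d$ dimensions. It takes the weight per site to be a decaying Gaussian so that the integrals converge; the lattice discretization and the symbols $a$, $d$ and $\phi_x$ are assumptions made for illustration:

$$
\begin{aligned}
Z &= \prod_x \int d\phi_x\, e^{-\frac{1}{2}m^2a^d\phi_x^2} = \prod_x \sqrt{\frac{2\pi}{m^2a^d}}, \\
\langle\phi_x\phi_y\rangle &= \frac{\delta_{xy}}{m^2a^d} .
\end{aligned}
$$

The correlation vanishes between different sites, which is what it means for the fluctuations to be independent from point to point.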
So consider what would happen if we tune the fluctuations to be large. This requires fine-tuning the effective Gaussian fluctuation parameter $m^2$ to 0. This is not particularly hard to imagine, because you use the central limit theorem to get the $m^2$ behavior: you can imagine that the microscopic potential is really of the form $m^2 \phi^2 + \lambda\phi^4$, and then if you fine tune the $m^2$ to be a special value, you will change the stability of the $\phi=0$ point. This is a critical point in statistical mechanics.
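At the crudest level, ignoring fluctuations (a mean-field assumption that goes beyond what the argument here needs) and taking $\lambda>0$, the change of stability can be read off from the potential directly:

$$
\begin{aligned}
V(\phi) &= m^2\phi^2+\lambda\phi^4, \qquad V'(\phi)=2\phi\left(m^2+2\lambda\phi^2\right), \qquad V''(0)=2m^2, \\
m^2>0 &: \ \phi=0 \text{ is the only minimum}; \qquad m^2<0: \ \phi=0 \text{ is unstable, with minima at } \phi^2=-\frac{m^2}{2\lambda} .
\end{aligned}
$$

In this approximation the special value is $m^2=0$. With fluctuations included the critical value of the microscopic parameter shifts, but the qualitative picture is the same.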
Now if you look at long distances, you expect that the free energy of field configurations on a coarse grained lattice should go like:
$$ \int F(\nabla\phi, \phi) $$
Where you expand out only the most important derivative terms. If you write it as a series, assuming $\phi\rightarrow -\phi$ symmetry:
$$ \int |\nabla \phi |^2 + t \phi^2 + \lambda \phi^4 + g \phi^6 + h |\nabla^2\phi|^2 $$
Then you can convince yourself that under rescaling, keeping the coefficient of the first term fixed, only the first 3 terms matter in dimension 4 or less. This is saying that if you normalize the field fluctuations so that the leading derivative correlation coefficient determines the scale of the field, only the quadratic and quartic term are renormalizable, only these contribute to long-distance correlations.
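The rescaling statement can be checked with naive power counting. The sketch below is a minimal illustration, not a renormalization group calculation: it fixes the dimension of $\phi$ from the $|\nabla\phi|^2$ term and then reads off the mass dimension of each coefficient. The function names and the `TERMS` table are hypothetical, introduced only for this example.

```python
from fractions import Fraction

def field_dimension(d):
    """Mass dimension of phi in d dimensions, fixed by keeping |grad phi|^2 normalized."""
    return Fraction(d - 2, 2)

def coupling_dimension(d, n_fields, n_derivs):
    """Mass dimension of the coefficient of a term with n_fields powers of phi
    and n_derivs derivatives, so that the integrated term is dimensionless."""
    return d - n_derivs - n_fields * field_dimension(d)

def classify(dim):
    """Positive dimension grows at long distances, negative dimension dies away."""
    if dim > 0:
        return "relevant"
    if dim == 0:
        return "marginal"
    return "irrelevant"

# (powers of phi, number of derivatives) for each term after the kinetic one
TERMS = {
    "t phi^2": (2, 0),
    "lambda phi^4": (4, 0),
    "g phi^6": (6, 0),
    "h |lap phi|^2": (2, 4),
}

def power_counting(d):
    """Classify every term in TERMS by naive power counting in d dimensions."""
    return {name: classify(coupling_dimension(d, nf, nd))
            for name, (nf, nd) in TERMS.items()}

if __name__ == "__main__":
    for name, verdict in power_counting(4).items():
        print(name, verdict)
```

In $d=4$ this counting leaves $t$ relevant and $\lambda$ marginal, while $g$ and $h$ come out irrelevant. Exactly at $d=3$ the same naive counting makes $g\phi^6$ marginal, so deciding its fate there needs more than dimensional analysis.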
The Feynman path integral justifies why this sort of reasoning has anything to do with quantum mechanics. Any bosonic field theory with a time-reversal invariant action analytically continues to a statistical field, and this statistical field is a long-wavelength limit of some short-distance thing. The classification of the possible theories is then by the generalization of the central limit theorem that tells you that the ultralocal field is the most common case.
For chiral fermions and gauge fields, you don't even need fine-tuning of a parameter to have a fluctuating limit. The gauge fields keep a fluctuating limit by gauge invariance, and the chiral fermions by the fact that they can't make a mass without pairing up. These are the ingredients of the standard model.
The justification for the Lagrangians in field theory ultimately comes from renormalizability, but this is difficult because a rigorous theory is lacking. One can justify them also by asking for a theory where you have a finite number of fundamental particles of given spin and mass, which for spin 0 reproduces the non-interacting ($\lambda=0$) version of this argument. This is a somewhat complementary argument because unitarity imposes stronger constraints on the form of the quantum Lagrangian than just being renormalizable, so it is good to know both chains of reasoning.