How was the normal distribution derived?
Suppose I throw a dart into a dartboard. I aim at the centre of the board $(0,0)$ but I'm not all that good with darts so the dart lands in a random position $(X,Y)$ which has a joint density function $f:\mathbb R^2\to\mathbb R^+$.
Let's make two assumptions about the way I play darts.
1.$\qquad$ The density is rotationally invariant so the distribution of where my dart lands only depends on the distance of the dart to the centre.
2.$\qquad$ The random variables $X$ and $Y$ are independent: how much I miss left and right makes no difference to the distribution of how much I miss up and down.
So by assumption one and Pythagoras I must be able to express the density as $$f(x,y) = g(x^2 + y^2)$$ for some function $g$.
Now, as the random variables $X$ and $Y$ are independent and identically distributed, I must be able to express $$f(x,y) \propto f(x,0)\, f(0,y),$$ since $f(x,0) = f_X(x) f_Y(0)$ and $f(0,y) = f_X(0) f_Y(y)$. Combining these assumptions, we get that for every pair $(x,y)$ we have $$g(x^2 + y^2) \propto g(x^2)g(y^2).$$
This means that $g$ must be an exponential function $$g(t) = A e^{-Bt}$$
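To spell out that step (a standard argument, assuming $g$ is continuous and strictly positive): setting $y = 0$ shows the constant of proportionality is $1/g(0)$, so writing $h(t) = g(t)/g(0)$ and $s = x^2$, $t = y^2$ turns the relation into $$h(s+t) = h(s)\,h(t) \qquad \text{for all } s,t \ge 0,$$ and the only continuous positive solutions of this Cauchy-type equation are $h(t) = e^{ct}$; hence $g(t) = g(0)\,e^{ct}$, i.e. $A = g(0)$ and $B = -c$.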
Here $A$ is some normalising constant, and $B$ somehow reflects the units I'm measuring in (since $t$ is a squared distance, measuring in cm makes $B$ $100$ times as big as measuring in mm). $B$ must be positive, because the density should be a decreasing function of distance (I'm not that bad at darts).
So to work out $A$ I need to integrate $f(\cdot,\cdot)$ over $\mathbb R^2$. A quick change to polar coordinates (writing $r$ for the distance to the centre and substituting $t = r^2$) gives $$\iint_{\mathbb R^2} f(x,y)\, dx\, dy = 2\pi\int_0^\infty r\, g(r^2)\, dr = \pi\int_0^\infty g(t)\, dt = \frac{\pi A}{B}. $$
So we should set $A = \frac{B}{\pi}$. It's convenient to choose $B$ in terms of the standard deviation, so we set $B = \frac 1{2\sigma^2}$ and $A = \frac{1}{2\pi\sigma^2}$.
So if I set $\tilde f(x) = \frac 1{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}$ then $f(x,y) = \tilde f(x)\, \tilde f(y)$.
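As a quick numerical sanity check (not part of the original answer), here is a short Python sketch, assuming NumPy and SciPy are available and picking an arbitrary $\sigma$, that with $B = \frac1{2\sigma^2}$ and $A = \frac1{2\pi\sigma^2}$ the joint density integrates to $1$, that $\tilde f$ is the usual $N(0,\sigma^2)$ density, and that $f(x,y) = \tilde f(x)\tilde f(y)$:

```python
# Numerical sanity check of the constants derived above (sigma is arbitrary).
import numpy as np
from scipy import integrate, stats

sigma = 1.7

def f_tilde(x):
    """One-dimensional density derived above."""
    return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def f_joint(x, y):
    """Joint density f(x, y) = g(x^2 + y^2) with A = 1/(2*pi*sigma^2)."""
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

# The joint density integrates to 1 (a box of width 16 sigma captures
# essentially all of the mass).
total, _ = integrate.dblquad(f_joint, -8 * sigma, 8 * sigma,
                             lambda x: -8 * sigma, lambda x: 8 * sigma)
print(total)  # ~1.0, up to numerical error

# f_tilde agrees with the usual N(0, sigma^2) density, and the joint
# density factors as f_tilde(x) * f_tilde(y).
xs = np.linspace(-5, 5, 11)
print(np.allclose(f_tilde(xs), stats.norm.pdf(xs, loc=0, scale=sigma)))  # True
print(np.allclose(f_joint(xs, 2.0), f_tilde(xs) * f_tilde(2.0)))         # True
```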
So the $e$ comes from the fact that I wanted my $X$ and $Y$ coordinates to be independent, and the $\pi$ comes from the fact that I wanted rotational invariance, so I'm integrating over a circle.
The interesting thing happens if I throw two darts. Suppose I throw my first dart aiming at $(0,0)$ and it lands at $(X_1,Y_1)$; I then aim my next dart at the first dart, so this one lands at $(X_2,Y_2)$ with $X_2 = X_1 + X$ and $Y_2 = Y_1 + Y$.
So the position of the second dart is the sum of the two errors. But this sum is still rotationally invariant and the variables $X_2$ and $Y_2$ are still independent, so $(X_2,Y_2)$ satisfies my two assumptions.
That means that when I add independent normal distributions together I get another normal distribution.
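Here is a small simulation sketch of that closure property (illustrative only; the standard deviations and sample size below are arbitrary): the sum of two independent normal errors behaves like a single normal error whose variance is the sum of the two variances.

```python
# Sum of two independent normal samples compared to N(0, sigma1^2 + sigma2^2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma1, sigma2 = 1.0, 2.5
n = 200_000

# X_2 = X_1 + X: the first throw's error plus the second, independent error.
x2 = rng.normal(0, sigma1, n) + rng.normal(0, sigma2, n)
sigma_sum = np.sqrt(sigma1**2 + sigma2**2)

print(x2.std(), sigma_sum)  # the empirical spread matches sqrt(sigma1^2 + sigma2^2)

# The empirical quantiles match those of N(0, sigma1^2 + sigma2^2).
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(x2, qs))
print(stats.norm.ppf(qs, loc=0, scale=sigma_sum))
```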
It's this property that makes the normal distribution so useful: if I take the average of a very long sequence of random variables I should get something with the same shape no matter how long my sequence is, and taking a sequence twice as long is like adding two such sequences together.
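And a sketch of that averaging point (again illustrative, with Uniform(0,1) summands and lengths chosen arbitrarily for speed): the standardized mean of a sequence looks the same whether the sequence has length $n$ or $2n$, and that common shape is the normal one.

```python
# Standardized sample means of sequences of length n and 2n have the same,
# normal, shape even though the summands here are uniform, not normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def standardized_means(n, reps=10_000):
    """Standardized sample means of `reps` sequences of `n` Uniform(0,1) draws."""
    means = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)
    return (means - means.mean()) / means.std()

means_n = standardized_means(200)
means_2n = standardized_means(400)

# All three rows of quantiles are close to one another.
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(means_n, qs))
print(np.quantile(means_2n, qs))
print(stats.norm.ppf(qs))
```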
The Normal distribution came about from approximations of the binomial distribution (de Moivre), from linear regression (Gauss), and from the central limit theorem. The derivation given by Tim relates most closely to the linear regression derivation, where the error is modelled by a Normal distribution once the errors are assumed to be symmetric about a mean and to decrease away from it. I used Tim's answer and made it a little more formal.
Theorem: Two independent, identically distributed random variables follow the distribution known as the normal distribution, provided their probability density functions (PDFs) are continuous and differentiable, symmetric about a mean, decrease towards zero away from the mean, and their joint density depends only on the distance from the mean.
Proof: Let $X$ and $Y$ be identically distributed independent random variables with continuous and differentiable PDFs. It is assumed that the PDFs are even functions, that is, $f_X(x) = f_X(-x)$, and that $f_X(x) \rightarrow 0 \text{ as } x\rightarrow \pm \infty$.
Their joint PDF, because of their independence, is $f_{XY}(x,y) = f_X(x)f_Y(y)$. Because the variables are identically distributed and symmetric, swapping $x$ and $y$ or flipping their signs leaves the joint density unchanged; by the last assumption of the theorem, the joint density in fact depends only on the distance $\sqrt{x^2+y^2}$ from the origin (taking the mean to be $0$ for now; picture circular rather than oval contours of equal probability). Therefore, there must exist a function $g$ such that $$ f_{XY}(x,y) = g(\sqrt{x^2 + y^2}), $$ which, because $g$ is not yet determined, we may just as well write as $$ f_{XY}(x,y) = g(x^2 + y^2). $$
From the definition of the joint distribution, $$ f_X(x)f_Y(y) = g(x^2 + y^2). $$ Setting $y=0$ gives $$ f_X(x) \propto g(x^2), $$ since $f_Y(0)$ is just a constant. A similar argument gives $$ f_Y(y) \propto g(y^2). $$ These last two results are significant, because substituting them back shows that the product $g(x^2)g(y^2)$ is proportional to $g(x^2 + y^2)$: $$ g(x^2)g(y^2) \propto g(x^2 + y^2). $$ The only continuous function that turns sums into products in this way is the exponential function.
This is to say, $g(r)$ will be some type of exponential, $$ g(r) = Ae^{Br}, $$ where $A$ and $B$ are constants yet to be determined. We now assume that the probability of an error decreases as we move away from the expected value: the density should be largest near the expected value and tend to zero far away from it, that is, $g(r)$ must approach $0$ as $r\rightarrow \infty$. This forces the argument of the exponential to be negative, so we write $$ g(r) = Ae^{-Br} $$ with $B > 0$.
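As a small numerical illustration (not part of the proof), one can check on a grid that an exponential $g$ does make $g(x^2)g(y^2)$ proportional to $g(x^2+y^2)$, while a non-exponential decreasing candidate does not; the two candidate functions below are chosen purely for illustration.

```python
# Only an exponential g keeps g(x^2 + y^2) / (g(x^2) * g(y^2)) constant.
import numpy as np

x, y = np.meshgrid(np.linspace(0.1, 3, 30), np.linspace(0.1, 3, 30))

def ratio(g):
    """g(x^2 + y^2) / (g(x^2) * g(y^2)) over a grid; constant iff proportional."""
    return g(x**2 + y**2) / (g(x**2) * g(y**2))

exp_ratio = ratio(lambda t: np.exp(-t))           # exponential candidate
non_exp_ratio = ratio(lambda t: 1.0 / (1.0 + t))  # a non-exponential candidate

print(exp_ratio.min(), exp_ratio.max())        # both 1.0: proportional
print(non_exp_ratio.min(), non_exp_ratio.max())  # clearly not constant
```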
Now if we return to our joint PDF, $$ f_{XY}(x,y) = f_X(x)f_Y(y) = g(x^2 + y^2), $$ we can again investigate the PDF $f_X(x)$ alone by setting $y=0$: $$ f_X(x) \propto g(x^2) = A e^{-Bx^2}. $$ Note that the mean of this distribution is $0$. In order to give it a mean of $\mu$, the distribution becomes $$ f_X(x) \propto A e^{-B(x-\mu)^2} .$$
The constants are determined from the fact that the integral of the PDF $f_{X}(x)$ over its entire domain must equal 1; equivalently, the cumulative distribution function (CDF) must approach 1 at its upper limit (probability cannot exceed 100%). $$ \int_{-\infty}^{\infty} f_{X}(x)\, dx = 1 $$ The integral of $e^{-Bx^2}$ over the real line is $\sqrt{\frac{\pi}{B}}$, thus $$ A \int_{-\infty}^{\infty} e^{-Bx^2}\, dx = A \sqrt{\frac{\pi}{B}} = 1$$ $$ A = \sqrt{\frac{B}{\pi}} $$ The constant $B$ is, for convenience, replaced by $\sigma^2 = \frac{1}{2B}$ (that is, $B = \frac{1}{2\sigma^2}$), so that $A = \frac{1}{\sqrt{2\pi\sigma^2}}$ and $$ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} ,$$ which is, of course, the common form of what is known as the Normal distribution. Note that the proportionality became an equality, which is necessary because $X$ is a random variable, and every random variable has a PDF which integrates to $1$. This proves the theorem.
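A quick numerical check of the two facts used above (the value of the Gaussian integral and the normalization of the final density); the values of $B$, $\mu$ and $\sigma$ are arbitrary, and the snippet assumes NumPy/SciPy:

```python
# Verify the Gaussian integral and the normalization of the final density.
import numpy as np
from scipy import integrate, stats

B, mu, sigma = 0.8, 1.3, 2.0

# The integral of exp(-B x^2) over the whole real line is sqrt(pi / B).
val, _ = integrate.quad(lambda x: np.exp(-B * x**2), -np.inf, np.inf)
print(val, np.sqrt(np.pi / B))  # the two numbers agree

# The final density integrates to 1 and matches SciPy's normal PDF.
f = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(integrate.quad(f, -np.inf, np.inf)[0])  # ~1.0
xs = np.linspace(-5, 8, 9)
print(np.allclose(f(xs), stats.norm.pdf(xs, mu, sigma)))  # True
```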
One will find that $\sigma^2$ is called the variance, and $\sigma$ the standard deviation. The parameters $\sigma^2$ and $\mu$, that is, the variance and the mean, may be chosen arbitrarily and uniquely determine the distribution.
It's a beautiful thing - without knowing much about the random variables, other than the fact that they are independent, and come from the same symmetric distribution, we can logically determine the distribution they come from.
To write this answer, I followed the references supplied in comments of the question, the description given by Tim, and various online resources such as Wikipedia. Therefore, the proof may not be rigorous, and may contain errors. The errors are certainly mine alone, due to misinterpretations and/or miscalculations.
This is a very old question, but there is a very interesting link where you can find the derivation of the density function of the Normal distribution. It will help in understanding the construction of the probability density function of the Normal distribution in a more lucid way.