Ideas in the elementary proof of the prime number theorem (Selberg / Erdős)
The complex-analytic proof of the prime number theorem can help inform the elementary one.
The von Mangoldt function $\Lambda$ is related to the Riemann zeta function $\zeta$ by the formula $$ \sum_n \frac{\Lambda(n)}{n^s} = -\frac{\zeta'(s)}{\zeta(s)},$$ at least in the region $\mathrm{Re}(s) > 1$. The right-hand side has a pole of residue 1 at the simple pole $s=1$ of $\zeta$, and a pole of residue $-1$ at each zero $s=\rho$ of $\zeta$ (counted with multiplicity). Applying Perron's formula carefully, one eventually arrives at the von Mangoldt explicit formula
$$ \sum_{n \leq x} \Lambda(n) = x - \sum_\rho \frac{x^\rho}{\rho} + \dots$$ where the remaining terms $\dots$ (as well as the way in which the infinite sum over zeroes is interpreted) will be ignored for this discussion. This formula strongly suggests that zeroes $\rho$ with real part strictly less than one will eventually only make a lower order contribution to $\sum_{n \leq x} \Lambda(n)$, whereas zeroes $\rho$ with real part equal to one would likely destroy the prime number theorem. In fact one can pursue this line of reasoning more carefully and show that the prime number theorem is equivalent to the absence of zeroes of zeta on the line $\mathrm{Re}(s)=1$.
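For the curious, here is a quick numerical sanity check of the Dirichlet series identity above, written as a rough Python sketch (it assumes the mpmath library; the cutoff $N = 10^5$ and the evaluation point $s=2$ are arbitrary illustrative choices, well inside the region of absolute convergence):

```python
# Illustrative sanity check (not part of the argument): the truncated Dirichlet
# series sum_{n<=N} Lambda(n)/n^s should be close to -zeta'(s)/zeta(s) at s = 2.
from math import log
import mpmath as mp

def von_mangoldt_table(N):
    """lam[n] = log p if n is a power of the prime p, else 0 (simple sieve)."""
    lam = [0.0] * (N + 1)
    is_prime = [True] * (N + 1)
    for p in range(2, N + 1):
        if is_prime[p]:
            for m in range(2 * p, N + 1, p):
                is_prime[m] = False
            q = p
            while q <= N:              # prime powers p, p^2, p^3, ...
                lam[q] = log(p)
                q *= p
    return lam

N, s = 10**5, 2
lam = von_mangoldt_table(N)
partial = sum(lam[n] / n**s for n in range(1, N + 1))
reference = -mp.diff(mp.zeta, s) / mp.zeta(s)   # -zeta'(2)/zeta(2) = 0.5699...
print(partial, reference)
```

The two printed numbers should agree to roughly four decimal places, the discrepancy being the tail $\sum_{n > N} \Lambda(n)/n^2 = O(1/N)$.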
But what if there were a way to make the contribution of the zeroes of zeta smaller than that of the pole, even when said zeroes lie on the line $\mathrm{Re}(s)=1$? One way to do this is to replace the log-derivative $-\frac{\zeta'(s)}{\zeta(s)}$ by the variant $\frac{\zeta''(s)}{\zeta(s)}$. Assuming for simplicity that all zeroes are simple, this function still has a simple pole at every zero $\rho$ (with residue $\frac{\zeta''(\rho)}{\zeta'(\rho)}$), but now has a double pole at $s=1$ (with leading Laurent coefficient $2$). As it turns out, this variant also has a Dirichlet series representation $$ \sum_n \frac{ \Lambda_2(n) }{n^s} = \frac{\zeta''(s)}{\zeta(s)}$$ which by Perron's formula suggests an asymptotic of the form $$ \sum_{n \leq x} \Lambda_2(n) = 2 x \log x + \sum_\rho \frac{\zeta''(\rho)}{\zeta'(\rho)} \frac{x^\rho}{\rho} + \dots$$ Now, even if there are zeroes on $\mathrm{Re}(s)=1$, their contribution should still be smaller than the main term, and one may now be led to predict an asymptotic of the form $$ \sum_{n \leq x} \Lambda_2(n) = 2 x \log x + O(x) \quad (4)$$ and furthermore this asymptotic should be easier to prove than the prime number theorem, as it is not reliant on preventing zeroes at the edge of the critical strip.
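As a hedged numerical illustration of this double pole (again a Python sketch assuming mpmath, with the derivatives of $\zeta$ computed numerically and the sample values of $s$ chosen arbitrarily), one can watch $(s-1)^2\, \zeta''(s)/\zeta(s)$ approach $2$ as $s \to 1^+$:

```python
# Illustrative only: near s = 1 the quantity (s-1)^2 * zeta''(s)/zeta(s) should
# approach 2, reflecting the double pole of zeta''/zeta with leading coefficient 2.
import mpmath as mp

mp.mp.dps = 40                        # extra precision for the numerical derivatives
for delta in (mp.mpf("0.1"), mp.mpf("0.01"), mp.mpf("0.001")):
    s = 1 + delta
    ratio = delta**2 * mp.diff(mp.zeta, s, 2) / mp.zeta(s)
    print(delta, ratio)               # values approach 2 as delta shrinks
```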
The functions $\Lambda_2$ and $\Lambda$ are related via the identity $$ \frac{\zeta''(s)}{\zeta(s)} = -\frac{d}{ds} \left(-\frac{\zeta'(s)}{\zeta(s)}\right) + \left(-\frac{\zeta'(s)}{\zeta(s)}\right)^2$$ which after inverting the Dirichlet series (the derivative $-\frac{d}{ds}$ corresponds to multiplying coefficients by $\log n$, and the square to Dirichlet convolution) leads to $$ \Lambda_2(n) = \Lambda(n) \log n + \sum_{d|n} \Lambda(d) \Lambda\left(\frac{n}{d}\right) \quad (5)$$ and from this and (4) we soon obtain (2) and hence (1).
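To see (5) and the prediction (4) in action, here is a small Python sketch (the cutoffs are arbitrary, and nothing here is a proof): it builds $\Lambda_2$ from $\Lambda$ via (5) and compares the partial sums $\sum_{n \leq x} \Lambda_2(n)$ with the main term $2x\log x$.

```python
# Illustrative check of (5) and (4): build Lambda_2(n) = Lambda(n) log n +
# sum_{d|n} Lambda(d) Lambda(n/d) by sieving, then compare partial sums with 2 x log x.
from math import log

def von_mangoldt_table(N):
    """lam[n] = log p if n is a power of the prime p, else 0."""
    lam = [0.0] * (N + 1)
    is_prime = [True] * (N + 1)
    for p in range(2, N + 1):
        if is_prime[p]:
            for m in range(2 * p, N + 1, p):
                is_prime[m] = False
            q = p
            while q <= N:
                lam[q] = log(p)
                q *= p
    return lam

N = 2 * 10**5
lam = von_mangoldt_table(N)
lam2 = [lam[n] * log(n) if n > 1 else 0.0 for n in range(N + 1)]  # Lambda(n) log n
for d in range(2, N + 1):                 # add the Dirichlet convolution Lambda*Lambda
    if lam[d]:
        for m in range(2 * d, N + 1, d):  # m = d*(n/d) with n/d >= 2
            lam2[m] += lam[d] * lam[m // d]

running, x_marks = 0.0, (10**4, 10**5, N)
for n in range(1, N + 1):
    running += lam2[n]
    if n in x_marks:
        print(n, running / (2 * n * log(n)))  # ratio creeps towards 1
```

The printed ratio approaches $1$ only slowly, since the $O(x)$ term in (4) is still quite visible at these heights.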
One can also think about this from a "music of the primes" standpoint. The explicit formula is morally resolving $\Lambda$ into "harmonics" as $$ \Lambda(n) \approx 1 - \sum_\rho n^{\rho-1} + \dots \quad (6)$$ where I am deliberately being vague as to how to interpret the symbols $\approx$, $\sum$, and $\dots$. The Dirichlet convolution of $1$ with itself resembles $\log n$, while the Dirichlet convolution of $-n^{\rho-1}$ with itself resembles $n^{\rho-1} \log n$; also, cross-terms such as the convolution of $1$ with $-n^{\rho-1}$ experience significant cancellation and should be lower order. Thus the right-hand side of (5) amplifies the "1" harmonic of $\Lambda$ by a factor of about $2 \log n$, while damping out the contributions of each of the zeroes $\rho$ due to cancellation between the two terms, leading one to (1) or (2) as a smoothed-out version of the prime number theorem. Conversely, (2) combined with (5) can be viewed as an elementary way to encode (the largest terms of) the von Mangoldt explicit formula, without explicit mention of zeroes.
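These convolution heuristics are easy to play with numerically. The following Python sketch (with an arbitrary sample frequency $t$, standing in for a hypothetical ordinate of a zero on the 1-line) compares the summatory functions of $1 \ast 1$, of $\log$, and of the cross-term $1 \ast n^{it}$; the first two grow like $x \log x$, while the cross-term is visibly of lower order.

```python
# Illustrative comparison of summatory functions; t is an arbitrary sample frequency,
# and n^{rho-1} with rho = 1+it is the harmonic n^{it}.
from cmath import exp
from math import log

x, t = 10**5, 7.0

# Summatory function of 1*1 (the divisor function) versus that of log n.
div_sum = sum(x // d for d in range(1, x + 1))         # sum_{n<=x} tau(n)
log_sum = sum(log(n) for n in range(2, x + 1))         # sum_{n<=x} log n
print(div_sum / (x * log(x)), log_sum / (x * log(x)))  # both of order 1 (x log x growth)

# Summatory function of the cross-term 1 * n^{it}: extra cancellation makes it smaller.
cross = sum(exp(1j * t * log(d)) * (x // d) for d in range(1, x + 1))
print(abs(cross) / (x * log(x)))       # noticeably smaller, by roughly a factor of log x
```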
The most difficult part of the Erdős–Selberg elementary proof of the PNT is the passage from (1) to the prime number theorem, as here we must actually eliminate the scenario in which there is a zero on $\mathrm{Re}(s)=1$. The key to doing so is to exploit the non-negativity of $\Lambda$; morally, the expansion (6) prevents $\rho$ (and hence also $\overline{\rho}$) from lying on $\mathrm{Re}(s)=1$, as this would make the right-hand side negative "on average" for certain sets of $n$. There are several ways to make this intuition precise; the standard complex-analytic proofs often proceed by testing $\Lambda$ against $n^{\pm it}$ and $n^{\pm 2it}$, where $1+it$ is a hypothetical zero of $\zeta$. Very roughly speaking, the idea of the elementary proof is to exploit not only the non-negativity of $\Lambda$, but also upper bounds coming from sieve-theoretic bounds such as the Brun-Titchmarsh inequality, which roughly speaking tell us that $\Lambda$ behaves "on average" as if it lies between $0$ and $2$. (Actually, the Selberg formula (2) already gives an adequate substitute for this inequality for the purposes of this argument.) On the other hand, from (6) we see that if there were a zero $\rho = 1+it$ on the line $\mathrm{Re}(s)=1$, then $\Lambda-1$ should have an extremely large correlation with $\mathrm{Re} n^{\rho-1} = \cos(t \log n)$, and these two facts can be placed in contradiction with each other, basically because the average value of $|\cos(t \log n)|$ is strictly less than one. However, because the zeroes $\rho$ are not explicitly present in the Selberg formula (2), one has to work quite a bit in the elementary proof to make this argument rigorous. In this blog post of mine I use some Banach algebra theory to tease out the zeroes $\rho=1+it$ from the Selberg formula (they can be interpreted as the "Gelfand spectrum" of a certain Banach algebra norm associated to that formula) to make this connection more explicit.
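As an aside, the fact used above that the average value of $|\cos(t \log n)|$ stays strictly below one is easy to see numerically; here is a throwaway Python sketch with an arbitrary sample value of $t$ (this is only meant to make the heuristic concrete):

```python
# Illustrative only: averages of |cos(t log n)| over n <= x stay bounded away from 1.
from math import cos, log

t = 7.0                                   # an arbitrary sample "frequency"
for x in (10**3, 10**4, 10**5, 10**6):
    avg = sum(abs(cos(t * log(n))) for n in range(1, x + 1)) / x
    print(x, round(avg, 3))               # values hover well below 1
```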
Complementing Terry's nice response, let me try to explain from a more elementary point of view why Selberg's formula (1) is natural, and why it is true.
It is natural to formulate the PNT in terms of the von Mangoldt function $\Lambda(n)$, because it is supported on the prime powers, where it is given by the simple analytic formula $\Lambda(p^r)=\log p$. In addition, $\Lambda(n)$ is intimately related to the Möbius function $\mu(n)$ and the Riemann zeta function via the equivalent formulae $$ \Lambda=\mu\ast\log,\qquad 1\ast\Lambda=\log,\qquad \sum_{n=1}^\infty\frac{\Lambda(n)}{n^s}=-\frac{\zeta'}{\zeta}(s).$$ Selberg's insight was that a natural and slightly more accessible quantity arises if we convolve $\mu$ not with $\log$, but with a power of $\log$: $$ \Lambda_k=\mu\ast\log^k,\qquad 1\ast\Lambda_k=\log^k,\qquad \sum_{n=1}^\infty\frac{\Lambda_k(n)}{n^s}=(-1)^k\frac{\zeta^{(k)}}{\zeta}(s).$$ It is a nice exercise that $\Lambda_k(n)$ is nonnegative, hence at most $\log^k(n)$, and that it is supported on numbers with at most $k$ distinct prime factors. Now (1) is essentially the same as $$ \sum_{n\leq x}\Lambda_2(n)=2x\log x+O(x), $$ i.e. an analogue of the PNT for numbers with at most two distinct prime factors (which makes the sum smoother). This explains why (1) is so natural and useful to look at. It is another matter, and a testament to the ingenuity of Selberg and Erdős, that one can extract information about the distribution of prime powers from the distribution of numbers with at most two distinct prime factors.
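If you want to convince yourself of these properties experimentally before doing the exercise, here is a small Python sketch for $k=2$ (it uses sympy for factoring; the cutoff is an arbitrary small choice, and the tolerances only absorb floating-point rounding):

```python
# Sanity check for k = 2: Lambda_k = mu * log^k is nonnegative, at most log^k(n),
# and vanishes as soon as n has more than k distinct prime factors.
from math import log
from sympy import factorint, divisors

def mu(n):
    """Moebius function via factorization."""
    f = factorint(n)
    if any(e > 1 for e in f.values()):
        return 0
    return -1 if len(f) % 2 else 1

N, k = 3000, 2
for n in range(1, N + 1):
    lam_k = sum(mu(d) * log(n // d) ** k for d in divisors(n))
    assert lam_k > -1e-9                          # Lambda_k(n) >= 0
    assert lam_k < log(n) ** k + 1e-9             # ... hence Lambda_k(n) <= log^k(n)
    if len(factorint(n)) > k:                     # more than k distinct prime factors
        assert abs(lam_k) < 1e-9                  # ... forces Lambda_k(n) = 0
print("checks passed for all n <=", N)
```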
In general, we have the following generalization of Selberg's formula (1): $$ \psi_k(x):=\sum_{n\leq x}\Lambda_k(n)=x P_k(\log x)+O(x),\qquad k\geq 2,$$ where $P_k(t)\in\mathbb{R}[t]$ is a polynomial of degree $k-1$ with leading coefficient $k$. Let me sketch a proof of this more general formula.

First, using the definitions, $$ \psi_k(x)=\sum_{n\leq x}\mu(n)L_k(x/n),$$ where $L_k$ is the summatory function of $\log^k$: $$ L_k(x):=\sum_{n\leq x}\log^k(n).$$ A key point is that $L_k(x)$ is a smooth sum, in fact by simple calculus $$ L_k(x)=xR_k(\log x)+O(\log^k(x)), $$ where $R_k(t)\in\mathbb{R}[t]$ is a polynomial of degree $k$ with leading coefficient $1$.

Now the idea is to compare $L_k$ with the summatory functions of the generalized divisor functions: $$ T_j(x):=\sum_{n\leq x}\tau_j(n),\qquad \tau_j:=1\ast\dots\ast 1. $$ These are easier to analyze than $\psi_k(x)$, in fact one can prove in an elementary way that $$ T_j(x)=xS_j(\log x)+O(x^{1-1/j}), \qquad j\geq 1,$$ where $S_j(t)\in\mathbb{R}[t]$ is a polynomial of degree $j-1$ with leading coefficient $1/(j-1)!$. Here we used the convention that $\tau_1:=1$, so that $T_1(x)=\lfloor x\rfloor=x+O(1)$ as claimed.

It follows that $L_k(x)$ can be well-approximated by a linear combination of the $T_j(x)$'s: choosing the real coefficients $a_j$ so that the polynomial $k!S_{k+1}(t)+a_kS_k(t)+\dots+a_1S_1(t)$ coincides with $R_k(t)$, we get $$ L_k(x)=k!T_{k+1}(x)+a_kT_k(x)+\dots+a_1T_1(x)+O(x^{1-1/(k+1)}).$$ Plugging this back into our formula for $\psi_k$, and noting that the error terms contribute $\sum_{n\leq x}O((x/n)^{1-1/(k+1)})=O(x)$, we get $$ \psi_k(x)=\sum_{n\leq x}\mu(n)\Bigl\{k!T_{k+1}(x/n)+a_kT_k(x/n)+\dots+a_1T_1(x/n)\Bigr\}+O(x).$$

However, for each $j\geq 2$, $$ \sum_{n\leq x}\mu(n)T_j(x/n)=\sum_{n\leq x}(\mu\ast\tau_j)(n)=\sum_{n\leq x}\tau_{j-1}(n)=T_{j-1}(x), $$ and also $$ \sum_{n\leq x}\mu(n)T_1(x/n)=\sum_{n\leq x}(\mu\ast 1)(n)=1, $$ hence in fact $$\begin{align*} \psi_k(x)&=k!T_k(x)+a_kT_{k-1}(x)+\dots +a_2 T_1(x)+a_1+O(x)\\ &=x\Bigl\{k!S_k(\log x)+a_kS_{k-1}(\log x)+\dots + a_2 S_1(\log x)\Bigr\} + O(x).\end{align*} $$

This proves the claimed approximation for $\psi_k(x)$ with the polynomial $$P_k(t):=k!S_k(t)+a_kS_{k-1}(t)+\dots+a_2 S_1(t).$$ This polynomial has real coefficients, degree $k-1$ and leading coefficient $k!/(k-1)!=k$, where we used that $k\geq 2$ throughout. The proof is complete.
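The key identity $\sum_{n\leq x}\mu(n)T_j(x/n)=T_{j-1}(x)$ is an exact statement about integers, so it can be checked directly; here is a rough Python sketch doing so for a few small values of $j$ and $x$ (the cutoffs are arbitrary):

```python
# Exact check of sum_{n<=x} mu(n) T_j(x/n) = T_{j-1}(x), where tau_j = 1*...*1
# (j-fold Dirichlet convolution) and T_j is its summatory function.
from itertools import accumulate

N = 3000

# Moebius function by sieve.
mu = [1] * (N + 1)
is_prime = [True] * (N + 1)
for p in range(2, N + 1):
    if is_prime[p]:
        for m in range(p, N + 1, p):
            if m > p:
                is_prime[m] = False
            mu[m] = -mu[m]
        for m in range(p * p, N + 1, p * p):
            mu[m] = 0

def tau_table(j, N):
    """tau_j(n) for 0 <= n <= N, via repeated Dirichlet convolution with 1."""
    t = [0] + [1] * N                 # tau_1 = 1
    for _ in range(j - 1):
        s = [0] * (N + 1)
        for d in range(1, N + 1):
            for m in range(d, N + 1, d):
                s[m] += t[m // d]
        t = s
    return t

for j in (2, 3, 4):
    Tj = list(accumulate(tau_table(j, N)))        # T_j(x) = sum_{n<=x} tau_j(n)
    Tjm1 = list(accumulate(tau_table(j - 1, N)))
    for x in (100, 1000, N):
        lhs = sum(mu[n] * Tj[x // n] for n in range(1, x + 1))
        assert lhs == Tjm1[x]                     # exact integer identity
print("identity verified for j = 2, 3, 4 up to x =", N)
```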
Two very nice answers have already been given, but I would like to add that Michel Balazard has a book in preparation on this topic (written for undergrads), in which he tries to give a deduction of the PNT from Selberg's identity that is as simple as possible. Also, a student here has a nice write-up of the proof: it is very short and elementary, but it is in French.