Understanding the countable ordinals up to $\epsilon_{0}$
The standard way to visualize $\epsilon_0$ is by the Hydra game. Here the elements of $\epsilon_0$ are visualized as isomorphism classes of rooted finite trees. The order relation can be described by the "cutting off heads" rule: the tree $T_1$ is greater than $T_2$ if there is a sequence of head cuttings which reduces $T_1$ to $T_2$. Writing out the order relation between trees directly is a pain; see my blogpost. If you turn those nested sets into trees in the obvious way, you get the Hydra game.
I am told that most people do not find it intuitive that the Hydra game ends. Once I've played a few rounds (try this applet), I find it "obvious", although writing down an actual proof is still painful.
As for an actual proof, you should directly show the following: Let $X$ be a totally ordered set. Let $\omega^X$ be the set of functions $X \to \omega$ which are $0$ for almost all $x \in X$, ordered lexicographically. Then $\omega^X$ is well-ordered. (UPDATE: There are errors in this paragraph; see David Milovich's comment below.)
So every tower of $\omega$'s is well-ordered, and $\epsilon_0$, being the union of all such towers, is also well-ordered.
By the way, you don't ask this, but you might be curious what happens when you try to write out this proof within PA. Recall that PA can't directly talk about subsets of $\omega$. The statement that $\omega$ is well-ordered is encoded as an axiom schema. Let $\phi(x, y_1, y_2, \ldots, y_N)$ be any statement with variables $x$ and $y_i$ running through $\omega$. Then PA has the following axiom: $$\forall y_1, y_2, \ldots, y_N \in \omega: \left( \exists x \in \omega : \phi(x, y_{\bullet}) \implies \exists x' \in \omega : \left( \phi(x', y_{\bullet}) \wedge \forall x \in \omega \left( \phi(x, y_{\bullet}) \implies x \geq x' \right) \right) \right).$$ Please read this axiom until you understand that, in English, it says "For all $y$'s, if there is some $x$ obeying $\phi$, then there is a least $x$ obeying $\phi$."
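For a concrete instance (my choice of $\phi$, just to illustrate the schema): take $N = 1$ and let $\phi(x, y_1)$ be the formula $x \cdot x \geq y_1$. Some such $x$ always exists, so the hypothesis $\exists x \in \omega : \phi(x, y_1)$ is automatic, and the corresponding axiom asserts that every $y_1$ has a least $x$ whose square is at least $y_1$: $$\forall y_1 \in \omega : \exists x' \in \omega \left( x' \cdot x' \geq y_1 \wedge \forall x \in \omega \left( x \cdot x \geq y_1 \implies x \geq x' \right) \right).$$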
Let's call this axiom $W(\phi, \omega)$. We'll use similar notation with $\omega$ replaced by other sets. Here is a challenging and important exercise: Let $X$ be an ordered set. Let $\phi$ be a statement about $\omega^X$, which may have other variables $y_i$ in it. Construct a specific statement $\sigma(\phi)$ about $X$, with other variables $z_i$ running through $X$, such that $$ W(\sigma(\phi), X) \implies W(\phi, \omega^X) \quad (*).$$
For every specific $\phi$, the statement $(*)$ can be proved in PA. Since $W(\psi, \omega)$ is an axiom of PA for every $\psi$, we can prove $W(\phi, \omega^{\omega^{\ldots^{\omega}}})$ in PA for any $\phi$ and any specific height of tower.
But, in order to show that $\epsilon_0$ is well-ordered, we need to show $W(\phi, \omega^{\omega^{\ldots^{\omega}}})$ simultaneously for every height of tower. Tracing through the arguments here, you would need to know $W(\phi, \omega)$, $W(\sigma(\phi), \omega)$, $W(\sigma(\sigma(\phi)), \omega)$, $W(\sigma^3(\phi), \omega)$, and so forth. As a human mathematician, that probably doesn't bother you at all. But, in the formal system PA, any proof can only use finitely many axioms. So there is no way to write a proof which uses all of the axioms $W(\sigma^k(\phi), \omega)$, for all $k$.
Of course, this doesn't show that some more clever argument couldn't prove that $\epsilon_0$ is well-ordered while working in PA; you need the Kirby-Paris theorem for that. (More precisely, Kirby-Paris plus Gödel shows that, if PA proves $\epsilon_0$ is well-ordered, then PA is inconsistent.) But I find that seeing this obstacle, the need to use infinitely many versions of the well-ordering axiom, clarifies my understanding of what is going on.
David Speyer's blog post, inspired by the MO question referenced by Francois Dorais in the comments, contains a detailed answer to this question.
In the course of writing an expository article on the consistency of arithmetic, I was led to try to explain $\epsilon_0$ in as "finitary a manner" as possible. Here is what I came up with, inspired greatly by the account in Torkel Franzen's book Inexhaustibility: A Non-Exhaustive Account, as well as the LISP programming language.
Define a list to be either an empty sequence—denoted by ( ) and referred to as the empty list—or, recursively, a finite nonempty sequence of lists. If we write a list by separating the constituent elements by commas and enclosing the entire sequence with parentheses, then for example, (( ), ( ), ( )) and ((( ), ( )), ((( ), (( ), ( ))), ( ))) are lists. The number of constituent lists is called its length (it is zero for the empty list). If $a$ is a nonempty list, then we write $a[i]$ for the $i$th constituent list of $a$, where $i$ ranges from 1 to the length of $a$.
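If it helps to have something concrete to experiment with, here is a minimal sketch (my own, not from the article) that models these lists as nested Python tuples; the names are made up for illustration.

```python
# A "list" is a (possibly empty) tuple whose entries are themselves lists.
EMPTY = ()                                    # the empty list ( )
example1 = (EMPTY, EMPTY, EMPTY)              # (( ), ( ), ( ))
example2 = ((EMPTY, EMPTY),                   # ((( ), ( )), ((( ), (( ), ( ))), ( )))
            ((EMPTY, (EMPTY, EMPTY)), EMPTY))

def length(a):
    """Number of constituent lists (zero for the empty list)."""
    return len(a)

def constituent(a, i):
    """The i-th constituent list of a nonempty list a, with 1 <= i <= length(a)."""
    return a[i - 1]                           # shift to Python's 0-based indexing
```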
Next, recursively define a total ordering $\le$ on lists as follows (it is essentially a lexicographic ordering). Let $a$ and $b$ be lists, with lengths $m$ and $n$ respectively. If $m\le n$ and $a[i] = b[i]$ for all $1\le i \le m$ (this condition is vacuously satisfied if $m=0$), then $a\le b$. Otherwise, if there is some $i$ such that $a[i] \ne b[i]$, let $i_0$ be the least such number, and declare $a< b$ if $a[i_0] < b[i_0]$ and $b < a$ if $b[i_0] < a[i_0]$. (In the one remaining case, $b$ is a proper initial segment of $a$, so $b < a$ by the first rule.)
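Continuing the tuple sketch above (again just an illustration, not code from the article), the ordering can be written as a short recursion:

```python
def list_le(a, b):
    """Decide a <= b in the lexicographic ordering on nested tuples."""
    for x, y in zip(a, b):
        if x != y:
            # x, y sit at the least position i_0 where a and b differ.
            return list_lt(x, y)
    # a and b agree up to min(m, n): a <= b exactly when a is no longer than b.
    return len(a) <= len(b)

def list_lt(a, b):
    """Strict comparison a < b."""
    return a != b and list_le(a, b)
```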
Finally, recursively define a list $a$ to be an ordinal (or more precisely, an ordinal below $\epsilon_0$, but I will just say ordinal for short) if all its constituent lists are ordinals and $a[i] \ge a[j]$ whenever $i<j$. (In particular, the empty list is an ordinal, since the condition is vacuously satisfied.)
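In the same sketch, the ordinal condition becomes a short check (using `list_le` from above; comparing adjacent constituents suffices because the ordering is transitive):

```python
def is_ordinal(a):
    """True if a is an ordinal below epsilon_0: every constituent is an
    ordinal and the constituents are in non-increasing order."""
    return (all(is_ordinal(x) for x in a)
            and all(list_le(a[i + 1], a[i]) for i in range(len(a) - 1)))
```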
As an exercise, you can verify that the ordinals ((( ))) and ((( ), ( ))) and ((( )), (( ))) are, in standard notation, the same as $\omega$, $\omega^2$, and $\omega \cdot 2$ respectively.
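With the sketches above, the exercise can at least be sanity-checked mechanically; the tuple encodings and the expected comparisons below are my reading of the notation.

```python
EMPTY = ()
omega         = ((EMPTY,),)                   # ((( )))        -- omega
omega_squared = ((EMPTY, EMPTY),)             # ((( ), ( )))   -- omega^2
omega_times_2 = ((EMPTY,), (EMPTY,))          # ((( )), (( ))) -- omega * 2

assert all(is_ordinal(a) for a in (omega, omega_squared, omega_times_2))
assert list_lt(omega, omega_times_2)          # omega < omega * 2
assert list_lt(omega_times_2, omega_squared)  # omega * 2 < omega^2
```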
My personal opinion is that the usual ways of defining $\epsilon_0$ make it seem mind-bogglingly infinitary, but the above definition makes it clear that any particular ordinal is just a special type of finite list of finite lists of finite lists of … of finite lists. Not very infinitary at all.
The proof that there is no infinite descending sequence is not particularly mind-boggling either, in this language. Rather than type it out here, I refer you to the proof of Theorem 1 in my expository article.
I find that after understanding this proof, the hard thing to wrap my head around is how it can possibly be true that PA does not prove that there is no infinite descending sequence. My current gut feeling is that PA seems weirdly weak, because it cannot even formalize a proof as simple as this one.