Cayley-Hamilton theorem on square matrices
There are many, many ways to prove the Cayley-Hamilton theorem, and many of them have something to offer in the way of simplicity, generality, or insight into why such a result could be expected to hold in the first place. For instance, the case of distinct eigenvalues, which you could prove directly, can be seen to suffice, once one realises that $p_{A}(A)=0\in\mathrm{Mat}_n(\mathbf C)$ is just a (tremendously complicated) system of $n^2$ polynomial equations in the entries of $A$, each of which will hold everywhere once it holds on the dense subset of matrices with $n$ distinct eigenvalues.
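As a purely numerical illustration of that viewpoint (a minimal sketch using numpy; the variable names are mine and this is a spot check, not a proof), one can evaluate $p_A(A)$ for a random complex matrix and see the $n^2$ equations hold up to round-off:

```python
import numpy as np

# A random complex matrix; p_A(A) = 0 is a system of n^2 polynomial
# identities in its entries, so it can be checked at any sample point.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# Coefficients of the characteristic polynomial det(xI - A),
# highest power first (np.poly computes them from the eigenvalues).
coeffs = np.poly(A)

# Evaluate p_A(A) by Horner's scheme applied to matrices.
P = np.zeros_like(A)
for c in coeffs:
    P = P @ A + c * np.eye(n)

print(np.linalg.norm(P))  # essentially zero (floating-point round-off only)
```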
Personally I find the following approach leads most naturally to the Cayley-Hamilton theorem, i.e., it emerges quite logically from general considerations that are anyway central to eigenvalue problems, rather than being a claim coming out of the blue.
Certainly (by finite dimensionality of the matrix space) there exist nonzero polynomials $P$ with $P[A]=0$, and by a standard Euclidean division argument there is a unique monic polynomial $\mu_A$ such that this happens exactly when $P$ is a (polynomial) multiple of $\mu_A$, the minimal polynomial of $A$. An easy but crucial observation is $(*)$: for any polynomial $D$ dividing $\mu_A$, the minimal polynomial of the restriction of (the linear map defined by) $A$ to the subspace that is the image of $D[A]$ is given by $\mu_A/D$. To see this, note that for any polynomial$~Q$ the restriction of$~Q[A]$ to the image of $D[A]$ is zero if and only if $Q[A]\circ D[A]=(QD)[A]$ is zero on the whole space.
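The existence argument is easy to make computational: $I, A, A^2, \ldots, A^{n^2}$ are $n^2+1$ vectors in an $n^2$-dimensional space, so some dependence among them must occur, and the first one gives $\mu_A$. Here is a small exact-arithmetic sketch in sympy; the helper `minimal_polynomial_of` is my own illustrative function, not a library routine:

```python
import sympy as sp

def minimal_polynomial_of(A):
    """Minimal polynomial of a square sympy Matrix A, found as the first
    linear dependence among the powers I, A, A^2, ..."""
    n = A.shape[0]
    x = sp.symbols('x')
    powers = [sp.eye(n)]                   # A^0 = I
    for d in range(1, n * n + 2):
        powers.append(powers[-1] * A)      # A^d
        # Stack vec(A^0), ..., vec(A^d) as columns; a null vector of this
        # matrix is a polynomial relation among the powers.
        M = sp.Matrix.hstack(*[p.reshape(n * n, 1) for p in powers])
        null = M.nullspace()
        if null:
            v = null[0] / null[0][d]       # normalise: monic of degree d
            return sp.Poly(sum(v[k] * x**k for k in range(d + 1)), x)

A = sp.Matrix([[2, 1, 0], [0, 2, 0], [0, 0, 2]])
print(minimal_polynomial_of(A))   # (x - 2)**2, a proper divisor of p_A = (x - 2)**3
```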
Every eigenvalue$~\lambda$ of$~A$ is a root of $\mu_A$, since applying $\mu_A(A)=0$ to an eigenvector multiplies it by $\mu_A(\lambda)$. From $(*)$ it follows that conversely every root$~\lambda$ of $\mu_A$ is an eigenvalue of$~A$ (the polynomial $D=X-\lambda$ divides $\mu_A$, so $D(A)=A-\lambda I_n$ is not surjective, as the restriction of$~A$ to its image has minimal polynomial $\mu_A/D\neq\mu_A$). If $m$ is the multiplicity of $X-\lambda$ as (irreducible) factor of $\mu_A$, then the kernel of $(A-\lambda I_n)^m$ is by definition the characteristic subspace (or generalised eigenspace) for$~\lambda$ of$~A$. The restriction of$~A$ to the image of $(A-\lambda I_n)^m$ has no eigenvalue $\lambda$ (its minimal polynomial $\mu_A/(X-\lambda)^m$ has no root$~\lambda$), so that image meets the characteristic subspace for $\lambda$ only in $0$; by the rank-nullity theorem their dimensions add up to $n$, so they form a direct sum decomposition of the whole space. By induction on the number of eigenvalues, the space then decomposes as a direct sum of characteristic subspaces, provided $\mu_A$ splits into linear factors, as certainly happens over an algebraically closed field like the complex numbers.
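A tiny worked instance of these notions (my own toy example): for
$$A=\begin{pmatrix}\lambda&1&0\\0&\lambda&0\\0&0&\lambda\end{pmatrix}$$
one has $A-\lambda I_3\neq0$ but $(A-\lambda I_3)^2=0$, so $\mu_A=(X-\lambda)^2$ and $m=2$; the characteristic subspace $\ker(A-\lambda I_3)^2$ is the whole space and the image of $(A-\lambda I_3)^2$ is $\{0\}$, so the direct sum decomposition is trivial here, there being only one eigenvalue.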
All this is standard stuff about the minimal polynomial. Visibly it has the same roots as the characteristic polynomial $p_A$ (namely all eigenvalues), and it is natural to compare the multiplicities. The multiplicity of $\lambda$ as a root of $\mu_A$ is the number $m$ above, and its multiplicity as a root of $p_A$ is the dimension of the characteristic subspace for $\lambda$. By $(*)$, the powers $(A-\lambda I_n)^i$ for $i=0,1,\ldots,m$ must all have distinct (ever smaller) images, and therefore also all have distinct (ever larger) kernels, so the dimension of the largest of these kernels, the characteristic subspace for $\lambda$, is at least $m$. This means that (assuming an algebraically closed field) every factor $X-\lambda$ occurs in $p_A$ at least as often as in $\mu_A$, so the characteristic polynomial is a multiple of $\mu_A$; thus $p_A(A)=0$, the Cayley-Hamilton theorem.
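To see the two multiplicities side by side in a concrete case, here is a small sympy sketch with a toy matrix of my own choosing, whose minimal polynomial is $(X-2)^2(X-5)$ while its characteristic polynomial is $(X-2)^3(X-5)$:

```python
import sympy as sp

x = sp.symbols('x')

# Toy 4x4 matrix with mu_A = (x - 2)**2 * (x - 5)  (illustrative choice only).
A = sp.Matrix([[2, 1, 0, 0],
               [0, 2, 0, 0],
               [0, 0, 2, 0],
               [0, 0, 0, 5]])
n = A.shape[0]

print(sp.factor(A.charpoly(x).as_expr()))        # (x - 2)**3 * (x - 5)

for lam, m in [(2, 2), (5, 1)]:                  # multiplicity m of X - lam in mu_A
    # Dimension of the characteristic subspace = multiplicity of lam in p_A.
    dim = len(((A - lam * sp.eye(n)) ** m).nullspace())
    print('lambda =', lam, ': dim of characteristic subspace =', dim, '>= m =', m)
```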
The algebraically closed field case can be seen to imply the case over arbitrary fields, since neither the characteristic polynomial, nor the minimal polynomial, nor the divisibility relation between them changes when the field is extended. If you don't like this flight into algebraically closed fields (in fact just a splitting field of the minimal polynomial will suffice), there is an alternative approach that works with cyclic subspaces (spanned by the iterated images of a single nonzero vector) instead of eigenspaces. This leads to a basic case involving companion matrices, and an induction argument; see this answer.
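For reference, the base case of that induction is the standard fact that for a monic polynomial $p=X^n+c_{n-1}X^{n-1}+\cdots+c_1X+c_0$ its companion matrix
$$C_p=\begin{pmatrix}0&0&\cdots&0&-c_0\\1&0&\cdots&0&-c_1\\0&1&\cdots&0&-c_2\\\vdots&&\ddots&\vdots&\vdots\\0&0&\cdots&1&-c_{n-1}\end{pmatrix}$$
has both characteristic and minimal polynomial equal to $p$, so the Cayley-Hamilton theorem is immediate for it.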
For a complex matrix $A,$ the easiest proof I know is to note that the result is true for upper triangular matrices, and then to note that there is a unitary matrix $U$ such that $U^{-1}AU$ is upper triangular. To see the second claim, note that $A$ has an eigenvector $u$ of length $1,$ complete $\{ u \}$ to an orthonormal basis, and proceed by induction. The first claim is a direct computation, as the characteristic polynomial of an upper triangular matrix can be read off from its diagonal.
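A minimal numerical sketch of this route (variable names are mine; scipy's `schur` provides the unitary triangularization $A=UTU^{-1}$):

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# Complex Schur form: A = U T U^H with U unitary and T upper triangular.
T, U = schur(A, output='complex')

# For upper triangular T the characteristic polynomial is visible on the
# diagonal, and the product of the factors (T - t_ii I) annihilates T ...
P = np.eye(n, dtype=complex)
for t in np.diag(T):
    P = P @ (T - t * np.eye(n))
print(np.linalg.norm(P))                     # essentially zero (round-off only)

# ... and conjugating back gives p_A(A) = 0 for A itself.
print(np.linalg.norm(U @ P @ U.conj().T))    # also essentially zero
```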
For any diagonal matrix $D$, it is clear that $p_D(D)=0$. Therefore, for any $P \in GL_n(\mathbb{C})$, $p_{PDP^{-1}}(PDP^{-1})=p_D(PDP^{-1})=Pp_D(D)P^{-1}=0$: the Cayley-Hamilton theorem holds for diagonalizable matrices. Moreover:
Lemma: the set of diagonalizable matrices is dense in $M_n(\mathbb{C})$.
Proof: Let $M \in M_n(\mathbb{C})$. For some $P \in GL_n(\mathbb{C})$, $M=PT(\lambda_1,\dots,\lambda_n)P^{-1}$ where $T(\lambda_1,\dots,\lambda_n)$ is an upper triangular matrix with diagonal $(\lambda_1,\dots,\lambda_n)$. Choose $\epsilon_{ij}$ such that $\lambda_i+\epsilon_{ij} \neq \lambda_k + \epsilon_{kj}$ for $i \neq k$ and $\epsilon_{ij} \underset{j \to + \infty}{\longrightarrow} 0$. Then $PT(\lambda_1+ \epsilon_{1j},\dots,\lambda_n+ \epsilon_{nj})P^{-1}$ is diagonalizable (all its eigenvalues are distinct) and converges to $M$ as $j \to +\infty$. $\square$
Now the coefficients of $p_M$ are polynomial, hence continuous, functions of the entries of $M$ (because $\det$ is), so the map $M \mapsto p_M(M)$ is continuous; therefore if $M \in M_n(\mathbb{C})$ and $(M_k)$ is a sequence of diagonalizable matrices converging to $M$, then $p_M(M)=\lim\limits_{k \to + \infty} p_{M_k}(M_k)=0$.
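A small numerical sketch of this limiting argument (my own toy example in numpy; the Jordan block below is not diagonalizable, so the perturbation is genuinely needed):

```python
import numpy as np

def char_poly_at(A, M):
    """Evaluate p_A(M), using the coefficients np.poly extracts from A."""
    n = A.shape[0]
    P = np.zeros((n, n), dtype=complex)
    for c in np.poly(A):
        P = P @ M + c * np.eye(n)
    return P

# M is a Jordan block: not diagonalizable, so the lemma is really used here.
M = np.array([[2.0, 1.0],
              [0.0, 2.0]])

for j in range(1, 6):
    eps = 10.0 ** (-j)
    # Perturb the diagonal by distinct small amounts: distinct eigenvalues,
    # hence diagonalizable, and M_j -> M as eps -> 0.
    M_j = M + np.diag([eps, 2 * eps])
    print(eps, np.linalg.norm(char_poly_at(M_j, M_j)))  # zero up to rounding

print(np.linalg.norm(char_poly_at(M, M)))  # the limit: p_M(M) = 0 as well
```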