Secretary problem - why is the optimal solution optimal?
Since this is all or nothing, there is no point in selecting a candidate who is not best so far.
If there are $n$ candidates in total and the $k$th candidate is the best so far, then the probability that the $k$th candidate is best overall is $\frac{k}{n}$, which is an increasing function of $k$. (To see this: $\Pr[\text{$k$th is best overall}] = \frac{1}{n}$ and $\Pr[\text{$k$th is best of the first } k] = \frac{1}{k}$, and the first event implies the second, so the conditional probability is $\frac{1/n}{1/k} = \frac{k}{n}$.)
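If you want to sanity-check that conditional probability numerically, here is a minimal Monte Carlo sketch (the function name and the choice of $n=10$, $k=3$ are my own illustration, not part of the argument):

```python
import random

def prob_best_overall_given_best_so_far(n, k, trials=100_000):
    """Estimate P(candidate k is best overall | candidate k is best of the first k)."""
    hits = conditioned = 0
    for _ in range(trials):
        ranks = random.sample(range(n), n)   # ranks[i] = quality of candidate i+1
        if ranks[k - 1] == max(ranks[:k]):   # candidate k is best so far
            conditioned += 1
            hits += ranks[k - 1] == n - 1    # ...and also best overall
    return hits / conditioned

print(prob_best_overall_given_best_so_far(n=10, k=3))  # should be close to 3/10
```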
So if you have a decision method which selects the $k$th candidate when it is best so far, but does not select the $m$th candidate when it is best so far, with $k \lt m$, then a better (or, in an extreme case, at least as good) decision method is the reverse: do not select the $k$th candidate when best so far, but do select the $m$th candidate when best so far.
So in an optimal method, if at any stage you are willing to select a best-so-far candidate, you should also be willing to select any subsequent best-so-far candidate. That gives the strategy in your question: reject every candidate up to some point, and then select the first best-so-far candidate to appear after that point.
There is an extreme case: for example if there are two candidates, it does not make any difference whether you accept the first candidate or wait to see the second candidate.
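To see the resulting cutoff strategy in action, here is a small simulation sketch (the strategy is as described above; the function name and the choice of $n=20$ are mine): reject the first $t-1$ candidates, then accept the first best-so-far candidate after that.

```python
import random

def win_prob(n, t, trials=100_000):
    """Estimate the win probability of: reject candidates 1..t-1, then
    accept the first subsequent candidate who is best so far."""
    wins = 0
    for _ in range(trials):
        ranks = random.sample(range(n), n)
        bar = max(ranks[:t - 1], default=-1)  # best quality in the rejected prefix
        for k in range(t - 1, n):
            if ranks[k] > bar:                # first best-so-far candidate after the cutoff
                wins += ranks[k] == n - 1     # win iff she is best overall
                break
        # if nobody beats the prefix, nothing is picked and the trial is a loss
    return wins / trials

n = 20
for t in (2, 5, 8, 11, 14):
    print(t, round(win_prob(n, t), 3))  # the peak should sit near t ≈ n/e ≈ 7.4
```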
This is merely a rephrasing of Henry's answer. It's probably utterly useless to give the same proof again in different words, so I'm making it community-wiki.
Recall the secretary problem: At each time $k$, you want to decide whether to pick candidate $k$ or not. You win if the candidate you pick is the best of all $n$ candidates, and you want to maximize the probability of winning.
Now note that at any time $k$, if candidate $k$ is not the best among candidates $1, \dots, k$, then picking her loses immediately: she is not even the best among those seen so far, so she cannot be the best among all candidates. So you will only pick a candidate if she is the best among those seen so far. Thus, any optimal strategy must be of the following form, applied at each time $k$:
If candidate $k$ is better than everyone else seen previously and [some additional conditions], pick candidate $k$ and stop.
The "[some additional conditions]" must depend only on information you have upto time $k$, and in particular only on $k$ and the relative order of candidates $1$ to $k$. But since you only care about whether you pick the best candidate or not, the only relevant factor of the relative order is who the best is among those seen at that point, and that you already know is candidate $k$. So the "[some additional conditions]" is some predicate $P(k)$ that depends only on $k$.
Now note that if you pick candidate $k$ (who is best among $1$ to $k$), then the probability of winning is $k/n$ (the probability that the best overall is among the first $k$). This is an increasing function of $k$, which means that if $k$ is good, so is $k+1$ (i.e., if $P(k)$ is true, then $P(k+1)$ should be true as well). So $P(k)$ is of the form $[k \ge t]$: the set of good $k$ is an interval $\{t, \dots, n\}$ for some threshold $t$.
This is what you wanted proved. (And as Henry observed, for the special case $n=2$, both $t=1$ and $t=2$ work; for larger $n$, you can prove that the optimal $t$ is approximately $n/e$.)
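For reference, the standard calculation behind that $n/e$ figure: with threshold $t \ge 2$, candidate $k$ produces a win exactly when she is best overall (probability $\frac{1}{n}$) and the best of the first $k-1$ candidates fell among the first $t-1$ (probability $\frac{t-1}{k-1}$, which is what lets you reach position $k$ without stopping). So the win probability is $\sum_{k=t}^{n} \frac{1}{n}\cdot\frac{t-1}{k-1}$, which can be maximized exactly (a computational sketch; the names and the choice of $n=100$ are mine):

```python
from fractions import Fraction

def exact_win_prob(n, t):
    """Exact win probability of the threshold-t rule (reject the first t-1)."""
    if t == 1:
        return Fraction(1, n)  # always accept the first candidate
    return sum(Fraction(t - 1, n * (k - 1)) for k in range(t, n + 1))

n = 100
best_t = max(range(1, n + 1), key=lambda t: exact_win_prob(n, t))
print(best_t, float(exact_win_prob(n, best_t)))  # prints a t close to n/e ≈ 36.8
```

As $n$ grows, the maximizing $t$ settles near $n/e$, which is where the familiar $1/e \approx 0.368$ success probability comes from.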
A relevant paper using linear programming references Dynkin and Lindley with regard to the proof of optimality. Specifically:
Dynkin, E. B. (1963). The Optimum Choice of the Instant for Stopping a Markov Process. Soviet Mathematics Doklady, 4.
Lindley, D. V. (1961). Dynamic Programming and Decision Theory. Applied Statistics, 10, 39–51.
An excerpt from Dynkin's paper.
And a link to Lindley's paper.
This PDF has some historical information on the problem.
This also seems like a fairly good discussion of the problem (and it even references the linked paper).
And, from a more general perspective, the theory of optimal stopping is obviously relevant.