Intuition behind the variational principle

Let's not consider a closed path for now but some smooth curve $\gamma:[0,1]\to M$, and consider the functional $$S[\gamma]:=\int_0^1 \sum_ip_i(\gamma(t))\, \mathrm d q_i(\gamma(t))-H(\gamma(t))\, \mathrm d t$$ where I'm working in a chart with coordinates $(q_1,\dots,q_n,p_1,\dots,p_n)$, and $\mathrm d q_i(\gamma(t))$ is shorthand for $\dot q_i(t)\,\mathrm d t$ along the curve. An old problem in classical mechanics (finding trajectories with fixed endpoints) was formulated so that this functional is stationary with respect to the true physical trajectory. You can read more about how the calculus of variations works, so here I will compute the first variation of this functional somewhat heuristically. To get the variation, we compute $S[\gamma + \epsilon \eta]-S[\gamma]=\epsilon\, \delta S[\gamma]+O(\epsilon^2)$, where $\eta:[0,1]\to M$ is a variation with $\eta(0)=\eta(1)=0$. In practice, we can get the variation by operating with $\delta$ as if it were an ordinary differential, so for instance $\delta(a b)=a\,\delta b + b\,\delta a$, $\delta f(x,y)=f_x\, \delta x + f_y\, \delta y$, etc. The Hamiltonian depends on the point along the curve through the coordinates, $H=H(q_1(\gamma(t)),\dots,q_n(\gamma(t)),p_1(\gamma(t)),\dots,p_n(\gamma(t)))$, so varying this we get

$$\delta S = \int_0^1 \sum_i\left(\delta p_i\, \mathrm d q_i + p_i\,\mathrm d\, \delta q_i\right)-\sum_i\left(\frac{\partial H}{\partial p_i}\delta p_i + \frac{\partial H}{\partial q_i}\delta q_i\right)\mathrm d t$$ and now we note that $p_i\,\mathrm d\, \delta q_i=\mathrm d(p_i\,\delta q_i)-\delta q_i\,\mathrm d p_i$. This means that we may integrate by parts: $$\delta S = \sum_i\int_0^1 \left(\mathrm d q_i -\frac{\partial H}{\partial p_i}\mathrm d t \right) \delta p_i -\left(\mathrm d p_i +\frac{\partial H}{\partial q_i} \mathrm d t \right) \delta q_i + \sum_i p_i\,\delta q_i\Big|_{0}^1.$$ The last term is usually called a surface (or boundary) term, and it vanishes because the variation vanishes at the endpoints, since the endpoints are fixed.

Requiring $\delta S = 0$ gives $$\frac{\mathrm d q_i}{\mathrm d t}=\frac{\partial H}{\partial p_i},\qquad\frac{\mathrm d p_i}{\mathrm d t}=-\frac{\partial H}{\partial q_i}$$ along $\gamma$. This is the same as requiring that $\gamma$ be the flow of the Hamiltonian vector field $X_H$ given by $\mathrm dH=i_{X_H}\omega$ (modulo a minus sign I always forget). Historically, this was the motivation for symplectic geometry, with the symplectic manifold being the cotangent bundle $T^*Q$ of the $n$-dimensional configuration space $Q$, which is nothing more than the manifold where the (generalized) coordinates live. Alternatively, writing $x=(q_1,\dots,q_n,p_1,\dots,p_n)$ and letting $J$ be the standard symplectic matrix, we may write $$\dot x = J\nabla H.$$
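To make $\dot x = J\nabla H$ concrete, here is a minimal numerical sketch (the harmonic oscillator $H=\tfrac12(q^2+p^2)$ and the symplectic Euler scheme are my own choices, not taken from the question): it integrates Hamilton's equations and checks that the energy stays nearly constant along the flow.

```python
import numpy as np

def grad_H(q, p):
    """Gradient of the assumed example Hamiltonian H(q, p) = (q**2 + p**2) / 2."""
    return q, p  # (dH/dq, dH/dp)

def symplectic_euler(q, p, dt, steps):
    """Integrate dq/dt = dH/dp, dp/dt = -dH/dq, i.e. x' = J grad H."""
    traj = [(q, p)]
    for _ in range(steps):
        dHdq, _ = grad_H(q, p)
        p = p - dt * dHdq            # momentum update first...
        q = q + dt * grad_H(q, p)[1]  # ...then position, using the new p
        traj.append((q, p))
    return np.array(traj)

traj = symplectic_euler(q=1.0, p=0.0, dt=0.01, steps=2000)
energies = 0.5 * (traj[:, 0]**2 + traj[:, 1]**2)
print(energies.min(), energies.max())  # both stay close to 0.5
```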

These are properly called Hamilton's equations, although they may also be viewed as Euler-Lagrange equations, which come from the same procedure applied to a functional of the form $\int L(\dot q,q;t)\,\mathrm dt$. The book is probably calling Hamilton's equations "Euler equations" because that is exactly what the Euler-Lagrange equations become for this particular functional.
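To spell out that last remark: treating the $q_i$ and $p_i$ as independent variables in the Euler-Lagrange equations for the phase-space Lagrangian $L=\sum_i p_i\dot q_i-H(q,p)$ of the functional above reproduces Hamilton's equations,
$$
\frac{\partial L}{\partial q_i}-\frac{\mathrm d}{\mathrm dt}\frac{\partial L}{\partial \dot q_i}
  =-\frac{\partial H}{\partial q_i}-\dot p_i=0,
\qquad
\frac{\partial L}{\partial p_i}-\frac{\mathrm d}{\mathrm dt}\frac{\partial L}{\partial \dot p_i}
  =\dot q_i-\frac{\partial H}{\partial p_i}=0 .
$$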

Now, for your particular problem, you have periodic motion. I remember that you asked in this question why, over a closed path, we get $$\int_{\gamma} \lambda:=\int_{\gamma} \sum_i p_i\,\mathrm d q_i = \frac12\int_0^{2\pi}\langle-J\dot x,x\rangle\, dt.$$ The constraint is basically a normalization of this, and since $\langle J x,Jy\rangle=\langle x,y\rangle$, it really is equivalent to your constraint. To obtain the correct equations of motion (as they are called) from variational calculus when constraints are present, we use Lagrange multipliers. Consider the problem of making $\int F(\dot x,x,t)\, \mathrm dt$ stationary, where $F$ is some function (the integrand), subject to the constraint $\int M(\dot x,x,t)\, \mathrm dt= m$. This can mean, for instance, that the total mass is $m$, or whatever. The correct functional to which we may apply the Euler-Lagrange equations is

$$\int\left(F(\dot x,x,t) + \lambda(x)M(\dot x,x,t)\right)\mathrm dt$$
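As a minimal worked illustration (the integrands here are my own choice, not taken from the question): to extremize $\int_0^1 \tfrac12\dot x^2\,\mathrm dt$ subject to $\int_0^1 x\,\mathrm dt=m$ with fixed endpoints, apply the Euler-Lagrange equation to the combined integrand $\tfrac12\dot x^2+\lambda x$: $$\lambda-\frac{\mathrm d}{\mathrm dt}\dot x=0\quad\Longrightarrow\quad \ddot x=\lambda,$$ so the extremal is a parabola, and the constant $\lambda$ is fixed at the end by plugging the solution back into the constraint.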

So, by saying minimize $\int H(x(t))\,\mathrm dt$ subject to the constraint, I'm really saying to minimize the so-called "action functional", which is the one I started this answer with ($\lambda$ will drop out during the calculation and turn out to be a constant in this case). All of this should answer your questions 1. and 2.

Whether $S[\gamma]$ attains an infimum or supremum isn't important; only whether it's stationary is. I don't know offhand how I would show that it attains a minimum or maximum. That would require examining the second variation, or something like the non-negativity of the Weierstrass excess (or simply $E$) function $E(x,y,z,w)=F(x,y,w)-F(x,y,z)-(w-z)F_{y'}(x,y,z)$.
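For what it's worth, here is a tiny illustration with an integrand of my own choosing: for $F(x,y,y')=\tfrac12 (y')^2$ the excess function is $$E(x,y,z,w)=\tfrac12 w^2-\tfrac12 z^2-(w-z)\,z=\tfrac12(w-z)^2\ge 0,$$ so the extremals of $\int \tfrac12 (y')^2\,\mathrm dx$ (straight lines) really are minima.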


Let there be given a curve $\vec{p}(t)$ and a real-valued function $L$ with the following arguments: this curve, the time derivative of the curve, and the time $t$ itself. Now consider the following integral: $$ W\left(\vec{p},\dot{\vec{p}}\right) = \int_{t_1}^{t_2} L\left(\vec{p},\dot{\vec{p}},t\right) dt $$ The integral $W$ depends upon the chosen curve $\vec{p}(t)$; it is a so-called functional of $\vec{p}$ and $\dot{\vec{p}}$.
We wish to determine for which curves $\vec{p}(t)$ the functional $W(\vec{p},\dot{\vec{p}})$ has an extreme value. But it is not possible to simply take a derivative with respect to a curve and set it to zero, so we must do something else.
Assume that the extreme value is obtained for the curve $\vec{q}(t)$ and let $\vec{p}(t) = \vec{q}(t) + \epsilon \vec{f}(t)$, with the same end-points $\vec{p}(t_1) = \vec{q}(t_1)$ and $\vec{p}(t_2) = \vec{q}(t_2)$, hence $\vec{f}(t_1) = \vec{f}(t_2) = \vec{0}$. For the rest, the curve $\vec{f}(t)$ is arbitrary and the variable $\epsilon$ does not depend on $t$. In this way our functional $W$ has become an ordinary function of $\epsilon$: $$ W\left(\vec{p},\dot{\vec{p}}\right) = \int_{t_1}^{t_2} L\left(\vec{q} + \epsilon \vec{f},\dot{\vec{q}} + \epsilon \dot{\vec{f}},t\right) dt $$ An ordinary real-valued function, which can be differentiated in the usual way to find extremes: $$ \frac{dW}{d\epsilon} = 0 \qquad \Longleftrightarrow \qquad \int_{t_1}^{t_2} \frac{dL\left(\vec{q} + \epsilon \vec{f},\dot{\vec{q}} + \epsilon \dot{\vec{f}},t\right)} {d\epsilon}\, dt = 0 \quad \mbox{for} \; \epsilon = 0 $$

Let the (vector) components of $\,\vec{p}\,$ be $\,p_k$. Apply the chain rule: $$ \frac{dL}{d\epsilon} = \sum_k \frac{\partial L}{\partial p_k} \frac{d p_k}{d\epsilon} + \sum_k \frac{\partial L}{\partial \dot{p}_k} \frac{d \dot{p}_k}{d\epsilon} + \frac{\partial L}{\partial t} \frac{d t}{d\epsilon} = \sum_k \frac{\partial L}{\partial p_k} f_k + \sum_k \frac{\partial L}{\partial \dot{p}_k} \dot{f}_k + 0 $$ The terms containing $\dot{f}_k$ can be rewritten with help of: $$ \frac{d}{dt} \left\{ \frac{\partial L}{\partial \dot{p}_k} f_k\right\} = \frac{\partial L}{\partial \dot{p}_k} \dot{f}_k + \frac{d}{dt} \left(\frac{\partial L}{\partial \dot{p}_k}\right) f_k $$ It follows that: $$ \frac{dW}{d\epsilon} = \sum_k \int_{t_1}^{t_2} \left[\frac{\partial L}{\partial p_k} f_k - \frac{d}{dt} \left(\frac{\partial L}{\partial \dot{p}_k}\right) f_k + \frac{d}{dt} \left\{ \frac{\partial L}{\partial \dot{p}_k} f_k\right\} \right] dt = \\ \sum_k \int_{t_1}^{t_2} \left[\frac{\partial L}{\partial p_k} - \frac{d}{dt} \left(\frac{\partial L}{\partial \dot{p}_k}\right) \right] f_k\, dt + \sum_k \left[ \frac{\partial L}{\partial \dot{p}_k} f_k \right]_{t_1}^{t_2} $$

But we know that $f_k(t_1) = f_k(t_2) = 0$, so the last term is zero. Furthermore we know that, for $\epsilon = 0$, $p_k = q_k$, and the functions $f_k$ are completely arbitrary. Therefore the integral can only vanish for all of them if the following conditions (necessary, though not sufficient in general, for an extremum) are fulfilled for all $k$: $$ \frac{\partial L}{\partial q_k} - \frac{d}{dt} \left(\frac{\partial L}{\partial \dot{q}_k}\right) = 0 $$ These are the well-known Euler-Lagrange equations. It is noted that special notations like $\delta W$ for the "variation" of the integral are not needed. The variation is reduced to ordinary differentiation, which makes the (IMHO confusing) $\delta$ practice completely redundant.
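Here is a small symbolic check of this recipe (a sketch with a Lagrangian of my own choosing, the harmonic oscillator $L=\tfrac12 m\dot q^2-\tfrac12 k q^2$): evaluating the Euler-Lagrange expression should reproduce $m\ddot q+kq=0$.

```python
import sympy as sp

t = sp.symbols('t')
m, k = sp.symbols('m k', positive=True)
q = sp.Function('q')(t)
qdot = sp.diff(q, t)

# Assumed example Lagrangian (harmonic oscillator), not taken from the answer above
L = sp.Rational(1, 2) * m * qdot**2 - sp.Rational(1, 2) * k * q**2

# Euler-Lagrange expression: dL/dq - d/dt (dL/dqdot)
euler_lagrange = sp.diff(L, q) - sp.diff(sp.diff(L, qdot), t)
print(sp.simplify(euler_lagrange))  # -k*q(t) - m*Derivative(q(t), (t, 2))

# sympy.calculus.euler.euler_equations(L, [q], t) returns the same equation set to zero
```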