Proof of Rashevskii-Chow theorem
The original references are:
W.-L. Chow, Über Systeme von linearen partiellen Differentialgleichungen erster Ordnung. Math. Ann. 117 (1939), 98–105
P. K. Rashevsky, Any two points of a totally nonholonomic space may be connected by an admissible line. Uch. Zap. Ped. Inst. im. Liebknechta, Ser. Phys. Math. 2 (1938), 83–94 (in Russian).
There are several different proofs of this result. I learned it from Proposition III.4.1 in:
N. Th. Varopoulos, L. Saloff-Coste, T. Coulhon, Analysis and Geometry on Groups. Cambridge University Press.
The proof is quite concise, but not too difficult. Here is my own version of the proof from that book.
Chow–Rashevsky theorem
Let $X$ be a smooth vector field and $X_{t}$ the local $1$-parameter family of diffeomorphisms associated with $X$. Fix $f\in C^{\infty}$ and a point $m$. Then the function $h(t) = f(X_{t}(m))$ is smooth and $h^{(k)}(0) = (X^{k}f)(m)$. Hence the Taylor series for $h$ at $t=0$ is given by \begin{equation} (1)\qquad \sum_{k=0}^{\infty} X^{k}f(m) \frac{t^{k}}{k!}, \end{equation} which means $$ h(t) = \sum_{k=0}^{i} X^{k}f(m) \frac{t^{k}}{k!} + O(t^{i+1}) \qquad {\rm as} \ t\to 0. $$ We will use the formal expression $(e^{tX}f)(m)$ to denote (1).
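As a small illustration of (1), here is a sketch with a vector field chosen purely for the example (it does not appear in the argument): take $X = x\,\frac{d}{dx}$ on $\mathbb{R}$ and $f(x)=x$. The flow solves $\dot{x}=x$, so $X_{t}(m) = me^{t}$, and $X^{k}f = f$ for every $k$. Formula (1) then reads $$ (e^{tX}f)(m) = \sum_{k=0}^{\infty} m\,\frac{t^{k}}{k!} = me^{t} = f(X_{t}(m)), $$ so in this case the formal series actually converges to $h(t)$. For a general smooth $f$ the series (1) is only an asymptotic expansion at $t=0$, which is all that the argument uses.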
Let $Z_{1},\ldots,Z_{k}$ be smooth vector fields. Let $f\in C^{\infty}$. Fix a point $m$ and define $$ H(t_{1},\ldots,t_{k}) = f(Z_{1,t_{1}}\circ Z_{2,t_{2}} \circ\cdots \circ Z_{k,t_{k}}(m)). $$ Note that $$ \frac{\partial^{m_{1}}}{\partial t_{1}^{m_{1}}} H(0,t_{2},\ldots,t_{k}) = (Z_{1}^{m_{1}} f) (Z_{2,t_{2}} \circ\cdots\circ Z_{k,t_{k}}(m)). $$ Then taking the derivatives with respect to $t_{2},\ldots,t_{k}$ yields $$ \frac{\partial^{m_{1}+\ldots+m_{k}}}{\partial t_{1}^{m_{1}}\ldots \partial t_{k}^{m_{k}}} H(0,\ldots,0) = (Z_{k}^{m_{k}}\ldots Z_{1}^{m_{1}}f)(m). $$ Hence the Taylor series for $H$ at the origin is given by $$ \sum_{m_{1}=0}^{\infty} \ldots \sum_{m_{k}=0}^{\infty} \frac{t_{1}^{m_{1}}\ldots t_{k}^{m_{k}}}{m_{1}!\ldots m_{k}!} (Z_{k}^{m_{k}}\ldots Z_{1}^{m_{1}}f)(m), $$ which will be formally denoted by $$ (e^{t_{k}Z_{k}}\ldots e^{t_{1}Z_{1}} f)(m). $$ Before we prove the Chow--Rashevsky theorem, we show how to use the above Taylor formula to prove the following theorem.
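To see that the order of the operators in the last formula matters, here is a sketch with two vector fields on $\mathbb{R}^{3}$ chosen for illustration (the Heisenberg fields; they are an assumption of this example, not part of the argument): $$ Z_{1} = \partial_{x} - \frac{y}{2}\,\partial_{z}, \qquad Z_{2} = \partial_{y} + \frac{x}{2}\,\partial_{z}, $$ with flows $Z_{1,t}(x,y,z) = (x+t,\,y,\,z-\tfrac{1}{2}yt)$ and $Z_{2,t}(x,y,z) = (x,\,y+t,\,z+\tfrac{1}{2}xt)$. For $f(x,y,z)=z$ and $m=(x,y,z)$ one finds $$ H(t_{1},t_{2}) = f(Z_{1,t_{1}}\circ Z_{2,t_{2}}(m)) = z + \frac{x t_{2}}{2} - \frac{y t_{1}}{2} - \frac{t_{1}t_{2}}{2}, $$ so $\partial^{2}H/\partial t_{1}\partial t_{2}\,(0,0) = -\tfrac{1}{2}$. This agrees with $(Z_{2}Z_{1}f)(m) = Z_{2}(-\tfrac{y}{2}) = -\tfrac{1}{2}$, whereas $(Z_{1}Z_{2}f)(m) = Z_{1}(\tfrac{x}{2}) = +\tfrac{1}{2}$. In other words, the flow applied first to the point corresponds to the operator applied last to $f$.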
Theorem. Let $G$ be a Lie group. Then $$ \exp(tX)\exp(tY) = \exp\Big( t(X+Y) + \frac{t^{2}}{2}[X,Y] + O(t^{3})\Big). $$
Proof. Note that $\exp(tX)\exp(sY)$ is the same as $Y_{s}\circ X_{t}(e)$ ($e$ denotes the neutral element of $G$), because $s\mapsto \exp(tX)\exp(sY)$ is the integral curve of $Y$ passing through $\exp(tX)$ at $s=0$. Thus the Taylor series for $f(\exp(tX)\exp(sY))$ is $e^{tX}e^{sY} f(e)$ and hence the Taylor series for $h(t) = f(\exp(tX)\exp(tY))$ at $t=0$ is \begin{eqnarray*} e^{tX}e^{tY}f(e) & = & \Big(1+tX + \frac{t^{2}}{2}X^{2} + O(t^{3}) \Big) \Big(1+tY + \frac{t^{2}}{2}Y^{2} + O(t^{3}) \Big)f(e) \\ & = & f(e) + t(X+Y)f(e) + t^{2}\Big(\frac{X^{2}}{2} + XY + \frac{Y^{2}}{2}\Big)f(e) + O(t^{3}). \end{eqnarray*} Now there is a smooth function $t\mapsto Z(t)$, $Z(0) = 0$, such that $$ \exp(tX)\exp(tY) = \exp(Z(t)) $$ for small $t$. We can write $Z(t) = tZ_{1}+t^{2}Z_{2} + O(t^{3})$. Since $f(\exp(tW)) = f(e) + tWf(e) + \frac{t^{2}}{2}W^{2}f(e) + O(t^{3})$ and since obviously $f(A(t) + O(t^{3})) = f(A(t)) + O(t^{3})$, we have $$ f(\exp(Z(t))) = f(\exp(t(Z_{1}+tZ_{2}))) + O(t^{3}). $$ Fix $s$; then $$ f(\exp(t(Z_{1}+sZ_{2}))) = f(e) + t(Z_{1}+sZ_{2})f(e) + \frac{t^{2}}{2}(Z_{1}+sZ_{2})^{2}f(e) + O(t^{3}) = A. $$ Now substituting $s=t$ yields $$ A = f(e) + tZ_{1}f(e) + t^{2}Z_{2}f(e) + \frac{t^{2}}{2}Z_{1}^{2}f(e) + O(t^{3}). $$ Taking coordinate functions as $f$ and comparing the Taylor series yields $$ Z_{1} = X+Y,\qquad Z_{2} + \frac{Z_{1}^{2}}{2} = \frac{X^{2}}{2} + XY + \frac{Y^{2}}{2}. $$ Hence $Z_{2} = \frac{1}{2}[X,Y]$, which implies $$ Z(t) = t(X+Y) + \frac{t^{2}}{2}[X,Y] + O(t^{3}), $$ and hence the theorem follows. $\Box$
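A quick sanity check of the theorem, sketched under the assumption that $G$ is the Heisenberg group of upper triangular $3\times 3$ matrices with ones on the diagonal (a choice made for this example only). Write $E_{ij}$ for the matrix with $1$ in position $(i,j)$ and zeros elsewhere, and let $X=E_{12}$, $Y=E_{23}$, so that $[X,Y]=XY-YX=E_{13}$ and every product of three strictly upper triangular matrices vanishes. Then $$ \exp(tX)\exp(tY) = (I+tE_{12})(I+tE_{23}) = I + tE_{12} + tE_{23} + t^{2}E_{13}. $$ On the other hand, for $N = t(X+Y)+\frac{t^{2}}{2}[X,Y]$ one has $N^{2} = t^{2}E_{13}$ and $N^{3}=0$, hence $$ \exp(N) = I + N + \frac{N^{2}}{2} = I + tE_{12}+tE_{23}+t^{2}E_{13}. $$ The two sides agree exactly, so in this example the $O(t^{3})$ term is zero.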
As an immediate consequence we obtain
Corollary $\exp(-tX)\exp(-tY)\exp(tX)\exp(tY) = \exp(t^{2}[X,Y] + O(t^{3})).$
We will now see that the corollary holds for arbitrary smooth vector fields, not necessarily on a Lie group.
Corollary $Y_{t}\circ X_{t}\circ Y_{-t} \circ X_{-t}(m) = m+ t^{2}[X,Y]_{m} + O(t^{3})$.
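Example (a sketch; the fields below are chosen for illustration and are not part of the original argument). On $\mathbb{R}^{3}$ take $$ X = \partial_{x} - \frac{y}{2}\,\partial_{z},\qquad Y = \partial_{y}+\frac{x}{2}\,\partial_{z}, \qquad [X,Y]=\partial_{z}, $$ with flows $X_{t}(x,y,z)=(x+t,\,y,\,z-\tfrac{1}{2}yt)$ and $Y_{t}(x,y,z)=(x,\,y+t,\,z+\tfrac{1}{2}xt)$. Starting from $m=(x,y,z)$ and applying the four flows in turn gives \begin{eqnarray*} X_{-t}(m) & = & (x-t,\; y,\; z+\tfrac{1}{2}yt), \\ Y_{-t}\circ X_{-t}(m) & = & (x-t,\; y-t,\; z+\tfrac{1}{2}yt-\tfrac{1}{2}xt+\tfrac{1}{2}t^{2}), \\ X_{t}\circ Y_{-t}\circ X_{-t}(m) & = & (x,\; y-t,\; z-\tfrac{1}{2}xt+t^{2}), \\ Y_{t}\circ X_{t}\circ Y_{-t}\circ X_{-t}(m) & = & (x,\; y,\; z+t^{2}). \end{eqnarray*} Thus the composition equals $m+t^{2}[X,Y]_{m}$ with no error term in this particular case. The general statement is proved as follows.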
Proof. The Taylor series for $h(t) = f(Y_{t}(X_{t}(Y_{-t}(X_{-t}(m)))))$ is \begin{eqnarray*} e^{-tX}e^{-tY}e^{tX}e^{tY} f(m) & = & (1 - tX + \frac{t^{2}}{2}X^{2} + O(t^{3})) (1 - tY + \frac{t^{2}}{2}Y^{2} + O(t^{3})) \times \\ & \times & (1 + tX + \frac{t^{2}}{2}X^{2} + O(t^{3})) (1 + tY + \frac{t^{2}}{2}Y^{2} + O(t^{3})) f(m) \\ & = & (1 + t^{2}[X,Y] + O(t^{3})) f(m). \end{eqnarray*} Taking the coordinate functions as $f$ yields the claim. $\Box$
Now we can turn to the main subject of the section, namely the connectivity theorem of Chow and Rashevsky.
Theorem (Chow-Rashevsky). Let $\Omega\subset\mathbb{R}^{n}$ be an open domain and let $X_{1},\ldots,X_{k}$ be smooth vector fields satisfying Hörmander's condition, i.e., for some positive integer $d$, commutators of $X_{1},\ldots,X_{k}$ of length less than or equal to $d$ span the tangent space $\mathbb{R}^{n}$ at every point of $\Omega$. Then every two points in $\Omega$ can be connected by an admissible curve. Moreover, for any compact set $K\subset\Omega$ there is a constant $C>0$ such that \begin{equation} (2)\qquad \rho(x,y) \leq C|x-y|^{1/d} \qquad \mbox{for all $x,y\in K$}, \end{equation} where $\rho$ denotes the Carnot--Carathéodory distance associated with $X_{1},\ldots,X_{k}$.
Remark. The estimate (2) is due to Nagel, Stein and Wainger.
Proof. Let $Y_{1},\ldots,Y_{p}$ be smooth vector fields.
Fix $m\in\Omega$. Define by induction
\begin{eqnarray*}
C_{1}(t) & = & Y_{1,t}(m) \\
C_{p}(t) & = &
C_{p-1}(t)^{-1}\circ Y_{p,-t}\circ C_{p-1}(t) \circ Y_{p,t}(m).
\end{eqnarray*}
Recall that $Y_{j,t}$ denotes the local family of diffeomorphisms associated with
$Y_j$.
Since both $C_{p}(t)$ and $C_{p}(t)^{-1}$ are compositions of diffeomorphisms
$Y_{j,\pm t}$, one easily obtains that the Taylor series for
$f(C_{p}(t))$ and $f(C_{p}(t)^{-1})$ are given by
$\widetilde{c}_{p}(t)f(m)$ and $\widetilde{c}_{p}(t)^{-1}f(m)$, where
$\widetilde{c}_{p}(t)$ is a formal series defined by induction as follows
\begin{eqnarray*}
\widetilde{c}_{1}(t) & = & e^{tY_{1}} \\
\widetilde{c}_{p}(t) & = & e^{tY_{p}} \widetilde{c}_{p-1}(t)
e^{-tY_{p}} \widetilde{c}_{p-1}(t)^{-1}.
\end{eqnarray*}
It is easy to prove by induction that
\begin{equation}
(3)\qquad
\widetilde{c}_{p}(t) = 1 + t^{p} [Y_{p},[Y_{p-1},[\ldots,Y_{1}]\ldots]
+ O(t^{p+1}),
\end{equation}
and hence
$$
\widetilde{c}_{p}(t)^{-1} = 1 - t^{p} [Y_{p},[Y_{p-1},[\ldots,Y_{1}]\ldots]
+ O(t^{p+1}).
$$
Indeed, for $p=1$, (3) is obvious.
Assume that (3) holds for $p$; we prove it for $p+1$.
We have
\begin{eqnarray*}
\widetilde{c}_{p+1}(t)
& = &
e^{tY_{p+1}}\widetilde{c}_{p}(t) e^{-tY_{p+1}}\widetilde{c}_{p}(t)^{-1} \\
& = &
e^{tY_{p+1}} (\widetilde{c}_{p}(t) - 1)e^{-tY_{p+1}}
\widetilde{c}_{p}(t)^{-1} + \widetilde{c}_{p}(t)^{-1} \\
& = &
(1 + tY_{p+1})(\widetilde{c}_{p}(t)-1)(1 - tY_{p+1})
\widetilde{c}_{p}(t)^{-1} + \widetilde{c}_{p}(t)^{-1} + O(t^{p+2}) \\
& = &
(\widetilde{c}_{p}(t)-1)\widetilde{c}_{p}(t)^{-1} +
t^{p+1}[Y_{p+1},[Y_{p},[\ldots,Y_{1}]\ldots] +
\widetilde{c}_{p}(t)^{-1} + O(t^{p+2}) \\
& = &
1 + t^{p+1}[Y_{p+1},[Y_{p},[\ldots,Y_{1}]\ldots] + O(t^{p+2}).
\end{eqnarray*}
The claim is proved.
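For orientation, the first induction step can be written out by hand (this is just the case $p=2$ of (3), expanded to second order). By definition $$ \widetilde{c}_{2}(t) = e^{tY_{2}}e^{tY_{1}}e^{-tY_{2}}e^{-tY_{1}}, $$ and since $$ e^{\pm tY_{2}}e^{\pm tY_{1}} = 1 \pm t(Y_{1}+Y_{2}) + t^{2}\Big(\frac{Y_{2}^{2}}{2}+Y_{2}Y_{1}+\frac{Y_{1}^{2}}{2}\Big)+O(t^{3}), $$ the first-order terms cancel in the product and the coefficient of $t^{2}$ is $$ 2\Big(\frac{Y_{2}^{2}}{2}+Y_{2}Y_{1}+\frac{Y_{1}^{2}}{2}\Big) - (Y_{1}+Y_{2})^{2} = Y_{2}Y_{1}-Y_{1}Y_{2} = [Y_{2},Y_{1}], $$ in agreement with (3).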
Hence the Taylor series of $f(C_{p}(t))$ at $t=0$ begins with $$ f(m) + t^{p}[Y_{p},[Y_{p-1},[\ldots,Y_{1}]\ldots]f(m) + O(t^{p+1}) $$ and the Taylor series of $f(C_{p}(t)^{-1})$ at $t=0$ begins with $$ f(m) - t^{p}[Y_{p},[Y_{p-1},[\ldots,Y_{1}]\ldots]f(m) + O(t^{p+1}). $$ Now if $F_{1}$ and $F_{2}$ are two $C^{\infty}$ functions with Taylor series $F_{1}(t) = a + bt^{p} +\ldots$ and $F_{2}(t) = a - bt^{p} +\ldots$, then it is easy to see that the function $$ G(t) = \left\{ \begin{array}{cc} F_{1}(t^{1/p}) & \mbox{if $t\geq 0$} \\ F_{2}((-t)^{1/p}) & \mbox{if $t<0$} \end{array} \right. $$ is $C^{1}$ in a neighborhood of $0$ and $G'(0)=b$.
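To see why one gets $C^{1}$ but should not expect more, consider the simplest case beyond the leading term (the constants $a,b,c_{1},c_{2}$ are arbitrary and introduced only for this illustration): for $p=2$, $F_{1}(t) = a+bt^{2}+c_{1}t^{3}$ and $F_{2}(t)=a-bt^{2}+c_{2}t^{3}$ we get $$ G(t) = \left\{ \begin{array}{cc} a+bt+c_{1}t^{3/2} & \mbox{if $t\geq 0$} \\ a+bt+c_{2}(-t)^{3/2} & \mbox{if $t<0$} \end{array} \right. \qquad G'(t) = \left\{ \begin{array}{cc} b+\frac{3}{2}c_{1}t^{1/2} & \mbox{if $t\geq 0$} \\ b-\frac{3}{2}c_{2}(-t)^{1/2} & \mbox{if $t<0$.} \end{array} \right. $$ Hence $G'$ is continuous with $G'(0)=b$, but $G''$ is unbounded near $0$ unless $c_{1}=c_{2}=0$.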
Taking $F_{1}(t)=f(C_{p}(t))$ and $F_{2}(t) = f(C_{p}(t)^{-1})$, where $f$ ranges over the coordinate functions, we conclude that the function $$ \phi(t) = \left\{ \begin{array}{cc} C_{p}(t^{1/p}) & \mbox{if $t\geq 0$} \\ C_{p}((-t)^{1/p})^{-1} & \mbox{if $t<0$} \end{array} \right. $$ is a $C^1$ path passing through $m$ at $t=0$ with $\phi'(0) = [Y_{p},[Y_{p-1},[\ldots,Y_{1}]\ldots]$.
Let $V_{1},\ldots,V_{n}$ be a basis of $\mathbb{R}^{n}=T_{m}\Omega$ arising from Hörmander's condition, i.e., $$ V_{i} = [X_{i,p_{i}},[X_{i,p_{i}-1},[\ldots,X_{i,1}]\ldots], $$ where $i=1,2,\ldots,n$, $p_{i}\leq d$ and $X_{i,l}\in\{ X_{1},\ldots,X_{k}\}$. Let $\phi_{i}(t)$ be the $C^1$ path defined as above for $Y_{1}=X_{i,1},\ldots,Y_{p_{i}} = X_{i,p_{i}}$. Then $\phi_{i}'(0)=V_{i}$. Finally, regarding each $\phi_{i}(\theta_{i})$ as the corresponding composition of diffeomorphisms, define $\Phi$ by $$ \Phi(\theta) = \phi_{1}(\theta_{1})\circ \cdots \circ \phi_{n}(\theta_{n})(m), \qquad \theta = (\theta_{1},\ldots,\theta_{n}). $$ Then $\Phi$ is a $C^1$ mapping from a neighborhood of $0$ in $\mathbb{R}^{n}$ to $\Omega$. Since $\partial\Phi/\partial\theta_{i}(0)=\phi_{i}'(0)=V_{i}$ and the vectors $V_{i}$ are linearly independent, the inverse function theorem shows that $\Phi$ is a diffeomorphism in a neighborhood of $0$. This implies that any point in a neighborhood of $m=\Phi(0)$ can be connected to $m$ by an admissible curve.
More precisely, $\phi_{i}(\theta_{i})$ is a composition of diffeomorphisms of the form $X_{j,\pm|\theta_{i}|^{1/p_{i}}}$. Hence, denoting the composition by $\prod$, we may write \begin{equation} (4)\qquad \Phi(\theta) = \left( \prod_{i=1}^{n} \prod_{\alpha=1}^{M_{i}} X_{i,j_{\alpha},\pm|\theta_{i}|^{1/p_{i}}} \right)(m). \end{equation} Assume that $|\theta|\leq 1$. For any $x$, the two points $x$ and $X_{i,j_{\alpha},\pm|\theta_{i}|^{1/p_{i}}}(x)$ can be connected by an admissible curve, namely an integral curve of $X_{i,j_{\alpha}}$, and hence the Carnot--Carathéodory distance between these two points is no more than $|\theta_{i}|^{1/p_{i}}\leq |\theta|^{1/d}$. Now we can move from $m$ to $\Phi(\theta)$ along such admissible curves and hence \begin{equation} (5)\qquad \rho(\Phi(\theta),m) \leq C_{1}|\theta|^{1/d} \leq C_{2}|\Phi(\theta)-m|^{1/d}, \end{equation} where $C_{1}=\sum_{i=1}^{n}M_{i}$ equals the number of integral curves we use to join $m$ with $\Phi(\theta)$ (see (4)). We also employed the fact that $|\theta|\approx |\Phi(\theta) - m|$, which is a consequence of the fact that $\Phi$ is a diffeomorphism.
Since we can connect all the points in a neighborhood of any given point, and $\Omega$ is connected, it easily follows that we can connect any two points in $\Omega$. The estimate (2) follows from (5). $\Box$
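As an illustration of the exponent $1/d$ in (2), here is a sketch for the Heisenberg fields $X = \partial_{x} - \frac{y}{2}\partial_{z}$, $Y=\partial_{y}+\frac{x}{2}\partial_{z}$ on $\mathbb{R}^{3}$ (an example chosen here, not taken from the proof). Since $[X,Y]=\partial_{z}$, Hörmander's condition holds with $d=2$. A direct computation with the explicit flows $X_{t}(x,y,z)=(x+t,\,y,\,z-\frac{1}{2}yt)$ and $Y_{t}(x,y,z)=(x,\,y+t,\,z+\frac{1}{2}xt)$ gives $$ Y_{t}\circ X_{t}\circ Y_{-t}\circ X_{-t}(x,y,z) = (x,\,y,\,z+t^{2}). $$ So for $\theta>0$ the points $m$ and $m'=m+(0,0,\theta)$ are joined by four integral curves, each followed for time $t=\theta^{1/2}$, and the same reasoning that leads to (5) yields $$ \rho(m,m') \leq 4\,\theta^{1/2} = 4\,|m-m'|^{1/2}. $$ Moving a Euclidean distance $\theta$ in the direction of the commutator thus costs at most of order $\theta^{1/2}$, which is the case $d=2$ of (2).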
Sussmann, Hector J., Orbits of families of vector fields and integrability of distributions, Trans. Amer. Math. Soc., 180, 1973, 171--188, gives a very easy explanation, using flows of vector fields.
The article is available free of charge.
It suffices to have the curve horizontal almost everywhere, because then it will stay tangent to any immersed submanifold whose tangent spaces contain the distribution; just write out local coordinates in which the submanifold is locally given by setting various coordinate functions to constants.
As a reference, in addition to the classical ones cited above, I can recommend the following:
Agrachev, Andrei; Barilari, Davide; Boscain, Ugo, A comprehensive introduction to sub-Riemannian geometry, ZBL07073879.
The proof of the Chow-Rashevskii theorem is in Section 3.2. An electronic version of the book is also freely available online (https://www.imj-prg.fr/~davide.barilari/ABB-v2.pdf).
The idea is of course the same as the one in the proof given above by Piotr Hajlasz, but I think that the presentation in the book is more geometric and concise.
Concerning your last question (everywhere vs almost everywhere): horizontal curves might not be differentiable at certain points (e.g. think of a curve with a corner). In order to define a length, the tangent vector of a horizontal curve $\gamma:[0,1]\to M$ should be defined almost everywhere on $[0,1]$. There are then several regularity classes of curves which one might use (all of them appear in the literature):
- $\gamma \in W^{1,1}$ that is absolutely continuous curves (the largest class one can think of)
- $\gamma \in W^{1,2}$ that is absolutely continuous curves whose tangent vector is $L^2$ (slightly smaller, but natural in view of minimization of the energy functional, and furthermore the space of "admissible velocities" is Hilbert)
- $\gamma \in W^{1,\infty}$ that is curves that are locally Lipschitz in charts (as I comment below, this class is also natural, since one can always reduce to this case when dealing with the length-minimization problem)
In any case, of course, the tangent vector, which is defined almost everywhere, is required to belong to the sub-Riemannian distribution. The proof of the Chow-Rashevskii theorem shows that connectivity is achieved by horizontal curves that are concatenations of a finite number of smooth curves, which belong to all the classes above (so the choice of regularity class is irrelevant for connectivity).
It turns out that also the sub-Riemannian distance (defined as the infimum of the length of horizontal curves between two points) does not depend on the choice of the regularity class. This is due to the fact that, within a given regularity class ($W^{1,1}$, $W^{1,2}$ or $W^{1,\infty}$) one can always reparametrize the curve, without changing its length, in such a way that the reparametrized curve has constant speed. This is proved in Section 3.6 of the book by Agrachev, Barilari and Boscain.
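The role of constant speed can be made explicit with a standard Cauchy--Schwarz computation (a sketch; here $|\dot\gamma(t)|$ denotes the sub-Riemannian norm of the velocity, $\ell$ the length and $J$ the energy, notation assumed for this remark rather than taken from the text above). For a horizontal curve $\gamma:[0,1]\to M$ in $W^{1,2}$, $$ \ell(\gamma) = \int_{0}^{1}|\dot\gamma(t)|\,dt \leq \Big(\int_{0}^{1}|\dot\gamma(t)|^{2}\,dt\Big)^{1/2} = \big(2J(\gamma)\big)^{1/2}, \qquad J(\gamma) = \frac{1}{2}\int_{0}^{1}|\dot\gamma(t)|^{2}\,dt, $$ with equality exactly when $|\dot\gamma|$ is constant almost everywhere. A curve reparametrized to constant speed satisfies $|\dot\gamma| = \ell(\gamma)$ almost everywhere, so its velocity is bounded and it belongs to $W^{1,\infty}$, which is consistent with the three classes giving the same distance.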