Is this "derivation" of the path length formula actually correct?
What he did is correct, though the reasoning behind it is glossed over. I'll give you a rigorous version of what he did.
Let $\Delta x > 0$ represent the length of some horizontal line segment, and for now let $y = f(x)$, where $f$ is some function. (If $y$ is not a function of $x$, simply break the curve into several pieces, each of which is the graph of a function.) Then define $$\Delta y = f(x + \Delta x) - f(x)$$ Now, $\Delta x$ and $\Delta y$ are the lengths of the legs of a right triangle, so we should probably talk about the hypotenuse as well, whose length I will denote by $\Delta s$. By the Pythagorean Theorem, $$(\Delta s)^2 = (\Delta x)^2 + (\Delta y)^2 = (\Delta x)^2\left[1 + \left(\frac{\Delta y}{\Delta x}\right)^2\right]$$ (where I pulled out $(\Delta x)^2$ from both terms on the RHS). Substituting our two expressions from above, $$(\Delta s)^2 = (\Delta x)^2\left[1 + \left(\frac{f(x + \Delta x) - f(x)}{\Delta x}\right)^2\right]$$ $$\implies \left(\frac{\Delta s}{\Delta x}\right)^2 = 1 + \left(\frac{f(x + \Delta x) - f(x)}{\Delta x}\right)^2$$
Now, take limits of both sides as $\Delta x \rightarrow 0$: $$\lim\limits_{\Delta x \rightarrow 0}\left(\frac{\Delta s}{\Delta x}\right)^2 = \lim\limits_{\Delta x \rightarrow 0}\left[1 + \left(\frac{f(x + \Delta x) - f(x)}{\Delta x}\right)^2\right]$$ $$\implies \left(\lim\limits_{\Delta x \rightarrow 0}\frac{\Delta s}{\Delta x}\right)^2 = 1 + \left(\lim\limits_{\Delta x \rightarrow 0}\frac{f(x + \Delta x) - f(x)}{\Delta x}\right)^2$$ $$\implies \left(\frac{ds}{dx}\right)^2 = 1 + \left[f'(x)\right]^2 = 1 + \left(\frac{dy}{dx}\right)^2$$ $$\implies \left|\frac{ds}{dx}\right| = \sqrt{1 + \left(\frac{dy}{dx}\right)^2}$$ If you assume that $s$ increases as $x$ increases, i.e. that the length $s$ of your path increases as you move from left to right, then you can drop the absolute value: $$\frac{ds}{dx} = \sqrt{1 + \left(\frac{dy}{dx}\right)^2}$$ or, in differential form, $$ds = dx\sqrt{1 + \left(\frac{dy}{dx}\right)^2}$$
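As a quick sanity check on this identity, here is a minimal Python sketch; the function $f(x) = x^2$ and the point $x = 1$ are my own choices for illustration, not part of the original derivation. It shows $\Delta s/\Delta x$ approaching $\sqrt{1 + f'(x)^2}$ as $\Delta x \to 0$.

```python
# A minimal sketch, assuming f(x) = x**2 and x = 1 (arbitrary example choices).
import math

def f(x):
    return x ** 2

x = 1.0
rhs = math.sqrt(1 + (2 * x) ** 2)          # sqrt(1 + f'(x)^2), since f'(x) = 2x
for dx in (1e-1, 1e-3, 1e-5):
    dy = f(x + dx) - f(x)                  # Delta y = f(x + Delta x) - f(x)
    ds = math.hypot(dx, dy)                # Delta s from the Pythagorean theorem
    print(f"dx={dx:.0e}  ds/dx={ds / dx:.6f}  sqrt(1 + f'(x)^2)={rhs:.6f}")
```

As $\Delta x$ shrinks, the printed ratio settles down to $\sqrt{5} \approx 2.236$, exactly as the limit argument predicts.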
This is what helped me stay afloat during general relativity classes; it might help you too. First note that this is always in the context of some (given or arbitrary) path $S$. Say your path is parametrised by some parameter $t$, and that it goes from $a$ to $b$ as $t$ goes from $0$ to $1$, just to have something concrete to work with. Then $ds = \sqrt{dx^2+dy^2}$ can be translated into $$ \frac{ds}{dt}=\sqrt{\left(\frac{dx}{dt}\right)^2+\left(\frac{dy}{dt}\right)^2} $$ or, correspondingly (using the appropriate form of the fundamental theorem of calculus), $$ \int_a^bds=\int_0^1\sqrt{\left(\frac{dx}{dt}\right)^2+\left(\frac{dy}{dt}\right)^2}dt $$ Now you can factor out $\frac{dx}{dt}$ from the square root and apply the chain rule to get what I have learned to think of when I see $ds = dx \sqrt{1+\left( \frac{dy}{dx} \right)^2}$, namely $$ \int_a^bds=\int_0^1\left|\frac{dx}{dt}\right| \sqrt{1+\left(\frac{dy}{dx}\right)^2}dt $$ which is valid as long as the path isn't vertical ($\frac{dy}{dx}$ is evaluated along the path, where $y$ can locally be viewed as a function of $x$, provided the path isn't vertical). If the path is vertical, you can factor out $\frac{dy}{dt}$ from the square root instead, and everything works out nicely. The integral above can, of course, also be turned into an integral over $x$ by a simple substitution if that makes the problem easier.
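To see the parametric formula in action, here is a rough numerical sketch; the path (a quarter of the unit circle, whose length is $\pi/2$) and the midpoint Riemann sum are my own illustrative choices. Note that this path is vertical at $t = 0$, and the parametric integral handles that without any fuss.

```python
# A minimal sketch, assuming the path x(t) = cos(pi t / 2), y(t) = sin(pi t / 2)
# for t in [0, 1]: a quarter of the unit circle, so the exact length is pi/2.
import math

def dxdt(t):
    return -math.pi / 2 * math.sin(math.pi * t / 2)

def dydt(t):
    return math.pi / 2 * math.cos(math.pi * t / 2)

n = 10_000
h = 1.0 / n
# Midpoint Riemann sum of the integral of sqrt((dx/dt)^2 + (dy/dt)^2) over [0, 1]
length = sum(math.hypot(dxdt((i + 0.5) * h), dydt((i + 0.5) * h)) * h for i in range(n))
print(length, math.pi / 2)   # both approximately 1.5707963
```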
Note that in general relativity (and perhaps elsewhere in physics) the form $ds^2=dx^2+dy^2$ is not unheard of. As far as I can tell that's just because they don't want to bother with the square root sign when writing down formulae. You still definitely have to put the square root back into the expression before any calculations can be done.
Although mathematicians go nuts over this stuff, abusing notation and working with symbols in a way that relies on intuition is often critical to a physicist, and useful to a mathematician who wants to build deeper understanding and intuition. We should all be grateful that not everyone is paralyzed by rigor-mortis, because scientific discovery benefits from the audacity to break formal rules.
A mathematician would say that for a parameterized curve $\gamma(t) = (x(t), y(t))$, the arc length along the curve between parameter values $t_a$ and $t_b$ is
$$ \int_{t_a}^{t_b} |\dot\gamma(t)|\, dt = \int_{t_a}^{t_b} \sqrt{\dot x(t)^2 + \dot y(t)^2}\, dt $$
This definition is motivated by thinking about little pieces of time $dt$ and the fact that $|\dot\gamma(t)|$ is the speed at time $t$. For little pieces of time, the speed times time will give a little straight line piece of distance traveled, and then one simply adds up all the distances of these little pieces. In other words, there is a very nice physical motivation for this definition of arc length along a parameterized curve.
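As a sketch of that motivation, the following Python snippet adds up the lengths of the little straight pieces directly; the curve $\gamma(t) = (t, t^2)$ on $[0, 1]$ is my own example, not one from the answer.

```python
# A minimal sketch, assuming gamma(t) = (t, t**2) on [0, 1] (arbitrary example).
import math

def gamma(t):
    return (t, t ** 2)

n = 100_000
pts = [gamma(i / n) for i in range(n + 1)]
# Sum the lengths of the little straight pieces |gamma(t_{i+1}) - gamma(t_i)|
polygonal_length = sum(math.dist(pts[i], pts[i + 1]) for i in range(n))
print(polygonal_length)   # approaches the true arc length (~1.4789429) as n grows
```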
Now as a special case, suppose that the curve whose arc length we're trying to compute admits a parameterization of the following form:
$$ \alpha(s) = (s, y(s)). $$
Since the $x$-coordinate function is simply the identity function, we might as well call this parameter $x$ (it's a dummy variable anyway), in which case we get
$$ \alpha(x) = (x,y(x)). $$
Now plug this into the original arc length definition to obtain
$$ \int_{x_a}^{x_b} |\alpha'(x)|dx = \int_{x_a}^{x_b}\sqrt{1 + y'(x)^2}\, dx. $$
This is precisely the formula written down by the physicist.
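For a concrete check that this special case agrees with the exact answer, here is a hedged numerical sketch; the graph $y(x) = x^2$ on $[0, 1]$ is my own example, with the exact length $\sqrt{5}/2 + \operatorname{arcsinh}(2)/4$ coming from the standard antiderivative of $\sqrt{1 + 4x^2}$.

```python
# A minimal sketch, assuming y(x) = x**2 on [0, 1] (arbitrary example choice).
import math

def yprime(x):
    return 2 * x

n = 10_000
h = 1.0 / n
# Midpoint Riemann sum of the integral of sqrt(1 + y'(x)^2) over [0, 1]
graph_length = sum(math.sqrt(1 + yprime((i + 0.5) * h) ** 2) * h for i in range(n))

exact = math.sqrt(5) / 2 + math.asinh(2) / 4   # exact arc length of y = x^2 on [0, 1]
print(graph_length, exact)                     # both approximately 1.4789429
```

The same value also comes out of the polygonal-chord sum above, which is reassuring: the two ways of thinking about arc length match.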
Comment on the Edit. Treating derivatives as difference quotients of small quantities often works because, well, that really is what a derivative is doing. Look at the definition of the derivative as a limit of a difference quotient. If $\Delta x$ is small, then replacing the derivative by the difference quotient won't generally incur a large error, so it's not always such a bad way to look at things. This should, of course, be taken with a grain of salt when you want to rigorously clean everything up in the end, but it's often unproductive to tie your hands and not think about these things intuitively, especially when you're first learning them.
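To illustrate the point, here is a tiny sketch; the choice $f = \sin$ at $x = 1$ is mine, picked only to show how small the error of the difference quotient is for small $\Delta x$.

```python
# A minimal sketch, assuming f = sin and x = 1 (arbitrary example choices).
import math

x = 1.0
exact = math.cos(x)                            # f'(x) for f = sin
for dx in (1e-1, 1e-2, 1e-4):
    quotient = (math.sin(x + dx) - math.sin(x)) / dx
    print(f"dx={dx:.0e}  quotient={quotient:.8f}  error={abs(quotient - exact):.2e}")
```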
I enjoy and appreciate rigor as much as any other respectable citizen off the street, but I've also learned to appreciate that working loosely with mathematical quantities can often lead to great intuition and insight. Take, for example, path integrals in physics. No one really knows how to define these beasts in a way that would satisfy a modern mathematician (especially path integrals in quantum field theory), but nonetheless physicists' formal manipulations have led to some of the most accurately predicted measurements in human history.