Why does the power rule work?
First let's try to understand why the derivative of the function $f$ given by $f(x) = x^2$ is equal to $2x$ and not to $x$. (The product rule and the power rule are both generalizations of this.)
Imagine that you have a square whose sides have length $x$. Now imagine what happens to its area if we increase the length of each side by a small amount $\Delta x$. We can do this by adding three regions to the picture: two thin rectangles measuring $x$ by $\Delta x$ (say one on the right of the square and another on the top) and one small square measuring $\Delta x$ by $\Delta x$ (say added in the top right corner). So the change in the area $x^2$ is equal to $2x \cdot \Delta x + (\Delta x)^2$. If we divide this by $\Delta x$, we get $2x + \Delta x$, and taking the limit as $\Delta x$ approaches zero leaves $2x$.
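The picture above can be checked numerically. The following is only a small sketch: the helper name `area_change` and the sample value $x = 3$ are my own choices for illustration, and exact fractions are used so that rounding does not blur the comparison.

```python
from fractions import Fraction


def area_change(x, dx):
    """Split the area gained by growing an x-by-x square to (x+dx)-by-(x+dx).

    Returns (area of the two thin rectangles, area of the small corner square).
    """
    rectangles = 2 * x * dx   # two strips, each x by dx
    corner = dx * dx          # the dx-by-dx square in the corner
    return rectangles, corner


x = Fraction(3)  # illustrative side length (an assumption, any value works)
for dx in (Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)):
    rectangles, corner = area_change(x, dx)
    total = (x + dx) ** 2 - x ** 2
    # The three added regions account for the whole change in area.
    assert total == rectangles + corner
    # The corner's share of the change shrinks along with dx.
    print(dx, float(total / dx), float(corner / total))
```

As `dx` shrinks, the printed quotient `total / dx` equals $2x + \Delta x$ and so drifts toward $2x = 6$, while the corner square's share of the total change goes to zero.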
So geometrically what is happening is that the small square in the corner is too small to matter, but you have to count both rectangles. If you only count one of them, you will get the answer $x$; however, this only tells you what happens when you lengthen, say, the horizontal sides and not the vertical sides of your square to get a rectangle. This is a different problem than the one under consideration, which asks (after we put it in geometrical terms) how the area varies as we lengthen all the sides.
If you know the product rule, you can derive the power rule for $x^u$ when $u$ is a positive integer, which should give you a basic intuitive understanding.
Suppose, as an example, that $f(x) = x^2$. Equivalently, $f(x) = x \cdot x$. By using the product rule, we have $$f'(x) = (1)(x) + (x)(1) $$ $$= 2x$$
More generally, suppose that $f(x) = x^u$. Suppose for now that $u$ is a positive integer, which allows us to expand like this: $$f(x) = \underbrace{(x)(x)\cdots(x)(x)}_{u\text{ factors}}$$ Using the product rule repeatedly, we can say $$f'(x) = \underbrace{\underbrace{(1)(x)\cdots(x)(x)}_{u \text{ factors}} + (x)(1)\cdots(x)(x) + \cdots + (x)(x)\cdots(1)(x) + (x)(x)\cdots(x)(1)}_{u\text{ terms}}$$ Each of the $u$ terms is a product of one $1$ and $u-1$ copies of $x$, so each equals $x^{u-1}$, and the sum simplifies to $$f'(x)=ux^{u-1}$$
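To see the product-rule argument play out mechanically, here is a rough sketch that differentiates $x^u$ using only two facts: the derivative of $x$ is $1$, and $(x \cdot g)' = 1 \cdot g + x \cdot g'$. Polynomials are stored as coefficient lists (index = exponent); that representation and all the helper names are assumptions made for this illustration.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (index = exponent)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_add(p, q):
    """Add two polynomials, padding the shorter list with zeros."""
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a + b for a, b in zip(p, q)]


X = [0, 1]   # the polynomial x
ONE = [1]    # the derivative of x


def power(u):
    """Return x^u as a coefficient list by multiplying u copies of x."""
    result = [1]
    for _ in range(u):
        result = poly_mul(result, X)
    return result


def derivative_of_power(u):
    """Differentiate x^u = x * x^(u-1) using only the product rule."""
    if u == 1:
        return ONE
    g = power(u - 1)
    # (x * g)' = 1 * g + x * g'
    return poly_add(poly_mul(ONE, g), poly_mul(X, derivative_of_power(u - 1)))


for u in range(1, 6):
    print(u, derivative_of_power(u))
```

For each `u`, the only nonzero coefficient in the result sits at exponent `u - 1` and has value `u`, which is the pattern $ux^{u-1}$.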
For positive integers $n$, we can use the Binomial Theorem. Let $f(x)=x^n$. We want to find the slope of the tangent line to $y=f(x)$ at $x=a$. So take a very small $h$, and calculate $$\frac{(a+h)^n-a^n}{h}.$$ By the Binomial Theorem, $(a+h)^n=a^n+na^{n-1}h +\binom{n}{2}a^{n-2}h^2+\cdots$.
Subtracting $a^n$ and dividing by $h$ leaves $na^{n-1}$ plus terms that each still contain a factor of $h$. Since $h$ is tiny, those remaining terms are negligible. Thus $$\frac{(a+h)^n-a^n}{h}\approx na^{n-1}.$$
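The size of the neglected terms can be made concrete. This is a minimal sketch, assuming the sample values $a = 2$ and $n = 5$ (chosen only for illustration); `difference_quotient` and `leftover` are hypothetical helper names, and exact fractions keep the comparison free of rounding error.

```python
from fractions import Fraction
from math import comb


def difference_quotient(a, h, n):
    """Slope of the secant line of x^n between a and a + h."""
    return ((a + h) ** n - a ** n) / h


def leftover(a, h, n):
    """Binomial terms with k >= 2, after dividing by h.

    Each term C(n, k) * a^(n-k) * h^(k-1) still carries at least one factor of h.
    """
    return sum(comb(n, k) * a ** (n - k) * h ** (k - 1) for k in range(2, n + 1))


a, n = Fraction(2), 5  # illustrative values (assumptions)
for h in (Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)):
    q = difference_quotient(a, h, n)
    # Binomial Theorem: the quotient is n*a^(n-1) plus the leftover terms.
    assert q == n * a ** (n - 1) + leftover(a, h, n)
    print(h, float(q), float(leftover(a, h, n)))
```

The leftover column shrinks roughly in proportion to `h`, so the quotient settles toward $na^{n-1} = 5 \cdot 2^4 = 80$.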