Stochastic Integrals are confusing me; Please explain how to compute $\int W_sdW_s$ for example
Since you have lots of questions, my answer will be even longer than yours :). (Throughout my answer I'll use $a \wedge b := \min\{a,b\}$ to denote the minimum of $a$ and $b$.)
"Itô integration" and "stochastic integration"
Actually, Itô integration is a particular form of stochastic integration; there are also other ways to define stochastic integrals. However, in some sense, the Itô integral is THE stochastic integral (in the sense that it is (one of) the most important one(s)).
$X_t$ is denoted as the stochastic integral, but $X_t = X(t)$ as I understand, so this is also a stochastic process itself?
Yes, that's correct. For each fixed $t \geq 0$,
$$X_t(\omega) = \left( \int_0^t g_s \, dW_s \right)(\omega)$$
is a random variable and the family of random variables $(X_t)_{t \geq 0}$ is a stochastic process.
And what is $dW_s$?
Well, first of all it's simply a notation (yeah, I know that you don't like to hear that). We define $$\int_0^t g_s \, dW_s := \sum_{i} g_i (W_{t_{i+1} \wedge t}-W_{t_i \wedge t}), \tag{1}$$ so the left-hand side is just a notation we have introduced to shorten things.
What is this variable $s$? Is it a time variable?
Yes. If you want to get a little bit more comfortable with this, then have a look at Riemann-Stieltjes integrals; these are integrals of the form
$$\int_0^t h(s) \, df(s)$$
for "nice" functions $h$ where $f:[0,\infty) \to \mathbb{R}$ is a function of bounded variation (this is all deterministic, no dependence on $\omega$!); in particular for step functions of the form
$$h(t) = \sum_{i=1}^n c_i 1_{[t_i,t_{i+1})}(t)$$
this integral is (defined as)
$$\int_0^t h(s) \, df(s) = \sum_{i=1}^n c_i (f(t_{i+1} \wedge t) - f(t_i \wedge t)).$$
For $f(t) = t$ this yields the standard Riemann integral. On the other hand, if we formally plug in $f(t) := W_t(\omega)$ and $h(t) :=g(t,\omega)$ for fixed $\omega$ this gives $(1)$. (Note: This is just a motivation why we define the integral $(1)$ as we do it. The Itô integral is not a Riemann-Stieltjes integral.)
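As a quick sanity check (my own example, not from your notes): for the step function $h$ above and $f(t) = t$,
$$\int_0^t h(s) \, ds = \sum_{i=1}^n c_i \big((t_{i+1} \wedge t) - (t_i \wedge t)\big),$$
which is just the area under the step function up to time $t$ -- exactly what the ordinary Riemann integral gives.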
But what is $\omega$ in $g_t(\omega)$?
Well, hopefully you do know the basics of probability theory...? $\Omega$ denotes the underlying probability space and, as usual, $\omega \in \Omega$ is an element of this space. It models the randomness. (Note that $g_t$ itself is again a stochastic process.)
What does $g_t^n$ mean? $g_t = g(t)$ to the power of $n$?
No, this is not the power; just a notation. As it is written there we define
$$g_t^n(\omega) := \sum_{i=0}^{n-1} g_{t_i} 1_{[t_i,t_{i+1})}(t), \tag{2}$$
i.e. we use the notation $g_t^n$ to denote the simple process defined by (2). If you are confused by this, then always use the notation $g(t)$ instead of $g_t$, because then we can define
$$g_n(t,\omega) := \sum_{i=0}^{n-1} g_{t_i} 1_{[t_i,t_{i+1})}(t). \tag{3} $$
(I hope you are not even more confused. Basically, the trouble is that we have to put the index $n$ somewhere and if we use the lower index for the time, then the only remaining possibility is to use the upper index.)
Why is this up to $n-1$ and not $n$ or $\infty$?
We are interested in the stochastic integral $\int_0^T W_s \,dW_s$, right? So we want to define an approximation of the function $g(t,\omega) := W_t(\omega)$ on $[0,T]$. If you consider the intervals $[t_i,t_{i+1})$, $i=0,\ldots,n-1$, then you see that they cover the interval $[0,T]$. That's why it is chosen in this way.
Though why $t_i$ and not $i$ as in the definition?
You are confusing several things, the $g$ you have in this example and the $g$ from your definition of the Itô integral. Let me just rewrite the definition and then you'll understand. Our definition states that if $h$ is a simple process of the form
$$h(t,\omega) = \sum_{i \geq 0} h_i(\omega) 1_{[t_i,t_{i+1})}(t) \tag{4}$$
then
$$\int_0^t h(s) \, dW(s) := \sum_{i \geq 0} h_i (W_{t_{i+1} \wedge t}-W_{t_i \wedge t}). \tag{5}$$
Now, for fixed $n \in \mathbb{N}$, our approximating simple process $g_t^n$ (see $(3)$) is of the form $(4)$ where $h_i := g_{t_i} = W_{t_i}$. By our definition $(5)$ this gives the claimed result.
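Explicitly (writing out what I take the claimed result to be, for the partition $0 = t_0 < t_1 < \ldots < t_n = T$):
$$\int_0^T g_s^n \, dW_s = \sum_{i=0}^{n-1} W_{t_i}\,\big(W_{t_{i+1}} - W_{t_i}\big).$$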
Why $W_T^2$ in particular? [...] And is this $T=\infty$?
No, $T$ is not $\infty$! Right at the beginning of your example you stated your problem:
Find $\int_0^T W_s \, dW_s$
and here $T>0$ is some fixed real number. This is the same fixed number which keeps popping up throughout the whole proof.
$W_T^2 = \sum_{i=0}^{n-1} (W_{t_{i+1}}^2-W_{t_i}^2)$
That's just a telescoping sum.
$$\begin{align*} \sum_{i=0}^{n-1} (W_{t_{i+1}}^2-W_{t_i}^2) &= (W_{t_1}^2-W_{t_0}^2) + (W_{t_2}^2-W_{t_1}^2) + \ldots+ (W_{t_n}^2-W_{t_{n-1}}^2) \\ &= W_{t_{n}}^2 - W_{t_0}^2 = W_T^2 \end{align*}$$
since $t_n = T$ and $W_{t_0} = W_0 = 0$ (because $(W_t)_t$ is a Brownian motion).
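(If it helps to see how this telescoping sum feeds into the final answer -- the following algebraic step is presumably the one your proof uses:
$$W_{t_i}\big(W_{t_{i+1}} - W_{t_i}\big) = \frac12\big(W_{t_{i+1}}^2 - W_{t_i}^2\big) - \frac12\big(W_{t_{i+1}} - W_{t_i}\big)^2,$$
so summing over $i$ gives $\sum_{i=0}^{n-1} W_{t_i}(W_{t_{i+1}} - W_{t_i}) = \frac12 W_T^2 - \frac12 \sum_{i=0}^{n-1}(W_{t_{i+1}} - W_{t_i})^2$, and the last sum converges in $L^2$ to the quadratic variation $T$ as the mesh gets finer, which yields $\int_0^T W_s \, dW_s = \frac12\big(W_T^2 - T\big)$.)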
saz's answer is very conceptually helpful. I'd like to add that the way we would usually analytically calculate stochastic integrals is to exploit Ito's formula in a clever way. This is a lot like integration by substitution. For instance, to calculate $\int_0^t W_s dW_s$, you guess by analogy with ordinary calculus that the answer might be $\frac{1}{2} W_t^2$. Then Ito's formula tells you
$$\frac{1}{2} W_t^2 = \int_0^t W_s dW_s + \frac{1}{2} t.$$
This extra term is sometimes called the "Ito correction term". An intuitive way of viewing it is that the left side has expectation $\frac{t}{2}$ while the first term on the right side always has expectation zero. That is roughly because it is a limit of linear combinations of mean zero Gaussians. (However, the coefficients are random. The Ito convention, wherein the integrand is evaluated at the left endpoint, and the use of "non-anticipating" integrands allow us to get around this technicality.)
So something with expectation $\frac{1}{2} t$ needs to get added in; it turns out that it is just $\frac{1}{2} t$ in this particular case.
As a result of this, you get
$$\int_0^t W_s dW_s = \frac{1}{2} \left ( W_t^2 - t \right ).$$
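(Concretely -- and this is just the calculation spelled out, nothing beyond the answer above -- Ito's formula with $f(x) = \frac12 x^2$, so that $f'(x) = x$ and $f''(x) = 1$, reads
$$\frac12 W_t^2 = f(W_t) - f(W_0) = \int_0^t f'(W_s) \, dW_s + \frac12 \int_0^t f''(W_s) \, ds = \int_0^t W_s \, dW_s + \frac12 t,$$
which is the displayed identity, and rearranging gives the final formula.)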
This is sort of an "extended comment" rather than a direct answer to all of your questions; your questions all seem to revolve around "what the heck are we doing here, really?!" and while @saz gets my upvote for patiently answering each of your actual questions, I feel like the spirit of the question is perhaps still open.
The first thing to say is that noise -- stochastic mathematics -- is about doing calculus with highly non-differentiable functions. If you've got some sort of electrical circuit, you can describe the differential equations that govern it -- but if the input signal is noisy, then standard calculus can no longer make proper sense of those differential equations.
Cumulants, Sums of Random Variables, Central Limit Thm.
First off let's just breeze through a bunch of classical probability, because I want to introduce the fact that variance is pseudo-linear but it's a much more general story.
Suppose we have a random variable $X$ distributed with some PDF $f(x),$ then the Fourier transform of that PDF is called its characteristic function, $$f[k] = \langle e^{-2\pi ikX}\rangle = \int_{-\infty}^{\infty}dx~f(x)~e^{-2\pi i k x}.$$ If $Z = X + Y$ where $X$ and $Y$ are independent, and $Y$ has PDF $g(y)$ and $Z$ therefore has PDF $h(z),$ then it turns out we have $h[k] = f[k]~g[k].$ Therefore the logarithm of a Fourier transform of a PDF is pseudo-linear, in the sense that it is linear in independent random variables.
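(In case the product rule isn't obvious, it follows in one line from independence:
$$h[k] = \langle e^{-2\pi i k(X+Y)}\rangle = \langle e^{-2\pi i kX}\, e^{-2\pi i kY}\rangle = \langle e^{-2\pi i kX}\rangle\,\langle e^{-2\pi i kY}\rangle = f[k]\,g[k].$$
Taking logarithms turns the product into a sum, which is the pseudo-linearity.)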
The Maclaurin series of this logarithm is the cumulant expansion $$\log f[k] = 0 + \mu~(-2\pi i k) + \frac12~\sigma^2~(-2\pi i k)^2 + \dots = \sum_{n=1}^{\infty} \kappa_n ~ {(-2\pi i k)^n \over n!}.$$Each cumulant is likewise pseudo-linear. You can get some special cases by simply expanding out the terms individually, for example, the first derivative is $f'[k]/f[k]$ and at $k=0$ this is $\langle -2 \pi i X\rangle$ so $\mu = \langle X \rangle,$ while the second derivative is $(f[k]~f''[k] - f'[k]~f'[k])/f[k]^2$ which is $(-2\pi i)^2 \big(\langle X^2\rangle - \langle X \rangle^2\big).$ So the first two cumulants are mean and variance.
Furthermore under the scaling $X \mapsto \alpha X$ we can see that $\langle e^{-2\pi i k \alpha X}\rangle = f[\alpha k]$ and thus the effect is to map the cumulants each $\kappa_n \mapsto \alpha^n \kappa_n.$ So that's our scaling law.
The combined effect of these is one of the easiest ways to see the central limit theorem: the sum $Z = \sum_{m=1}^M X_m/M,$ for independent, identically distributed $X_m,$ is going to have cumulants $\kappa_n / M^{n-1},$ with one factor of $M$ coming from pseudo-linearity and the $M^{-n}$ coming from this scaling law. Keeping only the leading two terms gives you the mean and the variance; all the higher cumulants are suppressed by extra powers of $M$ and vanish in the limit -- but a quadratic logarithm is characteristic of a Gaussian characteristic function, and the Fourier transform of a Gaussian is a Gaussian, so you get a Gaussian probability distribution.
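(To make that precise, the usual move is to center and rescale: with $S_M = \frac{1}{\sqrt M}\sum_{m=1}^M (X_m - \mu),$ pseudo-linearity plus the scaling law give
$$\kappa_1(S_M) = 0, \qquad \kappa_2(S_M) = \sigma^2, \qquad \kappa_n(S_M) = \frac{\kappa_n}{M^{n/2-1}} \to 0 \quad (n \geq 3,\ M \to \infty),$$
so only the quadratic term of the log characteristic function survives in the limit.)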
From pseudo-linear variance to random walks.
So now if we imagine a noise source which is a limit of a random walk process -- the sum of lots of steps in random directions over infinitesimal times -- we know that the short autocorrelation time of those steps leads to the sum of that signal having a variance proportional to duration. (In other words, we have $N$ independent steps; since variance is pseudo-linear, the variance grows proportionally to $N$, but at the same time we know that for this walk $N$ grows linearly with $t$, so the variance needs to grow proportionally to the time interval of the walk.) So, we can describe a bunch of different things if we start from a perfect white noise source where $dW$ is some little unit of noise with zero mean that obeys the heuristic equation $dW^2 = dt$ -- everything else is additive constants and such. The process $W$ is often called a Wiener process or Brownian motion; it is the integral of a "Gaussian white noise" source, and $dW$ is its infinitesimal increment. We can also imagine that higher-order differentials are essentially nothing, as the Central Limit Theorem tells us that the higher-order cumulants vanish anyway, leaving just the Gaussian distribution.
Step back a second and think about how we have to think about this random walk $dW.$ In practice, we need it to also be a function with respect to time, $dW(t).$ Mathematicians for some reason usually write this as a suffix, $dW_t;$ physicists are more comfortable with writing it with parentheses. So then we have that the stochastic increment in a variable $X$ over a time step is equal to some random noise with standard deviation $\sigma(t)$ and mean $\mu(t),$ or:$$dX_t = \sigma_t ~ dW_t + \mu_t ~ dt.$$ This $X$ is known as an "Itô process." The key result about such processes -- which is the same as our heuristic $dW_t^2 = dt$ above -- is called Itô's lemma: let $f(x, t)$ be twice-differentiable and apply it term-by-term to the time-sequence $X_t$; then the resulting transformed sequence is another Itô process, and its corresponding stochastic equation is:$$\begin{align}df(X_t, t) \approx&~ f(X_t + dX_t, t + dt) - f(X_t, t)\\ \approx&~ \frac{\partial f}{\partial t} ~dt + \frac{\partial f}{\partial x}~dX_t + \frac12~\frac{\partial^2 f}{\partial x^2}~ dX_t^2. \end{align}$$Keeping terms to order $dt$ with our heuristic rule $dW_t^2 = dt$ gives:$$ df(X_t, t) = \left(\frac{\partial f}{\partial t} + \frac{\partial f}{\partial x}~\mu_t + \frac12~\frac{\partial^2 f}{\partial x^2} ~\sigma_t^2\right)~dt + \sigma_t~ \frac{\partial f}{\partial x} ~dW_t.$$This can be viewed as a modification of the chain rule to make it apply even when we're dealing with these non-differentiable functions.
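(A quick check that this machinery reproduces the integral from the original question: take $X_t = W_t,$ i.e. $\mu_t = 0$ and $\sigma_t = 1,$ and $f(x,t) = x^2.$ Then $\partial f/\partial t = 0,$ $\partial f/\partial x = 2x,$ $\partial^2 f/\partial x^2 = 2,$ so $$d(W_t^2) = 1 \cdot dt + 2 W_t \, dW_t,$$ and integrating from $0$ to $T$ and rearranging gives $\int_0^T W_s \, dW_s = \frac12(W_T^2 - T),$ the same answer as before.)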
From algebra to computer simulations.
So there turns out to be another way to do stochastic calculus, called the Stratonovich integral, which preserves the ordinary chain rule. Why would we use this clumsy Itô formalism with its broken chain rule?!
Well, it turns out to be one of those things where there's a bunch of different criteria and it's impossible to satisfactorily get them all right. What Itô does better than Stratonovich is this: once you have the Itô differential equation, you have the computer simulation. This is very powerful. The formal statement of this is given by the Itô integral: choose an integration mesh of times $0 = t_0 < t_1 < \ldots < t_N = T.$ Then $$\int_0^T H(\tau)~dW(\tau) = \lim_{N \to \infty} ~ \sum_{n=1}^N ~ H(t_{n-1}) \cdot \big(W(t_n) - W(t_{n-1})\big)$$(the limit only works, of course, if the maximum step size of the mesh shrinks to 0 as the number of points goes to infinity; note that the integrand is evaluated at the left endpoint $t_{n-1}$ -- that is precisely the Itô convention).
In other words, if you use a fine enough mesh, you can just simulate this by taking random walk steps of the appropriate size (Gaussian random variables with standard deviation $\sqrt{t_n - t_{n-1}}$) and summing up their aggregate response.
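Here is a minimal sketch of that simulation in Python (using NumPy; the step count, the seed, and the comparison against the closed form $\frac12(W_T^2 - T)$ from the other answers are my own choices, not part of any standard recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

T, N = 1.0, 100_000          # time horizon and number of mesh intervals
dt = T / N

# Brownian increments: independent Gaussians with mean 0 and variance dt
dW = rng.normal(0.0, np.sqrt(dt), size=N)
W = np.concatenate(([0.0], np.cumsum(dW)))   # W[n] approximates W(t_n), with W[0] = 0

# Ito sum: integrand evaluated at the LEFT endpoint of each interval
ito_sum = np.sum(W[:-1] * dW)

# Compare with the closed form (1/2)(W_T^2 - T) for this same path
print(ito_sum, 0.5 * (W[-1] ** 2 - T))
```

The two printed numbers should agree up to discretization error. If you instead evaluate the integrand at the midpoint of each interval (the Stratonovich choice), the sum converges to $\frac12 W_T^2$ with no correction term, which is the chain-rule-preserving convention mentioned above.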
So once you know how your differential equations work, you use this $dW = \sqrt{dt}$ heuristic to put some noise through them, keep the leading terms in $dt,$ and then you know how the output responds to a certain noise profile.
Then you run thousands of simulations with your computer to see what actually emerges -- for example, how long your wind turbines last under a realistic noisy load of wind speeds.
Where the field goes from there.
If you work in applied math or the physical sciences, then you'll usually want to determine how this noise looks in the frequency spectrum, how other noise spectra come into the integral, and you may have to deal with a discrete counting process where the increment $dN$ must be either 0 or 1, so that the sum of that process usually looks like a Poisson process -- all that good stuff. So the Wiener process is just our easiest-to-handle noise source, not necessarily our only conceivable one. Studying those $dN$ increments then becomes a task for "queueing theory."
Then there's things like control theory, where you start trying to figure out what the optimal parameters are to control a system which has a noisy input, like a stock market, or a quantum system.
And you can also kind of jump into the deep end of theory -- the late Paul Malliavin forged new ground (I think in the 70s and 80s?) by generalizing calculus-of-variations to a stochastic context, whereupon we discovered that you can do stochastic integration by parts and some guy named Anatoliy Skorokhod discovered that a generalization of the Itô integral is also the infinite-dimensional vector divergence operator.