Why do we need perspective division?
I mean why do we need that?
In layman's terms: to make perspective distortion work. In a perspective projection matrix, the Z coordinate gets "mixed" into the W output component, and the perspective division then divides the other components by W. So the smaller the distance along Z, i.e. the closer a point is to the viewer, the more it gets scaled up, i.e. the bigger it appears on screen.
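To make the "mixing" concrete, here is a minimal sketch in plain Python. The matrix is a deliberately stripped-down illustration, not the full OpenGL projection matrix, and it assumes that positive Z points away from the viewer. All names are made up for this example.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# Illustrative matrix: identity, except that the last row copies z into w.
MIX_Z_INTO_W = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

def perspective_divide(clip):
    """Divide x, y and z by w."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)

def project(point):
    """Apply the matrix to (x, y, z, 1), then do the perspective division."""
    x, y, z = point
    return perspective_divide(mat_vec(MIX_Z_INTO_W, [x, y, z, 1.0]))

near = project((1.0, 1.0, 2.0))  # w = 2, so x and y are halved
far = project((1.0, 1.0, 4.0))   # w = 4, so x and y are quartered
```

The same x and y end up half as large on screen when the point is twice as far away. A matrix multiplication alone cannot produce that, because it cannot divide by a coordinate; the division by W supplies it.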
Some details that complement the general answers:
The idea is to project a point (x,y,z) onto a screen at distance d, which gives (xs,ys,d). The next figure shows this for the y coordinate.
We know from school that
tan(alpha) = ys / d = y / z
This means that the projection is computed as
ys = d*y/z = y/w
where w = z/d
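As a quick numeric check of the relation above, the following sketch computes ys through w and confirms that both triangles share the same angle alpha. The function name and the sample values are chosen only for illustration.

```python
import math

def project_y(y, z, d):
    """Project y onto a screen at distance d using w = z / d."""
    w = z / d
    return y / w

y, z, d = 2.0, 8.0, 2.0
ys = project_y(y, z, d)  # w = 4.0

# Same value as the direct similar-triangles form d*y/z.
same_as_direct = math.isclose(ys, d * y / z)

# Both right triangles have the same angle: tan(alpha) = ys/d = y/z.
same_angle = math.isclose(math.atan2(ys, d), math.atan2(y, z))
```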
This is enough to apply a projection. However, in OpenGL you want (xs,ys,zs) to be normalized device coordinates in [-1,1], and yes, this has something to do with clipping.
The extreme values for (xs,ys,zs) bound a cube that spans [-1,1] on each axis, and everything outside it will be clipped. So a projection matrix usually takes the clipping limits (the frustum) into account, so that a single transformation, together with the perspective division, simultaneously applies the projection and transforms the projected coordinates, along with z, to normalized device coordinates.
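A sketch of such a combined transformation, in plain Python: the matrix layout follows the one documented for glFrustum, and it assumes the classic OpenGL convention that the camera looks down the negative Z axis (so W becomes -z). The helper names and the sample frustum are only for illustration.

```python
def frustum_matrix(l, r, b, t, n, f):
    """Perspective matrix for a frustum given by left/right/bottom/top
    on the near plane and the near/far distances (glFrustum layout)."""
    return [
        [2 * n / (r - l), 0.0, (r + l) / (r - l), 0.0],
        [0.0, 2 * n / (t - b), (t + b) / (t - b), 0.0],
        [0.0, 0.0, -(f + n) / (f - n), -2 * f * n / (f - n)],
        [0.0, 0.0, -1.0, 0.0],  # copies -z into w
    ]

def to_ndc(m, point):
    """Transform an eye-space point to clip space, then divide by w."""
    x, y, z = point
    v = (x, y, z, 1.0)
    cx, cy, cz, cw = (sum(row[i] * v[i] for i in range(4)) for row in m)
    return (cx / cw, cy / cw, cz / cw)

# Symmetric frustum: near plane spans [-1,1] x [-1,1] at distance 1, far plane at 10.
m = frustum_matrix(-1.0, 1.0, -1.0, 1.0, 1.0, 10.0)

near_corner = to_ndc(m, (-1.0, -1.0, -1.0))  # bottom-left corner of the near plane
far_center = to_ndc(m, (0.0, 0.0, -10.0))    # center of the far plane
```

The near-plane corner lands on the (-1,-1,-1) corner of the cube and the far-plane center lands at z = 1, so the frustum limits map onto the clipping limits after the division.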