One-point perspective formula
If you have a 3D coordinate system with origin $(0, 0, 0)$ at your (dominant) eye, the picture plane at $z = d$, and the vanishing point at origin in your picture, then each 3D point $(x, y, z)$ can be trivially projected to your picture plane at $(x', y')$: $$\begin{array}{l} x' = x \frac{d}{z} \\ y' = y \frac{d}{z} \end{array}$$
This is not a trick. It is not an approximation either. If the coordinate systems and the location of the observer are defined this way, you do get the 2D-projected coordinates simply by multiplying them by $d / z$, where $d$ is the distance from the observer to the projection plane (picture), and $z$ is the 3D $z$-coordinate for that point. I've explained why at the beginning of this answer.
If you want the 3D origin to be on the picture plane, with your dominant eye at $(0, 0, -d)$, then $$\begin{array}{l} x' = x \frac{d}{z + d} \\ y' = y \frac{d}{z + d} \end{array}$$
If you want the 3D origin, and the vanishing point, to be at $( x_0 , y_0 )$ on the picture plane, and your eye is at $( x_0 , y_0 , -d )$, then $$\begin{array}{l} x' = x_0 + ( x - x_0 )\frac{d}{z + d} \\ y' = y_0 + ( y - y_0 )\frac{d}{z + d} \end{array}$$
The case where the eye is at $(0, 0, -d)$ but vanishing point at $( x_0 , y_0 )$ on the image plane, is a bit funny and somewhat more complicated, because the picture plane is skewed: it is not perpendicular to the line between the eye and the vanishing point $(0, 0, \infty)$. I haven't worked out the math for that case, because I've never needed to find out how to project an image that would be mostly viewed from the side.
Other projection models can be constructed from the same principles I outlined here, but the corresponding formulae are more complicated.
For example, two- and three-point perspectives can be modeled using a 3D coordinate system where origin is at the origin of the picture plane, but the 3D $x$ and $z$ axes (for two-point perspective), or $x$, $y$, and $z$ axes (for three-point perspective) are towards their respective vanishing points. The formulas for $x'$ and $y'$ are more complicated than above, because the 3D coordinate system is rotated compared to the 2D one.
(After all, vanishing points are nothing special or magical: any point infinitely far away is a vanishing point. When used as a geometric tool for perspective, we simply utilize the fact that all parallel lines meet at the same vanishing point. If you wanted to draw a picture of a quarter henge with say five monoliths, you could use five vanishing points, one for each monolith.)
Non-planar projections, for example projecting on a cylinder, are derived the same way using optics and linear algebra: simply put, by finding where the line between the eye and the detail we are interested in, intersects the projection surface.
Matrices are only used in this kind of math, because they let us write the operations needed in more concise form: in shorthand, in a way. In fact, if you can add, subtract, multiply, divide, and use up to three variables in a formula, you know enough math to start delving into descriptive geometry. If you then learn about vectors, matrices, and versors, you can quickly master the principles used in 3D graphics, ray tracing, and descriptive geometry applications in general.
Using the fact that the columns of a transformation matrix are the images of the domain’s basis, we can immediately write down a matrix for the projection represented by the image: $$P=\begin{bmatrix}1&0&-\mu x_P&0\\0&1&-\mu y_P&0\\0&0&-\mu&1\end{bmatrix}.$$ The first two columns say that the vanishing points of the $x$- and $y$-axes are still at infinity (and that there’s no scaling in those directions), and the last column represents the origin’s image, which I’m assuming is the origin in the image’s coordinate system. The third column is the $z$-axis vanishing point, negated because we’re sighting “down” along the $z$-axis. It also includes a “foreshortening factor” $\mu$ which you’ll need to calibrate to get the desired spacing.
The image of any 3-D point is the product of $P$ and the point’s homogeneous coordinates: $$\begin{bmatrix}1&0&-\mu x_P&0\\0&1&-\mu y_P&0\\0&0&-\mu&1 \end{bmatrix}\begin{bmatrix}x\\y\\z\\1\end{bmatrix} = \begin{bmatrix}x-\mu x_Pz\\y-\mu y_Pz\\1-\mu z\end{bmatrix}.$$ These are the homogeneous coordinates of the image point, so we convert to Cartesian by dividing through by the last component. Thus, we have the mapping $$(x,y,z)\mapsto\left({x-\mu x_Pz\over1-\mu z},{y-\mu y_Pz\over1-\mu z}\right).\tag{*}$$ So, to find where the edges of the cubes in the stack should be, plug appropriate values of $x$, $y$ and $z$ into this formula.
For example, with the vanishing point at $(3,2)$ as pictured, we have $(1,0,z)\mapsto\left({1-3\mu z\over1-\mu z},{-2\mu z\over1-\mu z}\right)$. The images of the vertices will be at $z=0,-1,-2,-3,\dots$. This gives you the correct proportional spacing, but there’s still the parameter $\mu$ that you need to set to get a particular set of vertices. We can try to match your illustration: For $z=-1$, the image point is $\left({1+3\mu\over1+\mu},{2\mu\over1+\mu}\right)$. In your illustration, the $x$-coordinate of this vertex looks like it’s at about $1.4$, which corresponds to $\mu=0.25$. With this value for $\mu$, the unit tick marks along this line will be at $(1.4,0.4)$, $(1.67,0.67)$, $(1.86,0.86)$, $(2,1)$, and so on. We find the corresponding points along the upper right edge by mapping $(1,1,z)$ using the computed value of $\mu$: $(1.4,1.2)$, $(1.67,1.33)$, $(1.86,1.43)$, $(2,1.5)$, &c. The $x$-coordinates of these two sequences are identical, as one would expect.