Why would it be beneficial to have a separate projection matrix, yet combine the model and view matrices?
Look at it practically. First, the fewer matrices you send, the fewer matrices you have to multiply positions/normals/etc. by, and therefore the faster your vertex shaders run.
So point 1: fewer matrices is better.
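For concreteness, here is a minimal sketch of that idea using GLM (assumed available); the camera placement and the uniform name are made up for illustration:

```cpp
// Combine the model and view matrices once on the CPU so the vertex shader
// receives a single matrix instead of two.
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

int main() {
    // Hypothetical camera and object placement.
    glm::mat4 view  = glm::lookAt(glm::vec3(0.0f, 2.0f, 5.0f),   // eye
                                  glm::vec3(0.0f, 0.0f, 0.0f),   // target
                                  glm::vec3(0.0f, 1.0f, 0.0f));  // up
    glm::mat4 model = glm::translate(glm::mat4(1.0f), glm::vec3(1.0f, 0.0f, 0.0f));

    // One matrix-matrix multiplication here, once per object per frame...
    glm::mat4 modelView = view * model;

    // ...saves one matrix-vector multiplication per vertex in the shader.
    // Upload it as a single uniform, e.g.:
    // glUniformMatrix4fv(modelViewLocation, 1, GL_FALSE, glm::value_ptr(modelView));
    return 0;
}
```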
However, there are certain things you probably need to do. Unless you're doing 2D rendering or a simple 3D demo application, you are going to need to do lighting. This typically means transforming positions and normals into either world or camera (view) space, then doing some lighting operations on them (either in the vertex shader or the fragment shader).
You can't do that if you only go from model space to projection space. You cannot do lighting in post-projection space, because that space is non-linear (the perspective divide distorts angles and distances), so the math becomes much more complicated.
So, point 2: You need at least one stop between model and projection.
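To make that stop concrete, here is a sketch of the eye-space lighting math using GLM (assumed available); the scene values and the light direction are invented for illustration. Note the normal is transformed by the inverse-transpose of the upper-left 3x3, since a plain multiply would skew it under non-uniform scaling:

```cpp
// Transform a position and normal into eye (camera) space and evaluate a
// simple Lambert diffuse term there -- the same math a shader would run.
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <iostream>

int main() {
    glm::mat4 view  = glm::lookAt(glm::vec3(0.0f, 0.0f, 5.0f),
                                  glm::vec3(0.0f), glm::vec3(0.0f, 1.0f, 0.0f));
    glm::mat4 model = glm::scale(glm::mat4(1.0f), glm::vec3(1.0f, 2.0f, 1.0f)); // non-uniform
    glm::mat4 modelView = view * model;

    glm::vec3 position(0.0f, 1.0f, 0.0f);
    glm::vec3 normal(0.0f, 1.0f, 0.0f);

    // Positions use the full model-view matrix; normals use its inverse-transpose.
    glm::vec3 eyePos    = glm::vec3(modelView * glm::vec4(position, 1.0f));
    glm::mat3 normalMat = glm::transpose(glm::inverse(glm::mat3(modelView)));
    glm::vec3 eyeNormal = glm::normalize(normalMat * normal);

    glm::vec3 lightDirEye = glm::normalize(glm::vec3(0.0f, 1.0f, 1.0f)); // hypothetical light
    float diffuse = glm::max(glm::dot(eyeNormal, lightDirEye), 0.0f);
    std::cout << "diffuse = " << diffuse << "\n";
    return 0;
}
```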
So we need at least 2 matrices. Why model-to-camera rather than model-to-world? Because working in world space in shaders is a bad idea: you can run into numerical precision problems with positions far from the origin. If you work in camera space instead, you avoid those problems, because nothing is too far from the camera (and if it is, it should probably be outside the far depth plane).
Therefore: we use camera space as the intermediate space for lighting.
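If you want to see the precision issue directly: the gap between adjacent representable 32-bit floats grows with magnitude. This small self-contained demo (the coordinates are arbitrary) prints it:

```cpp
// The spacing between neighboring floats (one ULP) grows with magnitude,
// which is exactly why huge world-space coordinates lose fine detail.
#include <cmath>
#include <cstdio>

int main() {
    float nearOrigin = 1.0f;        // an object near the world origin
    float farAway    = 1000000.0f;  // an object a million units out

    printf("spacing near origin: %.9f\n", std::nextafter(nearOrigin, 2.0f) - nearOrigin);
    printf("spacing at 1e6:      %.9f\n", std::nextafter(farAway, 2.0e6f) - farAway);
    // Around 1e6 the spacing is 0.0625 units, so anything finer than that is
    // unrepresentable there; in camera space, nearby geometry stays near zero.
    return 0;
}
```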
In most cases your shader will need the geometry in world or eye coordinates for shading, so you have to separate the projection matrix from the model and view matrices.
Making the shader multiply the geometry by two matrices hurts performance. Assuming each model has thousands (or more) of vertices, it is more efficient to compute the model-view matrix on the CPU once and let the shader do one fewer matrix-vector multiplication per vertex.
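Here is what the shader side of that might look like; the GLSL below is a hypothetical sketch (uniform and attribute names are assumptions), embedded in a C++ string the way it would be handed to glShaderSource:

```cpp
#include <cstdio>

// With a precombined model-view matrix, the vertex shader does one
// matrix-vector multiplication to reach eye space instead of two, while the
// separate projection matrix keeps eye-space values available for lighting.
const char* kVertexShaderSrc = R"(#version 330 core
layout(location = 0) in vec3 a_position;

uniform mat4 u_modelView;   // computed once on the CPU as view * model
uniform mat4 u_projection;  // kept separate

out vec3 v_eyePos;          // eye-space position for lighting in later stages

void main() {
    vec4 eyePos = u_modelView * vec4(a_position, 1.0); // one multiply per vertex
    v_eyePos    = eyePos.xyz;
    gl_Position = u_projection * eyePos;
}
)";

int main() {
    // In a real program this string would go to glShaderSource/glCompileShader;
    // here we just print it.
    std::puts(kVertexShaderSrc);
    return 0;
}
```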