OpenGL 4.1 and 3.1+: what are the key differences?

If your question is "How can the workflow be better in 4.1", that's simply not what 4.1 is about.

First, a quick definition, to make sure we're talking about the same thing. For me, "workflow" means API improvements and things that make performance better. These don't allow the hardware to do anything you couldn't before; they just make it easier for the programmer or let you get faster performance.

The vast majority of the API improvements, the ones that aren't based on new hardware features, are available to 3.3 implementations as core extensions. Because they are core extensions, their functions and enumerators carry no "ARB" suffix, so code written against them on 3.3 works unchanged as 4.1 code. It all just works. In particular, I'm talking about program separation (GL_ARB_separate_shader_objects) and retrieving binaries of compiled programs (GL_ARB_get_program_binary). Both are supported on 3.3 hardware; NVIDIA even extends these all the way back to GeForce 6xxx chips.
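Since these features are optional on a 3.3 implementation, you would check for them at runtime before relying on them. Here is a minimal sketch of that check. It assumes you have already collected the extension names into an array, for example by calling glGetStringi(GL_EXTENSIONS, i) for each i below GL_NUM_EXTENSIONS; has_extension is a hypothetical helper name, not part of OpenGL, and it takes the list as a parameter so it compiles without GL headers or a context.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an OpenGL function).
 * Returns 1 if `wanted` matches one of the `count` strings in `names`
 * exactly, 0 otherwise. In a real program `names` would be filled from
 * glGetStringi(GL_EXTENSIONS, i); here it is just an array of C strings. */
int has_extension(const char *const *names, size_t count, const char *wanted)
{
    assert(wanted != NULL);
    for (size_t i = 0; i < count; ++i) {
        /* Exact comparison: a substring search could wrongly match a
         * longer extension name that merely starts with `wanted`. */
        if (names[i] != NULL && strcmp(names[i], wanted) == 0)
            return 1;
    }
    return 0;
}
```

If the check for GL_ARB_get_program_binary succeeds on a 3.3 context, the unsuffixed entry points such as glGetProgramBinary and glProgramBinary are the same ones you would call on 4.1.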

The main exception to this is shader subroutines, which are limited to 4.x hardware. However, this feature is so poorly specified that I'm not sure anyone even can use it, let alone should. It is convoluted and somewhat confusing.

There isn't much that one could easily use to boost performance that is unique to 4.1. Bindless rendering (GL_NV_vertex_buffer_unified_memory) is probably the biggest performance enhancement, if vertex buffer binding overhead is a bottleneck for you. As you probably noticed from the name, it is an NVIDIA extension and not core. I'm sure the ARB is working on something not entirely unlike this for a core feature in a future spec. And bindless isn't unique to 4.x hardware; again, NVIDIA extends this all the way back to GeForce 6xxx chips.

There are some things in 4.x that can enhance performance, but they all ultimately revolve around some form of GPGPU work. Indirect rendering (GL_ARB_draw_indirect) would be a good speedup if you are generating rendering data from OpenCL, because the draw parameters can stay in a buffer on the GPU instead of being read back to the CPU. And Civilization V has already shown the value of using GPGPU technologies (they use DirectCompute, but you could do it with OpenCL too) to decompress textures; this helps tremendously in loading performance, as you don't have to load as much data from disk.
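To make the indirect rendering point concrete, here is a rough sketch of the command that glDrawArraysIndirect reads from the buffer bound to GL_DRAW_INDIRECT_BUFFER, following the four-field layout given in GL_ARB_draw_indirect. Plain uint32_t stands in for GLuint so the snippet compiles without GL headers, and make_draw_command is a hypothetical helper that does on the CPU what an OpenCL kernel would do on the GPU.

```c
#include <assert.h>
#include <stdint.h>

/* One indirect draw command, as laid out in GL_ARB_draw_indirect.
 * Field names are adapted to C style; the order and sizes are what matter. */
typedef struct {
    uint32_t count;                 /* number of vertices to draw */
    uint32_t prim_count;            /* number of instances */
    uint32_t first;                 /* index of the first vertex */
    uint32_t reserved_must_be_zero; /* reserved in 4.0/4.1; must be 0 */
} DrawArraysIndirectCommand;

/* The GPU reads this as four tightly packed 32-bit words. */
static_assert(sizeof(DrawArraysIndirectCommand) == 16,
              "indirect command must be four packed 32-bit words");

/* Hypothetical helper: fills in one command. With OpenCL, a kernel would
 * write these same four words into a shared buffer instead. */
DrawArraysIndirectCommand make_draw_command(uint32_t vertex_count,
                                            uint32_t instance_count,
                                            uint32_t first_vertex)
{
    DrawArraysIndirectCommand cmd;
    cmd.count = vertex_count;
    cmd.prim_count = instance_count;
    cmd.first = first_vertex;
    cmd.reserved_must_be_zero = 0;
    return cmd;
}
```

The gain comes from who writes the struct: if the GPU decides how many vertices or instances to draw, the CPU never has to wait for that number before issuing the draw call.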

If you want to really stretch the definition of performance improvement, tessellation could be considered a performance enhancement. You could use it to send smaller meshes, or use your lower LOD meshes closer to the camera. Or you could consider it a way to render higher-polygon meshes than you ever could before.

4.x really isn't about providing hardware features that make things go faster. It's more about being able to render in different ways than you could before.

And one more thing: there isn't a choice between 3.1 and 3.3. Pretty much any hardware that could run 3.1 can run 3.3. If it can't, then that's because the hardware maker is slacking off on their OpenGL drivers (I'm looking at you, Intel).