Are Smalltalk bytecode optimizations worth the effort?

I think that a broader question is worth answering: are bytecodes worth the effort? Bytecodes were conceived as a compact and portable representation of code that is close to the target machine. As such, they are easy to interpret, but slow to execute.

Bytecodes do not excel at either of these games, which usually makes them a poor choice whether you want to write an interpreter or a fast VM. On one hand, AST nodes are far easier to interpret (only a few node types vs. lots of different bytecodes). On the other hand, with the advent of JIT compilers, it became clear that running native code instead is not only possible but also much faster.
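To make the comparison concrete, here is a minimal sketch in Python of the same toy expression evaluated both ways. The node classes (`Num`, `Add`, `Mul`) and the opcodes (`PUSH`, `ADD`, `MUL`) are made up for illustration and do not correspond to any real Smalltalk VM.

```python
from dataclasses import dataclass

# Hypothetical node types for a toy expression language.
@dataclass
class Num:
    value: int

@dataclass
class Add:
    left: object
    right: object

@dataclass
class Mul:
    left: object
    right: object

def eval_ast(node):
    """Tree-walking interpreter: one case per node type."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Add):
        return eval_ast(node.left) + eval_ast(node.right)
    if isinstance(node, Mul):
        return eval_ast(node.left) * eval_ast(node.right)
    raise TypeError(f"unknown node: {node!r}")

def compile_to_bytecode(node, out=None):
    """Flatten the tree into postfix instructions for a stack machine."""
    if out is None:
        out = []
    if isinstance(node, Num):
        out.append(("PUSH", node.value))
    elif isinstance(node, (Add, Mul)):
        compile_to_bytecode(node.left, out)
        compile_to_bytecode(node.right, out)
        out.append(("ADD" if isinstance(node, Add) else "MUL", None))
    else:
        raise TypeError(f"unknown node: {node!r}")
    return out

def run_bytecode(code):
    """Stack-based interpreter: one case per opcode."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()

# 2 + (3 * 4), evaluated directly and through the bytecode path.
tree = Add(Num(2), Mul(Num(3), Num(4)))
code = compile_to_bytecode(tree)
```

Both paths compute the same value; the bytecode path just adds a flattening step and a second dispatch loop. A real bytecode set has many more opcodes (specialized pushes, jumps, sends), which is where the interpreter grows.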

If you look at the most efficient VM implementations for JavaScript (arguably among the most modern compilers today) and for Java (HotSpot, Graal), you'll see they all use a tiered compilation scheme. Methods are initially interpreted from the AST, and only jitted when they become a hot spot.
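The tiering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TieredMethod`, `HOT_THRESHOLD`, and the "compile" step (which just returns a faster Python function) are hypothetical stand-ins, not how HotSpot, Graal, or any JavaScript engine is actually structured.

```python
HOT_THRESHOLD = 3  # hypothetical: calls before a method counts as hot

class TieredMethod:
    """Runs in the slow tier until hot, then switches to the fast tier."""

    def __init__(self, interpret, compile_fn):
        self.interpret = interpret    # slow tier: always available
        self.compile_fn = compile_fn  # builds the fast tier on demand
        self.compiled = None
        self.calls = 0

    def __call__(self, *args):
        if self.compiled is not None:
            return self.compiled(*args)
        self.calls += 1
        if self.calls >= HOT_THRESHOLD:
            # Hot spot detected: pay the compilation cost once,
            # so that later calls take the fast path.
            self.compiled = self.compile_fn()
        return self.interpret(*args)

tiers_used = []

def interpret_square(x):
    tiers_used.append("interp")
    return x * x

def compile_square():
    def fast_square(x):
        tiers_used.append("jit")
        return x * x
    return fast_square

square = TieredMethod(interpret_square, compile_square)
results = [square(n) for n in range(5)]
```

With a threshold of 3, the first three calls run in the interpreter and the remaining two run the "compiled" version. The point of the scheme is that compilation cost is only paid for code that runs often enough to earn it back.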

At the highest tiers of compilation there are no bytecodes. The key component in a compiler is its intermediate representation, and bytecodes do not fulfill the required properties. The most optimizable IRs are much more fine-grained: they are in SSA form, and allow an explicit representation of registers and memory. This allows for much better code analysis and optimization.
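For readers unfamiliar with SSA (static single assignment), here is a minimal Python sketch of the renaming step for straight-line code only. It assumes a made-up three-address format `(target, op, args)` and deliberately ignores control flow, so there are no phi nodes; a real compiler IR is considerably richer.

```python
def to_ssa(instructions):
    """Rename straight-line code so every variable is assigned exactly once."""
    version = {}  # variable name -> number of assignments seen so far
    current = {}  # variable name -> its latest SSA name
    out = []
    for target, op, args in instructions:
        # Uses refer to the latest definition; constants pass through.
        new_args = tuple(current.get(a, a) for a in args)
        version[target] = version.get(target, 0) + 1
        new_target = f"{target}{version[target]}"
        current[target] = new_target
        out.append((new_target, op, new_args))
    return out

# x := 1.  x := x + 2.  y := x * x.
program = [
    ("x", "const", (1,)),
    ("x", "add", ("x", 2)),
    ("y", "mul", ("x", "x")),
]
ssa = to_ssa(program)
```

After renaming, each use points at exactly one definition (`x1` or `x2`), which is what makes analyses such as constant propagation or dead-code elimination straightforward. A stack-based bytecode hides these def-use links behind pushes and pops.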

Then again, if you are interested in portable code, there isn't anything more portable than the AST. Besides, it's easier and more practical to implement AST-based debuggers and profilers than bytecode-based ones. The only remaining problem is compactness, but in any case you can implement something like ast-codes (coded ASTs, similar to bytecodes but representing the tree).

On the other hand, if you want full speed, then you'll go for a JIT with a good IR and no bytecodes. I think that bytecodes don't fill many gaps in today's VMs, but still remain mostly for backwards compatibility (there are also many examples of hardware architectures that directly execute Java bytecodes).

There are also some cool experiments with the Cog VM related to bytecodes. But from what I understand, they transform the bytecode into another IR for optimizing, then convert it back to bytecodes. I'm not sure if there's a technical gain in the last conversion besides reusing the original JIT architecture, or if there actually is any optimization at the bytecode level.


The main annoyance when you start playing with such optimizations is the debugger interface.

Historically, and still today in Squeak, the debugger simulates execution at the bytecode level and needs to map each bytecode to the corresponding piece of Smalltalk source.
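To illustrate why this mapping gets in the way, here is a minimal Python sketch of a bytecode-to-source table for the expression `3 + 4 factorial`. The table layout, the opcode names, and `source_for_pc` are assumptions made for illustration; they are not Squeak's actual debugger map format.

```python
source = "3 + 4 factorial"

# Hypothetical compiled form: (opcode, operand) pairs indexed by pc.
bytecodes = [
    ("push", 3),
    ("push", 4),
    ("send", "factorial"),
    ("send", "+"),
]

# For each pc, the (start, end) source range the debugger should highlight.
pc_to_source_range = {
    0: (0, 1),    # "3"
    1: (4, 5),    # "4"
    2: (4, 15),   # "4 factorial"
    3: (0, 15),   # "3 + 4 factorial"
}

def source_for_pc(pc):
    """Return the source text corresponding to the bytecode at pc."""
    start, end = pc_to_source_range[pc]
    return source[start:end]

highlighted = [source_for_pc(pc) for pc in range(len(bytecodes))]
```

Any optimization that removes, reorders, or fuses bytecodes has to keep a table like this consistent, otherwise the debugger highlights the wrong expression while stepping.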

So I think the gain was too low to justify the added complexity or, even worse, a degradation of the debugging facilities.

Pharo wants to change the debugger to operate at a higher level (the Abstract Syntax Tree), but I don't know how they will map that down to bytecode, which is all the VM knows about.

IMO, this kind of optimization might better be implemented in the JIT compiler, which transforms bytecode to native machine code.

EDIT

The greatest gains are in eliminating the sends themselves (by inlining), because they are much more expensive (about 10x) than stack operations: when you run 1 tinyBenchmarks on the Cog VM, it reports roughly 10 times more bytecodes executed per second than sends.

Interestingly, such optimizations could take place in the Smalltalk image, but only on hot spots detected by the VM, as in the SISTA effort. See for example https://clementbera.wordpress.com/2014/01/22/the-sista-chronicles-iii-an-intermediate-representation-for-optimizations/

So, in the light of SISTA, the answer is rather: interesting, not yet addressed, but actively studied (and work in progress)!

All the machinery for de-optimizing when the method has to be debugged is still one of the difficult points, as I understand it.