Caffe2: Load ONNX model and run inference single-threaded on a multi-core host / Docker

This is not a direct answer to the question, but if your goal is to serve PyTorch models (and only PyTorch models, as in my case) in production, simply using PyTorch tracing seems to be the better choice.

You can then load the traced model directly into a C++ frontend, similarly to what you would do through Caffe2, but PyTorch tracing seems better maintained. From what I can see there is no speed penalty, and it is a whole lot easier to configure.

For example, to get good performance in a single-core container, run with OMP_NUM_THREADS=1 as before and export the model as follows:

import torch
from torch import jit

# create/load the model, then switch it to inference mode
model.eval()

# trace with an example input of the correct shape, then save
traced = jit.trace(model, torch.from_numpy(an_array_with_input_size))
traced.save("traced.pt")

You can then run the model in production either in pure C++, following the guide above, or through the Python interface like so:

from torch import jit

# load the traced model and run a forward pass
model = jit.load("traced.pt")
output = model(some_input)
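
For completeness, here is a sketch of what serving a single request might look like end to end; the input shape and the name preprocessed_batch are placeholders for illustration, not part of the original answer:

import numpy as np
import torch
from torch import jit

model = jit.load("traced.pt")

# build an input tensor with the same shape the model was traced with
# (placeholder shape, adjust to your model)
preprocessed_batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
some_input = torch.from_numpy(preprocessed_batch)

# disable autograd bookkeeping for inference
with torch.no_grad():
    output = model(some_input)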