Why the 6 in relu6?
Tensorflows documentation (https://www.tensorflow.org/api_docs/python/tf/nn/relu6) points to the following paper:
... First, we cap the units at 6, so our ReLU activation function is y = min(max(x, 0), 6). In our tests, this encourages the model to learn sparse features earlier. In the formulation of [8], this is equivalent to imagining that each ReLU unit consists of only 6 replicated bias-shifted Bernoulli units, rather than an infinute amount. We will refer to ReLU units capped at n as ReLU-n units.
http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf
Since it originates from the paper, I suspect that they tested it with different n's and got the best results for their testset with n=6.
From this reddit thread:
This is useful in making the networks ready for fixed-point inference. If you unbound the upper limit, you lose too many bits to the Q part of a Q.f number. Keeping the ReLUs bounded by 6 will let them take a max of 3 bits (upto 8) leaving 4/5 bits for .f
It seems, then, that 6 is just an arbitrary value chosen according to the number of bits you want to be able to compress your network's trained parameters into. As per the "why" only the version with value 6 is implemented, I assume it's because that's the value that fits best in 8 bits, which probably is the most common use-case.