Why does an anti-forgery token need so many bits?
Your question makes an assumption that should not be made in the field:
"wouldn't you be able to detect and lock them out after a few attempts?"
Yes, in a good working environment there should be a system like this in place that rate-limits failures of various kinds. This is a good thing to do but it should never be your first or only line of defense. If your security model relies on an extra layer like this for its theoretical security, you are introducing an unnecessary point of failure and possible attack vector.
By keeping the theoretical brute-force math out of reach, you reduce the risk posed by things which may be (or eventually end up) out of your control, such as a faulty rate limiter implementation or an inside job.
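To put rough numbers on "out of reach", here is a back-of-the-envelope sketch in Python. The guess rate of one billion per second is purely an assumption for illustration (it stands in for an attacker whom no rate limiter is slowing down), and the function name is made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def expected_years_to_guess(bits, guesses_per_second):
    """Average time to hit one specific token of `bits` bits by blind guessing.

    On average an attacker has to search half the space: 2**(bits - 1) guesses.
    """
    expected_guesses = 2 ** (bits - 1)
    return expected_guesses / guesses_per_second / SECONDS_PER_YEAR

# Assumed rate (not from the question): 1e9 guesses per second,
# i.e. the rate limiter has failed completely.
ASSUMED_RATE = 1e9

for bits in (15, 64, 130):
    years = expected_years_to_guess(bits, ASSUMED_RATE)
    print(f"{bits:>3} bits: about {years:.3g} years")
```

Under that assumed rate, 15 bits falls in well under a second, 64 bits takes on the order of a few hundred years, and 130 bits takes on the order of 10^22 years. Only the last figure stays comfortable if the assumed rate turns out to be wrong by many orders of magnitude.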
A few other factors which contribute to the need for so much entropy and why they might suggest 30 characters:
Firstly, this is to protect the application-side login action, which will then create its own authenticated session independent of the Google one. The important thing is that it's up to the application to store the token and decide when to stop accepting it. If it's kept in a session or a database table, then you might continue to accept that token indefinitely. Google can't ensure applications don't do this, so it's easier for them to recommend more entropy.
Secondly, they're proposing the use of rand() in PHP in that example. To be honest this surprises me, because that function doesn't produce cryptographically secure random numbers. In fact according to the docs it might only produce numbers between 0 and 32767 on some systems. This suggests to me that 30 characters isn't absolutely necessary to ensure security, but is simply a good idea.
Thirdly, I think "30 characters" is mainly relevant to the context of their example, where they're using rand() pumped into md5(). Since md5() produces hexadecimal output, there are only 16 possible values (4 bits) per character. Also, md5 digests are 32 characters long, and if you start truncating that then it'll be significantly easier to find a collision.
In short:
- It's easier to suggest high entropy instead of complicated ways for application developers to ensure values are expired securely and reliably.
- If people take their example code (i.e. md5(rand())) and start truncating the 32-character value, then an attacker may only have to find all the distinct n-character prefixes of md5 hashes of the values 0 to 32767, where n is the length you truncate to. This will probably produce a set of values that could be brute forced within a feasible amount of time.
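The second point can be sketched in Python. This is an illustration under the worst-case assumption named above (rand() only returning 0 to 32767), not the attacker's only option; the function name is made up, and it relies on PHP's md5() hashing the decimal string form of the integer it is given.

```python
import hashlib

RAND_MAX = 32767  # worst-case range of PHP's rand() mentioned above

def candidate_tokens(prefix_length=32):
    """Every distinct token md5(rand()) could yield, cut to prefix_length hex chars."""
    tokens = set()
    for value in range(RAND_MAX + 1):
        # PHP's md5() would hash the decimal string form of the integer.
        digest = hashlib.md5(str(value).encode("ascii")).hexdigest()
        tokens.add(digest[:prefix_length])
    return tokens

for n in (32, 8, 4, 2):
    print(f"{n:>2} hex chars kept: {len(candidate_tokens(n))} possible tokens")
```

Even untruncated, the candidate list can never hold more than 32768 entries, and an n-character hex prefix can never take more than 16**n values, so truncation only shrinks the list further.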
Looking at the example code on that page:
- The PHP code uses rand(), which on some systems is at most 15 bits and may be even less because rand() isn't necessarily a good RNG.
- The Python code uses 32 characters from an alphabet of 36 (uppercase and digits), which works out to approx. 165 bits assuming that random.choice is perfect. Which it may well not be, but we're still looking at a lot more than 15 bits.
- The use of 130 bits in the Java code falls between the two. I can't see any obvious motivation for it: the number is output in base 32, so to follow the "about 30 characters" advice I'd expect 150 bits rather than 130. It could conceivably be a typo!
- What is constant among the different examples presented is that the string is "about 30 characters" long as advised (26 for Java, 32 for the others). This one constant piece of advice doesn't tell you much about security, only about the convenience of storing, transmitting and receiving it.
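The arithmetic behind those figures can be checked with a short Python sketch. The helper name is made up for illustration, and the results are upper bounds that assume every character is chosen uniformly and independently.

```python
import math

def entropy_bits(alphabet_size, length):
    """Upper bound on the entropy of `length` symbols drawn uniformly from an alphabet."""
    return length * math.log2(alphabet_size)

python_bits = entropy_bits(36, 32)            # 32 chars from uppercase letters + digits
java_chars = math.ceil(130 / math.log2(32))   # 130 bits written out in base 32
thirty_char_bits = entropy_bits(32, 30)       # what 30 base-32 chars could carry
php_bits = math.log2(32768)                   # rand() limited to 0..32767

print(f"Python example: about {python_bits:.1f} bits")
print(f"Java example: 130 bits -> {java_chars} base-32 characters")
print(f"30 base-32 characters: {thirty_char_bits:.0f} bits")
print(f"PHP example: at most {php_bits:.0f} bits")
```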
I doubt that there exists a definitive answer to this question, other than to find out how the author of the Java snippet in particular chose the number 130. If this quantity of entropy were genuinely important to Google's security advice, then the Python and PHP examples would use it too. Which they do not.
The general principle at work is that it's cheap to generate, store and transmit these amounts of data. As such there is no benefit, and some risk, to only using as much as you really need. The result is to advise "overkill", although that PHP code falls short.
You ask why not 1024 bits -- well, base-64 encoded that would require 172 characters, and perhaps you could argue that does start to approach significance in the weight of a web page, although it would be a stretch. I'm not aware of any reason you can't use CSRF tokens that large, it's just not worth advising people to use them. Choosing between 130 and 1024 is a matter (based on experience) of the difference between a comfortable overkill and gluttony. Choosing between 130 and 64 might be more a case of the difference between comfortable overkill and a nagging feeling that maybe you're within an order of magnitude or two of not really being secure against brute force after all.
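For a sense of scale, here is a small Python sketch of how long a base-64 token gets at each of those sizes. It uses the standard library's secrets module as the random source, which is my substitution rather than anything in the examples under discussion, and the function name is hypothetical.

```python
import base64
import secrets

def base64_token(bits):
    """Return a base-64 token carrying at least `bits` bits from the OS CSPRNG."""
    n_bytes = (bits + 7) // 8  # round up to whole bytes
    return base64.b64encode(secrets.token_bytes(n_bytes)).decode("ascii")

for bits in (64, 130, 1024):
    print(f"{bits:>4} bits -> {len(base64_token(bits))} base-64 characters")
```

The lengths include base-64 padding. The jump from 130 to 1024 bits costs roughly 150 extra characters per token, which is small but buys nothing in practice.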
So when giving general advice about this kind of choice, it's fairly reasonable to analyse a worst-case situation, double the answer (or more), and then check that the result is within reasonable resource limits. If so, just use it. As such you wouldn't expect there to be anything very special about the numbers 30 or 130. It would perhaps have been instructive for Google to show its working.