How can we accurately measure a password entropy range?
The best work I've seen in this area is by Matt Weir in Reusable Security: New Paper on Password Security Metrics (2010). He describes the difference between "Shannon entropy" and "guessing entropy". He also has an interesting method of taking a user's password, analyzing it, and offering suggestions to make it better:
....other methods for password creation policies, including our proposed method to evaluate the probability of a human generated password by parsing it with a grammar trained on previously disclosed password lists. This allows us to build a more robust reject function compared to a simple blacklist, while attempting to provide the most user freedom possible, given the security constraints of the system, when selecting their passwords.
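To give a flavor of the grammar-based idea in that quote, here is a toy sketch of my own (not Weir's actual implementation): the password is split into letter/digit/symbol segments and scored against frequencies learned from a disclosed-password corpus. Every structure, probability, and name below is a made-up placeholder.

```python
import re

# Hypothetical frequencies "trained" on a disclosed-password corpus; real tables
# would hold thousands of entries per category.
STRUCTURE_PROB = {"L8D2": 0.07, "L6D2": 0.05, "L4D4": 0.03}   # base structures
LETTER_PROB    = {"password": 0.01, "monkey": 0.002}          # alpha segment values
DIGIT_PROB     = {"12": 0.08, "07": 0.02, "99": 0.04}         # digit segment values

def structure_of(password):
    """Reduce a password to a Weir-style base structure, e.g. "love2007" -> "L4D4"."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+", password)
    segments = [("L" if p[0].isalpha() else "D" if p[0].isdigit() else "S", p)
                for p in parts]
    return "".join(f"{kind}{len(p)}" for kind, p in segments), segments

def estimated_probability(password):
    """Probability of the password under the toy grammar; 0.0 if any piece is unseen."""
    structure, segments = structure_of(password)
    prob = STRUCTURE_PROB.get(structure, 0.0)
    for kind, piece in segments:
        if kind == "L":
            prob *= LETTER_PROB.get(piece.lower(), 0.0)
        elif kind == "D":
            prob *= DIGIT_PROB.get(piece, 0.0)
        # symbol segments are omitted from this toy example
    return prob

print(estimated_probability("password12"))  # relatively likely -> should be rejected
print(estimated_probability("xkqvwz7t"))    # pieces unseen in training -> 0.0 here
```

A reject function would then refuse any password whose estimated probability exceeds some threshold, which is what makes this more flexible than a plain blacklist.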
Update: as user185 notes, Appendix A of the NIST Electronic Authentication Guideline from 2006, revised in 2013, is also very helpful. It goes into detail on calculating these two terms:
"As applied to a distribution of passwords the guessing entropy is, roughly speaking, an estimate of the average amount of work required to guess the password of a selected user, and the min-entropy is a measure of the difficulty of guessing the easiest single password to guess in the population."
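To make the distinction concrete, here is a small worked example of my own (it is not taken from the appendix, and I have left out NIST's exact conversion of the expected guess count into bits):

```python
import math

# Hypothetical fraction of users choosing each password; a real distribution has a
# long tail of rare values, but this toy one sums to 1 on its own.
distribution = {"123456": 0.5, "password": 0.3, "letmein": 0.2}

# Min-entropy looks only at the single most likely password in the population.
min_entropy_bits = -math.log2(max(distribution.values()))   # -log2(0.5) = 1.0 bit

# Guessing entropy relates to the average number of guesses needed when an attacker
# tries passwords in order of decreasing probability.
ordered = sorted(distribution.values(), reverse=True)
expected_guesses = sum(rank * p for rank, p in enumerate(ordered, start=1))
# = 1*0.5 + 2*0.3 + 3*0.2 = 1.7 guesses on average for this toy population

print(min_entropy_bits, expected_guesses)
```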
Note that this question is closely related:
- Recommended policy on password complexity
Appendix A of the NIST Electronic Authentication Guideline details the method used to construct the entropy vs. password length table (Table A.1), including a few references for further reading.
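For reference, the per-character schedule behind Table A.1 for user-chosen passwords can be sketched roughly as follows (my paraphrase; consult Appendix A for the exact rules, and note I have omitted the up-to-6-bit dictionary-check bonus):

```python
def nist_appendix_a_bits(length, composition_rule=False):
    """Rough entropy estimate for a user-chosen password: 4 bits for the first
    character, 2 bits each for characters 2-8, 1.5 bits each for characters 9-20,
    1 bit each beyond 20, plus a 6-bit bonus when a composition rule requires
    upper case and non-alphabetic characters."""
    bits = 0.0
    for position in range(1, length + 1):
        if position == 1:
            bits += 4
        elif position <= 8:
            bits += 2
        elif position <= 20:
            bits += 1.5
        else:
            bits += 1
    if composition_rule:
        bits += 6
    return bits

print(nist_appendix_a_bits(8))                          # 18.0 bits, as in Table A.1
print(nist_appendix_a_bits(8, composition_rule=True))   # 18.0 + 6-bit bonus = 24.0
```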
I think you should account for the various ways in which a password is actually attacked, which is going to take some research. Obviously a password that exactly matches a common password should be rated "very weak" (or be disallowed outright). You should probably expand on this by searching out the "default" or commonly used wordlists that script kiddies will try first when cracking passwords; readily available lists of tens of thousands of passwords (or more) will certainly be run against your users if your database is ever leaked, so include those in your "common password" check.
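A minimal sketch of that exact-match check might look like the following (the file name and function names are my own, assuming one leaked or common password per line):

```python
def load_blacklist(path="common-passwords.txt"):
    """Load a lowercased set of known-bad passwords, one per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_blacklisted(password, blacklist):
    # Compare case-insensitively so "Password" fails just like "password".
    return password.lower() in blacklist

# Usage:
# blacklist = load_blacklist()
# if is_blacklisted(candidate, blacklist):
#     reject the candidate as too common
```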
But crackers certainly won't limit themselves to a simple exact-match search, so neither should your strength meter. Research the common patterns crackers use, such as combining two words from the password dictionary, or substituting numbers and symbols for letters in "1337 speak" fashion (e.g. "p@ssw0rd$ 4r3 4w3$0m3!1"). The howsecureismypassword.net site you mention fails here: it rates "passwordpassword" as taking "345 thousand years" to crack, which is absurdly wrong; I'd guess it would fall in less than a second. These are not the only rules to consider, either: many passwords follow very simple patterns like {capital letter}{6 lowercase letters}{number}!, which are far less secure than 9 random characters (but still slightly better than a simple dictionary match). A variety of such common patterns will also be tried before resorting to brute force.
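Here is a rough sketch of the kind of normalization a meter can do before its dictionary check, so that "p@ssw0rd" or "passwordpassword" no longer scores as strong (the substitution table and regex are illustrative, not exhaustive):

```python
import re

# Common "1337" substitutions to undo before running the dictionary check.
LEET = str.maketrans({"@": "a", "0": "o", "1": "l", "3": "e", "4": "a", "$": "s", "!": "i"})

def candidate_forms(password):
    """Yield simplified forms of the password that should also be checked."""
    lowered = password.lower()
    yield lowered
    yield lowered.translate(LEET)            # "p@ssw0rd" -> "password"
    half = len(lowered) // 2
    if half and lowered[:half] * 2 == lowered:
        yield lowered[:half]                 # "passwordpassword" -> "password"

def matches_simple_pattern(password):
    """Detect the very common {Capital}{lowercase...}{digits}{optional symbol} shape."""
    return re.fullmatch(r"[A-Z][a-z]+[0-9]+[^A-Za-z0-9]?", password) is not None
```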
How you handle these transformations or word matches is up to you, but regardless they should be accounted for somehow. One thing worth looking into is how open-source tools have handled this.
As an example, the quality estimation function in the KeePass password manager reportedly handles this by calculating entropy from the number and strength of the patterns it detects in the password, rather than from raw character counts, whenever patterns are found. In older versions of the software, a naive count-based entropy estimate was simply penalized for each recognized pattern. You could probably do well with either method. The trick will be keeping the patterns up to date with advances in cracking, but at the very least accounting for the really basic stuff could drastically improve the strength of your users' passwords, especially if the interface explains which common patterns their password would be guessed by.
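As a hedged sketch of the penalty-style approach (my own simplification, not KeePass's actual algorithm): start from a naive charset-based entropy and subtract a rough penalty for each recognized pattern.

```python
import math
import re
import string

def naive_entropy_bits(password):
    """log2(charset size) * length: the classic but optimistic estimate."""
    charset = 0
    if any(c in string.ascii_lowercase for c in password):
        charset += 26
    if any(c in string.ascii_uppercase for c in password):
        charset += 26
    if any(c in string.digits for c in password):
        charset += 10
    if any(c not in string.ascii_letters + string.digits for c in password):
        charset += 33
    return len(password) * math.log2(charset) if charset else 0.0

def pattern_penalty_bits(password, common_words):
    """Crude, arbitrary penalties for a few recognized patterns."""
    penalty = 0.0
    lowered = password.lower()
    if any(word in lowered for word in common_words):
        penalty += 20     # dictionary word embedded in the password
    if re.fullmatch(r"[A-Z][a-z]+[0-9]+[^A-Za-z0-9]?", password):
        penalty += 10     # the {Capital}{lowercase}{digits}{symbol} shape
    if re.search(r"(.{2,})\1", lowered):
        penalty += 10     # repeated chunk, e.g. "passwordpassword"
    return penalty

def estimated_strength_bits(password, common_words=("password", "letmein", "qwerty")):
    return max(0.0, naive_entropy_bits(password) - pattern_penalty_bits(password, common_words))

print(estimated_strength_bits("passwordpassword"))   # heavily penalized
print(estimated_strength_bits("rT8#kpQ2w"))          # no patterns detected
```

The penalty values here are arbitrary; in practice you would tune and extend them from real cracking wordlists and rulesets.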