How does correct grammar affect password security?

In general, any information which can narrow the search space for a password will reduce the strength of that password. So, in theory, it would make sense to assume that grammatically correct passwords are potentially weaker than those which are a collection of unrelated words or have deliberate grammatical errors. However, calculating exactly what the differences would be is extremely hard.

Many password-cracking programs let you define complex patterns. For example, it has been observed that people often use dates as a way of including numbers in a password, e.g. password1961 or even password171067 (or password101767 in US date format), so some password crackers will search for things like [dictionary-word][year] and [dictionary-word][date], where the numbers in the year/date are restricted to digits which would be valid and within an 'expected' range (i.e. assume the year/date relates to the current or a recent period, or to the user's date of birth, etc.). Likewise, studies of passwords indicate people tend to put 'special' characters at the borders of password components, e.g. password:1972. This suggests it would be a good idea to avoid numbers in these formats and consider using 3-, 5- or 7-digit numbers rather than 4 or 6, and, if you add special/punctuation characters, to do so in /unusual/ positions, e.g. pas:sword1972 (and of course, don't use 'password' :-( ).
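To make the effect of such a pattern concrete, here is a minimal Python sketch of the [dictionary-word][year] idea. The three-word list and the 1940 to 2025 year range are purely illustrative assumptions of mine, not taken from any real cracking tool; the point is only how much the 'expected range' restriction shrinks the number of guesses compared with four unrestricted digits.

```python
from itertools import product

def word_year_candidates(words, first_year, last_year):
    """Yield every [dictionary-word][year] guess for the given year range."""
    for word, year in product(words, range(first_year, last_year + 1)):
        yield f"{word}{year}"

# Hypothetical toy inputs; a real word list would be far larger.
words = ["password", "dragon", "monkey"]
candidates = list(word_year_candidates(words, 1940, 2025))

# Guesses when the four digits must form a plausible year (86 years per word).
restricted = len(candidates)
# Guesses when the four digits are unrestricted (10**4 per word).
unrestricted = len(words) * 10**4

print(restricted, unrestricted)
```

With these toy numbers the pattern needs 258 guesses instead of 30000, which is why predictable number formats add so little strength.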

As a cracker, the challenge of using grammar would be in how to model it. For example, English has a very complex grammar. This is partially why natural language processing is such a challenge. Theoretically, if you could define the grammar with sufficient accuracy and had a large enough dictionary, you could build a system which could produce a dictionary of /valid/ sentences. However, this would still represent a very large search space. If you know exactly how many characters are in the password, this would help reduce the search space, but it would still be very large. What would need to be determined is how much smaller such a dictionary would be compared to a similar dictionary consisting of random words concatenated together. It would be smaller, but whether it would be sufficiently smaller to make any practical difference is unknown. If the grammar-based dictionary meant an average search time of 50 years and the random-word dictionary meant an average search time of 500 years, in reality neither approach is going to be useful (assuming there are no other optimizations which could reduce the time to a practical level).
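The hypothetical 50-year versus 500-year figures can be restated in bits of entropy, which makes the comparison easier to reason about. The sketch below is only arithmetic; the 2,048-word list used for comparison is an assumption of mine for illustration.

```python
import math

def bits_lost(reduction_factor):
    """Entropy (in bits) lost when the search space shrinks by this factor."""
    return math.log2(reduction_factor)

# 500 years down to 50 years is a tenfold reduction in search space.
grammar_penalty = bits_lost(500 / 50)

# Bits gained by appending one random word from a hypothetical 2,048-word list.
extra_word_gain = math.log2(2048)

print(round(grammar_penalty, 2), extra_word_gain)
```

Under these assumptions, a tenfold reduction costs a little over 3 bits, while one additional random word adds 11 bits, so a longer phrase can more than make up for a penalty of that size.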

Rather than a grammar-based dictionary, I would probably compile a dictionary based on quotes, well-known poetry and song lyrics. My theory is that when people use a phrase as a password, that phrase will be something which is easy to remember and therefore likely based on a song, poem or favourite quote. This would be an even smaller dictionary. The challenge would be in building the database and ensuring it is sufficiently comprehensive. That is probably getting easier given all the digital repositories of quotes, songs, poetry, etc. out there.

Personally, I wouldn't worry about this too much. Obviously, it would be best not to use well-known phrases, and it is quite important not to use a phrase which someone might be able to identify via social engineering techniques. If you're a military person, don't use famous military quotes/speeches; if you're a Christian, avoid using quotes from the Bible; if you're an obsessive fan of some singer/actor/whatever, avoid using quotes from that person. Essentially, avoid using anything which anyone who does some research on you might be able to use to narrow the search space. I would also suggest using as long a phrase as possible. A very long known quote is probably stronger than a shorter set of random words simply because the search space is larger.

If you can remember a random set of words, then do that. However, if you can't, then use a grammatically correct phrase, but make it as long as possible. Remembering the password is probably as critical as ensuring it is strong. I've frequently found the weakest part of many systems is their password recovery process, and I think you should do as much as you can to avoid ever needing to use such a process. Enter the phrase in reverse order (or some other pattern you can remember), insert special characters and numbers into the words rather than between them, avoid number patterns such as 2, 4 and 6 digits, and stay away from quotes/phrases which someone could associate with you.


The research paper Effect of Grammar on Security of Long Passwords answers your question. The following is the abstract from the paper.

Use of long sentence-like or phrase-like passwords such as "abiggerbetterpassword" and "thecommunistfairy" is increasing. In this paper, we study the role of grammatical structures underlying such passwords in diminishing the security of passwords. We show that the results of the study have direct bearing on the design of secure password policies, and on password crackers used for enforcing password security. Using an analytical model based on Parts-of-Speech tagging we show that the decrease in search space due to the presence of grammatical structures can be more than 50%. A significant result of our work is that the strength of long passwords does not increase uniformly with length. We show that using a better dictionary e.g. Google Web Corpus, we can crack more long passwords than previously shown (20.5% vs. 6%). We develop a proof-of-concept grammar-aware cracking algorithm to improve the cracking efficiency of long passwords. In a performance evaluation on a long password dataset, 10% of the total dataset was exclusively cracked by our algorithm and not by state-of-the-art password crackers.


First of all: if you're selecting words non-randomly (to follow grammar rules, for example), then this isn't an XKCD-style password at all. From my understanding, "XKCD-style" just means diceware with a smaller word list.

One problem with grammatically correct sentences is that, unless they are meaningless nonsense, they're probably quite predictable. I don't know exactly how predictable, but I do know that basically anything in print anywhere is insecure as the basis for a password. So you'll need to somehow generate a unique phrase that nobody will have uttered before.

That said, you should still be able to make a secure passphrase that (loosely) follows grammar rules by making randomized nonsense phrases like a Mad-Lib. Just do diceware using a different wordlist for each word. For example, your passphrase generator could generate passwords in the form:

{article} {adjective} {noun} {adverb} {verb} {article} {adjective} {noun}, {exclamation}{punctuation}

For example, "The stylish aardvark stupidly flings a lumpy blimp, yikes!"

I think a "sentence" like that would be much easier to remember than 6 completely random words all jumbled together, but obviously your total word list must be larger to achieve similar security.
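A minimal Python sketch of such a Mad-Lib-style generator is shown below. The tiny word lists are hypothetical placeholders of mine (real lists would need thousands of entries each), and using Python's secrets module in place of physical dice is my own substitution; the essential part is that every slot is filled by an independent, uniformly random choice from its own list.

```python
import secrets

# Hypothetical toy word lists, one per slot type; far too small for real use.
WORDS = {
    "article": ["a", "the"],
    "adjective": ["stylish", "lumpy", "grumpy", "shiny"],
    "noun": ["aardvark", "blimp", "teapot", "walrus"],
    "adverb": ["stupidly", "quietly", "bravely", "slowly"],
    "verb": ["flings", "paints", "tickles", "juggles"],
    "exclamation": ["yikes", "oh my", "uh-oh", "rats"],
    "punctuation": [".", "!"],
}

# Slot order for the main clause of the template.
CLAUSE = ["article", "adjective", "noun", "adverb",
          "verb", "article", "adjective", "noun"]

def pick(slot):
    """Choose one entry uniformly at random using a cryptographic RNG."""
    return secrets.choice(WORDS[slot])

def make_passphrase():
    """Fill every slot of the template with an independent random choice."""
    clause = " ".join(pick(slot) for slot in CLAUSE)
    return f"{clause.capitalize()}, {pick('exclamation')}{pick('punctuation')}"

print(make_passphrase())
```

The words must never be re-rolled until the sentence "sounds good", since that would make the choices non-random and invalidate any entropy estimate.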

You could have a list of 4096 each of nouns, adverbs, verbs, and adjectives (i.e. 16384 words total), so each of those words contributes 12 bits. We'll make it simple and say you have 2 articles (1 bit), 2 punctuation marks (. or !, 1 bit) and 32 common exclamations ("oh my", "uh-oh", "rats", etc., 5 bits). So you can calculate the entropy in bits as:

1 + 12 + 12 + 12 + 12 + 1 + 12 + 12 + 5 + 1 = 80

Note that it's the size of the word list for each word position that fully determines the entropy. The attacker could know exactly how you generate your password, and unless you're really unlucky and manage to generate a common song lyric, you can still calculate just how secure the password is, just as you can for an XKCD-style password.
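The per-position calculation can be checked with a few lines of Python. The list sizes follow the example above (with 2 articles assumed, matching the 1 bit given for that slot); the comparison with six words from a 7776-word Diceware list is an addition of mine for reference.

```python
import math

# Number of choices at each position of the template, in order:
# article, adjective, noun, adverb, verb, article, adjective, noun,
# exclamation, punctuation.
slot_sizes = [2, 4096, 4096, 4096, 4096, 2, 4096, 4096, 32, 2]

# Entropy is the sum of log2(list size) over independent random slots.
template_bits = sum(math.log2(n) for n in slot_sizes)

# For comparison: six words from a 7776-word Diceware list.
diceware_bits = 6 * math.log2(7776)

print(template_bits, round(diceware_bits, 1))
```

This gives 80 bits for the template against roughly 77.5 bits for six Diceware words, so the two schemes are in the same range despite the template's fixed structure.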

The key is that each word must be truly random, and either each word list must be large or you must make very long phrases.

It's probably easier to get large word lists that don't break down words by part of speech, and it's easier to distribute ONE word list and one easily understood rule, but the math should be exactly the same.

Disclaimer: I'm not a security expert, but I think I understand the math and the concepts involved here.