Attack against OTP Cipher

How does a One-Time Pad work?

Imagine you have a message M which is encrypted with a key K, which then results in a ciphertext C. Let us assume that the process through which the encryption occurs is an XOR, which I'll show by the ^ symbol. Furthermore, we assume M, K and C all have the same length. So we know:

M ^ K = C

If K is truly random, kept secret and used only once, an attacker who has access to C cannot recover M, no matter how much computing power they have. Why? Because even if the attacker tried every single possible key K, they would obtain every single possible message M' of that length, with nothing to tell them which one is correct. They might as well write down every possible message of that length directly; the ciphertext leaves them none the wiser.
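To make the argument above concrete, here is a minimal Python sketch. The helper name `xor_bytes` and both example messages are my own assumptions for illustration; they are not part of the original scenario.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings of equal length (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


message = b"ATTACK AT DAWN"                # M (illustrative text)
key = secrets.token_bytes(len(message))    # K: random, same length as M
ciphertext = xor_bytes(message, key)       # C = M ^ K

# Decryption is the same operation: C ^ K = M
recovered = xor_bytes(ciphertext, key)

# For ANY other message of the same length there is a key that
# "decrypts" C to it, so trying every key reveals nothing.
decoy = b"RETREAT AT TEN"
decoy_key = xor_bytes(ciphertext, decoy)
decoy_plain = xor_bytes(ciphertext, decoy_key)

print(recovered)
print(decoy_plain)
```

Both plaintexts are equally valid decryptions of the same ciphertext, which is why the attacker cannot pick the right one.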

However...

Why is it called One-Time Pad?

Because each key must be used only once. If you use it twice, the guarantee above no longer holds.

Let's take the same scenario as above, but now we have two messages M1 and M2, one key K and two ciphertexts C1 and C2.

M1 ^ K = C1
M2 ^ K = C2

The curious thing about XOR is that it is a "reversible" operation. That means XOR'ing something with the same value twice results in the original value:

X ^ X = 0
X ^ 0 = X

therefore

X ^ Y ^ X = Y

XOR is commutative and associative, so the order of operations does not matter, just like in addition. Let's assume that the attacker has C1 and C2, but does not have access to M1 and M2.

C1 ^ C2 = (M1 ^ K) ^ (M2 ^ K)

Since the order of operations does not matter, we can remove the parentheses and group the two K together:

C1 ^ C2 = (M1 ^ M2) ^ (K ^ K)

We learned above that a value XOR'd with itself is 0, and that a value XOR'd with 0 is itself. As such, the right-hand group (K ^ K) is 0 and can simply be dropped:

C1 ^ C2 = M1 ^ M2

This means that you now have access to the two plaintext messages XOR'd with each other, with the key gone entirely. This is a lot more information than the ciphertexts alone give you.
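The cancellation can be checked with a short Python sketch. The messages and the `xor_bytes` helper are assumptions made up for this illustration; the key is random, so the result holds for any key.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings of equal length (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


m1 = b"HELLO WORLD"
m2 = b"GOODBYE ALL"
k = secrets.token_bytes(len(m1))   # one key, wrongly used twice

c1 = xor_bytes(m1, k)              # C1 = M1 ^ K
c2 = xor_bytes(m2, k)              # C2 = M2 ^ K

# The key cancels out: C1 ^ C2 equals M1 ^ M2, whatever K was.
leak = xor_bytes(c1, c2)
print(leak == xor_bytes(m1, m2))
```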

How can I go from there?

Let us assume that each message consists only of uppercase ASCII letters and spaces, and you know the following:

     C1 = 55 3a 90 26 b3 b6 48 37 6f c1 45 f7 e8 47 61 78 21 52
     C2 = 42 33 97 55 ca b0 4e 37 61 ca 24 f9 e6 34 66 71 2f 44
C1 ^ C2 = 17 09 07 73 79 06 06 00 0e 0b 61 0e 0e 73 07 09 0e 16

At first glance, this may not seem to tell us a lot. After all, it's just some hexadecimal, right?

Well, if you look at the values in the last line, you see some low values and some high values. Let's look at the high values, the first one being 0x73, which is the ASCII value of the lowercase s. Why is this interesting? We know that it's the result of M1 XOR'd with M2, and both can only contain uppercase letters and spaces. Two uppercase letters XOR'd together always give a value below 0x20, and two spaces give 0x00, so a value in the lowercase range can only come from a letter XOR'd with a space. If you look closely at the ASCII value of a space, you see it's 0x20, or 0010 0000 in binary, meaning that XOR'ing with a space flips exactly one bit. If you look at the ASCII table, you will notice that an uppercase letter and its lowercase counterpart differ in exactly that bit, which lets "to upper", "to lower" and "toggle case" functions operate on a single bit. So 0x73 (s) is an uppercase S (0x53) XOR'd with a space.
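As a rough sketch of this observation, the Python below XORs the two ciphertexts from the example and lists every position whose result falls in the lowercase range. The variable names are my own, and positions are counted from 1 to match the prose.

```python
C1 = bytes.fromhex("553a9026b3b648376fc145f7e84761782152")
C2 = bytes.fromhex("42339755cab04e3761ca24f9e63466712f44")

# C1 ^ C2 equals M1 ^ M2 because the shared key cancels out.
xored = bytes(a ^ b for a, b in zip(C1, C2))
print(xored.hex(" "))

# A byte in 0x61..0x7a can only be "uppercase letter ^ space".
# Flipping bit 0x20 back gives the uppercase letter involved.
hits = []
for pos, byte in enumerate(xored, start=1):
    if 0x61 <= byte <= 0x7A:
        hits.append((pos, chr(byte ^ 0x20)))

for pos, letter in hits:
    print(f"position {pos}: one message has {letter!r}, the other a space")
```

This only tells you that a letter and a space meet at those positions, not which message holds which.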

So we know that one of two cases holds.

Either all of the following are true:

  • The fourth byte of M1 is S
  • The fourth byte of M2 is _
  • The fourth byte of K is 0x75

or

  • The fourth byte of M2 is S
  • The fourth byte of M1 is _
  • The fourth byte of K is 0x06

(Note that I am using _ to represent a space for better visibility)
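The key bytes in the two cases follow directly from the fourth ciphertext bytes (0x26 and 0x55). A small arithmetic check in Python, with variable names chosen only for this illustration:

```python
c1_byte = 0x26   # fourth byte of C1
c2_byte = 0x55   # fourth byte of C2

# Case 1: M1 has 'S' and M2 has a space.
key_case_1 = c1_byte ^ ord("S")
# Both ciphertext bytes must agree on the same key byte.
case_1_consistent = key_case_1 == (c2_byte ^ ord(" "))

# Case 2: M2 has 'S' and M1 has a space.
key_case_2 = c1_byte ^ ord(" ")
case_2_consistent = key_case_2 == (c2_byte ^ ord("S"))

print(hex(key_case_1), case_1_consistent)
print(hex(key_case_2), case_2_consistent)
```

Both cases are internally consistent, which is exactly why two ciphertexts alone cannot settle the question.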

We cannot yet tell which of these cases is true, but we know they are mutually exclusive. If you have more messages C3, C4, etc., all encrypted with the same key, you can determine which message has the space: XOR'ing its ciphertext with each of the others at that position yields either a lowercase ASCII letter or 0x00 (when the other message also has a space there).

In fact, let's assume you intercepted a third message C3, with the value 4826f938d2a63b596dcc45f8e8347778354e

Now you know:

     C1 = 55 3a 90 26 b3 b6 48 37 6f c1 45 f7 e8 47 61 78 21 52
     C2 = 42 33 97 55 ca b0 4e 37 61 ca 24 f9 e6 34 66 71 2f 44
     C3 = 48 26 f9 38 d2 a6 3b 59 6d cc 45 f8 e8 34 77 78 35 4e

C1 ^ C2 = 17 09 07 73 79 06 06 00 0e 0b 61 0e 0e 73 07 09 0e 16
C1 ^ C3 = 1d 1c 69 1e 61 10 73 6e 02 0d 00 0f 00 73 16 00 14 1c
C2 ^ C3 = 0a 15 6e 6d 18 16 75 6e 0c 06 61 01 0e 00 11 09 1a 0a

This already gives us a bit more information. A lot more, in fact. First of all, let's have a look at the above hypothesis: we know that either M1[4] or M2[4] is _. Since C1[4] ^ C3[4] (and thus M1[4] ^ M3[4]) is 0x1e, which is not a lowercase ASCII character, neither M1[4] nor M3[4] can be a space. Therefore the first of our two hypothesized cases is true, and we know a bit more about the key and the other messages:

     C1 = 55 3a 90 26 b3 b6 48 37 6f c1 45 f7 e8 47 61 78 21 52
     C2 = 42 33 97 55 ca b0 4e 37 61 ca 24 f9 e6 34 66 71 2f 44
     C3 = 48 26 f9 38 d2 a6 3b 59 6d cc 45 f8 e8 34 77 78 35 4e

     M1 = ?? ?? ??  S ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
     M2 = ?? ?? ?? __ ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
     M3 = ?? ?? ??  M ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??

      K = ?? ?? ?? 75 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??

C1 ^ C2 = 17 09 07 73 79 06 06 00 0e 0b 61 0e 0e 73 07 09 0e 16
C1 ^ C3 = 1d 1c 69 1e 61 10 73 6e 02 0d 00 0f 00 73 16 00 14 1c
C2 ^ C3 = 0a 15 6e 6d 18 16 75 6e 0c 06 61 01 0e 00 11 09 1a 0a

We can also see that M1[3] ^ M3[3] and M2[3] ^ M3[3] result in lowercase letters (0x69 is i, 0x6e is n), while M1[3] ^ M2[3] does not, so we know M3[3] must be _ and therefore K[3] = C3[3] ^ 0x20, which is 0xd9. That in turn gives M1[3] = I and M2[3] = N.
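The same reasoning can be automated. The sketch below is my own illustration, not a standard tool: it marks a message as certainly holding a space at a position only when its byte XOR'd with every other message gives a lowercase letter and the other messages differ from each other there. Note that Python counts positions from 0, so the third byte in the prose is position 2 here.

```python
CIPHERTEXTS = [
    bytes.fromhex("553a9026b3b648376fc145f7e84761782152"),  # C1
    bytes.fromhex("42339755cab04e3761ca24f9e63466712f44"),  # C2
    bytes.fromhex("4826f938d2a63b596dcc45f8e8347778354e"),  # C3
]


def is_lower(byte: int) -> bool:
    return 0x61 <= byte <= 0x7A


def certain_spaces(cts):
    """Map 0-based position -> index of the message that must be a space."""
    found = {}
    for pos in range(min(len(ct) for ct in cts)):
        column = [ct[pos] for ct in cts]
        for i, byte in enumerate(column):
            others = [b for j, b in enumerate(column) if j != i]
            # All XORs lowercase: either message i is the space, or all the
            # others are. If the others differ, they cannot all be spaces.
            if all(is_lower(byte ^ b) for b in others) and len(set(others)) > 1:
                found[pos] = i
    return found


spaces = certain_spaces(CIPHERTEXTS)
# A space is 0x20, so the key byte is the ciphertext byte XOR 0x20.
key_bytes = {pos: CIPHERTEXTS[i][pos] ^ 0x20 for pos, i in spaces.items()}
print(spaces)
print({pos: hex(k) for pos, k in key_bytes.items()})
```

Positions where two of the three messages share the same byte are deliberately skipped, because the XOR pattern there fits two different readings.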

After repeating this for every position, your grid should look like this:

     C1 = 55 3a 90 26 b3 b6 48 37 6f c1 45 f7 e8 47 61 78 21 52
     C2 = 42 33 97 55 ca b0 4e 37 61 ca 24 f9 e6 34 66 71 2f 44
     C3 = 48 26 f9 38 d2 a6 3b 59 6d cc 45 f8 e8 34 77 78 35 4e

     M1 = ?? ??  I  S __ ??  S __ ?? ?? __ ?? ??  S ?? ?? ?? ??
     M2 = ?? ??  N __  Y ??  U __ ?? ??  A ?? ?? __ ?? ?? ?? ??
     M3 = ?? ?? __  M  A ?? __  N ?? ?? __ ?? ?? __ ?? ?? ?? ??

      K = ?? ?? d9 75 93 ?? 1b 17 ?? ?? 65 ?? ?? 14 ?? ?? ?? ??

C1 ^ C2 = 17 09 07 73 79 06 06 00 0e 0b 61 0e 0e 73 07 09 0e 16
C1 ^ C3 = 1d 1c 69 1e 61 10 73 6e 02 0d 00 0f 00 73 16 00 14 1c
C2 ^ C3 = 0a 15 6e 6d 18 16 75 6e 0c 06 61 01 0e 00 11 09 1a 0a
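To reproduce the grid, you can apply the key bytes recovered so far to each ciphertext. This is a small sketch with names of my own choosing; the key bytes are the ones listed in the K row above, indexed from 0 as Python does, and unknown positions are printed as ?.

```python
CIPHERTEXTS = [
    bytes.fromhex("553a9026b3b648376fc145f7e84761782152"),  # C1
    bytes.fromhex("42339755cab04e3761ca24f9e63466712f44"),  # C2
    bytes.fromhex("4826f938d2a63b596dcc45f8e8347778354e"),  # C3
]

# Key bytes from the K row of the grid, keyed by 0-based position.
KNOWN_KEY = {2: 0xD9, 3: 0x75, 4: 0x93, 6: 0x1B, 7: 0x17, 10: 0x65, 13: 0x14}


def partial_decrypt(ct: bytes, known_key: dict) -> str:
    """Decrypt the positions with a known key byte; '?' everywhere else."""
    out = []
    for pos, byte in enumerate(ct):
        if pos in known_key:
            ch = chr(byte ^ known_key[pos])
            out.append("_" if ch == " " else ch)  # '_' marks a space
        else:
            out.append("?")
    return "".join(out)


rows = [partial_decrypt(ct, KNOWN_KEY) for ct in CIPHERTEXTS]
for name, row in zip(("M1", "M2", "M3"), rows):
    print(name, "=", row)
```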

Are we stuck now?

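This is about as much as the XOR patterns alone can tell you. Strictly speaking, the 8th, 11th and 14th columns are not yet certain: when two of the three messages share a byte, the pattern fits either two spaces and one letter or two identical letters and one space, and the grid uses the likelier two-spaces reading. Now you can start to make some educated guesses. For example, you can see that in M2 there is a three-letter word starting with Y and ending in U. A reasonable guess is that this word is "you". And indeed, if we assume that, we'd get the following: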

     M1 = ?? ??  I  S __  I  S __ ?? ?? __ ?? ??  S ?? ?? ?? ??
     M2 = ?? ??  N __  Y  O  U __ ?? ??  A ?? ?? __ ?? ?? ?? ??
     M3 = ?? ?? __  M  A  Y __  N ?? ?? __ ?? ?? __ ?? ?? ?? ??

      K = ?? ?? d9 75 93 ff 1b 17 ?? ?? 65 ?? ?? 14 ?? ?? ?? ??

The other characters seem to fit. M1 forms the word "is" and M3 forms the word "may". Since those are legitimate English words, you can be pretty certain that the guess is correct. Furthermore, note that some of the XOR'd ciphertexts result in 0x00, which means that the two plaintext bytes at that position must be identical. For example, C1[13] ^ C3[13] is 0x00, so even though you don't yet know M1[13] or M3[13], you know that they must be the same character.
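Checking a guess like this is mechanical: derive the key byte from the guessed character, then see what it produces in the other messages. The function below is a hypothetical helper for illustration, using 0-based indices (message 1 is M2, position 5 is the sixth byte). It only checks that the results stay within the allowed alphabet; judging whether the words make sense is still up to you.

```python
import string

CIPHERTEXTS = [
    bytes.fromhex("553a9026b3b648376fc145f7e84761782152"),  # C1
    bytes.fromhex("42339755cab04e3761ca24f9e63466712f44"),  # C2
    bytes.fromhex("4826f938d2a63b596dcc45f8e8347778354e"),  # C3
]

ALLOWED = set(string.ascii_uppercase + " ")


def try_guess(cts, msg_index, pos, guess):
    """Assume message msg_index has character `guess` at 0-based `pos`.

    Returns (key_byte, column, plausible), where column holds the character
    every message would have at that position under this key byte.
    """
    key_byte = cts[msg_index][pos] ^ ord(guess)
    column = [chr(ct[pos] ^ key_byte) for ct in cts]
    plausible = all(ch in ALLOWED for ch in column)
    return key_byte, column, plausible


# Guess that the sixth byte of M2 is 'O', completing "YOU".
key_byte, column, plausible = try_guess(CIPHERTEXTS, 1, 5, "O")
print(hex(key_byte), column, plausible)
```

A guess that pushes any message outside uppercase letters and spaces can be discarded immediately; one that passes still needs the language check described above.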

Keep going from there and see if you can crack the rest.


Here is a modern attack on (key, IV) pair reuse in stream ciphers and on key reuse in the OTP:

  • A Natural Language Approach to Automated Cryptanalysis of Two-time Pads by Mason et al. (2006)

From the abstract (bold emphasis mine):

While keystream reuse in stream ciphers and one-time pads has been a well known problem for several decades, the risk to real systems has been underappreciated. Previous techniques have relied on being able to accurately guess words and phrases that appear in one of the plaintext messages, making it far easier to claim that “an attacker would never be able to do that.” In this paper, we show how an adversary can automatically recover messages encrypted under the same keystream if only the type of each message is known (e.g. an HTML page in English). Our method, which is related to HMMs, recovers the most probable plaintext of this type by using a statistical language model and a dynamic programming algorithm. It produces up to 99% accuracy on realistic data and can process ciphertexts at 200ms per byte on a $2,000 PC. To further demonstrate the practical effectiveness of the method, we show that our tool can recover documents encrypted by Microsoft Word 2002

In other words, the two-time pad (or many-time pad) attack is no longer manual work.

  • Never ever re-use a key in OTP, and similarly
  • never ever re-use a (key, IV) pair in CTR, OFB, CFB, the ChaCha series, GCM...