Why is writing zeros (or random data) over a hard drive multiple times better than just doing it once?
Summary: it was marginally better on older drives, but doesn't matter now. Multiple passes erase a tree with overkill but miss the rest of the forest. Use encryption.
The origin lies in work by Peter Gutmann, who showed that there is some memory in a disk bit: a zero that's been overwritten with a zero can be distinguished from a one that's been overwritten with a zero, with a probability higher than 1/2. However, Gutmann's work has been somewhat overhyped, and does not extend to modern disks. “The urban legend of multipass hard disk overwrite and DoD 5220-22-M” by Brian Smithson has a good overview of the topic.
The article that started it is “Secure Deletion of Data from Magnetic and Solid-State Memory” by Peter Gutmann, presented at USENIX in 1996. He measured data remanence after repeated wipes, and saw that after 31 passes, he was unable (with expensive equipment) to distinguish a multiply-overwritten one from a multiply-overwritten zero. Hence he proposed a 35-pass wipe as an overkill measure.
Note that this attack assumes an attacker with physical access to the disk and somewhat expensive equipment. It is rather unrealistic to assume that an attacker with such means will choose this method of attack rather than, say, lead pipe cryptography.
Gutmann's findings do not extend to modern disk technologies, which pack data ever more densely. “Overwriting Hard Drive Data: The Great Wiping Controversy” by Craig Wright, Dave Kleiman and Shyaam Sundhar is a more recent article on the topic; they were unable to replicate Gutmann's recovery with recent drives. They also note that the probabilities of recovering successive bits are not strongly correlated, meaning that an attacker is very unlikely to recover, say, a full secret key or even a single byte (the back-of-the-envelope arithmetic after the quote below makes this concrete). Overwriting with zeroes is slightly less destructive than overwriting with random data, but even a single pass of zeroes makes the probability of any useful recovery very low. Gutmann somewhat contests the article; however, he agrees with its conclusion that his recovery techniques are not applicable to modern disks:
Any modern drive will most likely be a hopeless task, what with ultra-high densities and use of perpendicular recording I don't see how MFM would even get a usable image, and then the use of EPRML will mean that even if you could magically transfer some sort of image into a file, the ability to decode that to recover the original data would be quite challenging.
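To see why roughly independent per-bit recovery makes large-scale recovery hopeless, here is a back-of-the-envelope calculation. The per-bit probability of 0.56 is purely an assumed figure for illustration (a little better than a coin flip); the exact value matters less than the compounding.

```python
# Illustration (assumed numbers): if each overwritten bit can be guessed
# correctly with probability p, and the guesses are roughly independent,
# then recovering any multi-bit value requires getting every bit right.
p_bit = 0.56                # assumed per-bit recovery probability (illustrative)
p_byte = p_bit ** 8         # all 8 bits of one byte must be correct
p_key = p_bit ** 128        # a full 128-bit key must be correct

print(f"one byte   : {p_byte:.4%}")   # about 1%
print(f"128-bit key: {p_key:.2e}")    # about 6e-33, i.e. effectively zero
```

Even with generous assumptions about per-bit recovery, anything longer than a few bits is out of reach, which is exactly the point made by Wright, Kleiman and Sundhar.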
Gutmann later studied flash technologies, which show more remanence.
If you're worried about an attacker with physical possession of the disk and expensive equipment, the quality of the overwrite is not what you should worry about. Disks reallocate sectors: if a sector is detected as defective, the disk will never make it accessible to software again, but the data that was stored there may still be recoverable by the attacker. This phenomenon is worse on SSDs because of their wear leveling.
Some storage media have a secure erase command (ATA Secure Erase). UCSD CMRR provides a DOS utility to perform this command; under Linux you can use hdparm --security-erase. Note that this command may not have gone through extensive testing, and you will not be able to perform it if the disk died because of fried electronics, a failed motor, or crashed heads (unless you repair the damage, which would cost more than a new disk).
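For reference, ATA Secure Erase under Linux is usually a two-step affair with hdparm: set a temporary security password, then issue the erase. The sketch below is only a hedged illustration: the device path is a placeholder, the command is irreversibly destructive, and you should check your hdparm man page first (many systems also mark the drive's security state as "frozen" at boot, in which case the command will be refused).

```python
# Sketch (assumptions: Linux, hdparm installed, and /dev/sdX is a
# placeholder for the disk you really mean to wipe). Running this for real
# destroys all data on the target device.
import subprocess

DEVICE = "/dev/sdX"    # placeholder device name, NOT a real target
PASSWORD = "p"         # throwaway ATA security password

def ata_secure_erase(device: str, password: str) -> None:
    """Set a temporary ATA security password, then trigger Secure Erase."""
    subprocess.run(["hdparm", "--user-master", "u",
                    "--security-set-pass", password, device], check=True)
    subprocess.run(["hdparm", "--user-master", "u",
                    "--security-erase", password, device], check=True)

# ata_secure_erase(DEVICE, PASSWORD)   # uncomment only when you are sure
```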
If you're concerned about an attacker getting hold of the disk, don't put any confidential data on it. Or if you do, encrypt it. Encryption is cheap and reliable (well, as reliable as your password choice and system integrity).
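As a toy illustration of "encrypt it before it ever reaches the disk", here is a minimal sketch using the third-party Python cryptography package (the file name and data are hypothetical). For a whole drive you would normally use full-disk encryption (LUKS, BitLocker, FileVault) rather than a hand-rolled file layer, but the principle is the same: losing or destroying the key is as good as wiping the data.

```python
# Minimal sketch: plaintext never touches the disk, only ciphertext does.
# Requires: pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # store this elsewhere (not on the same disk)
cipher = Fernet(key)

secret = b"confidential data"
with open("data.enc", "wb") as f:            # hypothetical output file
    f.write(cipher.encrypt(secret))

# Reading it back requires the key; without the key the file is just noise.
with open("data.enc", "rb") as f:
    assert cipher.decrypt(f.read()) == secret
```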
There is a well-known reference article by Peter Gutmann on the subject. However, that article is a bit old (15 years), and newer hard disks might not operate as described there.
Some data may fail to be totally obliterated by a single write due to two phenomena:
We want to write a bit (0 or 1) but the physical signal is analog. Data is stored by manipulating the orientation of groups of atoms within the ferromagnetic medium; when read back, the head yields an analog signal, which is then decoded with a threshold: e.g., if the signal goes above 3.2 (fictitious unit), it is a 1; otherwise, it is a 0. But the medium may have some remanence: possibly, writing a 1 over what was previously a 0 yields 4.5, while writing a 1 over what was already a 1 pumps the signal up to 4.8. By opening the disk and using a more precise sensor, it is conceivable that the difference could be measured reliably enough to recover the old data (a toy simulation of this effect appears after the second phenomenon below).
Data is organized by tracks on the disk. When writing over existing data, the head is roughly positioned over the previous track, but almost never exactly over that track. Each write operation may have a bit of "lateral jitter". Hence, part of the previous data could possibly still be readable "on the side".
Multiple writes with various patterns aim at counterbalancing these two effects.
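To make the first effect tangible, here is a toy simulation using the fictitious signal units from above; the remanence and noise values are assumed for illustration. The drive's own threshold read sees only the new data, but a hypothetical high-precision sensor comparing fine signal levels could still guess the old bit.

```python
# Toy model of magnetic remanence (assumed numbers, fictitious units):
# writing a 1 over a 0 gives ~4.5, writing a 1 over a 1 gives ~4.8, as in
# the text; a 0 written over a 0/1 gives ~0.5/0.8.
import random

REMANENCE = 0.3   # assumed residual contribution of the previous bit
NOISE = 0.05      # assumed read noise (standard deviation)

def written_level(new_bit: int, old_bit: int) -> float:
    base = 4.5 if new_bit else 0.5                    # nominal new-data level
    return base + REMANENCE * old_bit + random.gauss(0.0, NOISE)

def normal_read(level: float) -> int:
    """What the drive itself does: threshold at 3.2, old data invisible."""
    return 1 if level > 3.2 else 0

def forensic_guess(level: float, new_bit: int) -> int:
    """Hypothetical precise sensor: inspect the residue above the nominal level."""
    base = 4.5 if new_bit else 0.5
    return 1 if level - base > REMANENCE / 2 else 0

random.seed(0)
trials, hits = 10_000, 0
for _ in range(trials):
    old, new = random.getrandbits(1), random.getrandbits(1)
    level = written_level(new, old)
    assert normal_read(level) == new     # ordinary reads only ever see the new bit
    hits += forensic_guess(level, new) == old
print(f"old-bit recovery in this toy model: {hits / trials:.1%}")
```

Shrinking the assumed remanence or raising the noise, which is effectively what denser modern media do, drives that recovery rate back toward a coin flip.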
Modern hard disks achieve a very high data density. It stands to reason that the higher the data density, the harder it becomes to recover traces of old overwritten data. It is plausible that recovering overwritten data is no longer possible with today's technology. At least, nobody currently advertises such a service (but this does not mean that it cannot be done...).
Note that when a disk detects a damaged sector (checksum failure upon reading), the next write operation over that sector will be silently remapped to a spare sector. This means that the damaged sector (which has at least one wrong bit, but not necessarily more than one) will remain untouched forever after that event, and no amount of rewriting can change that (the disk's own electronics will refuse to use that sector ever again). If you want to be sure to erase data, it is much better never to let it reach the disk in the first place: use full-disk encryption.
The answers provided so far are informative but incomplete. Data are stored on a (magnetic) hard disk using Manchester coding: it is not whether the magnetic domain points up or down that encodes a one or a zero, it is the transitions between up and down that encode the bits.
Manchester coding usually starts with a short run of nonsense data suitable for establishing the 'rhythm' of the signal. One can imagine that if a single overwrite with all zeroes was not exactly in phase with the timing under which the original data were stored, the original rhythm and edges might still be easy to detect, allowing the data to be reconstructed.
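For readers unfamiliar with it, here is a minimal sketch of Manchester coding (one common convention: a 0 is a high-to-low transition, a 1 is a low-to-high transition), just to illustrate that the bits live in the transitions rather than in static levels. As the Gutmann quote above hints (EPRML), modern drives actually use more elaborate channel codes, so treat this purely as an illustration of the principle.

```python
# Minimal Manchester coding sketch: each data bit becomes two half-cells,
# and the direction of the mid-cell transition carries the bit.
from typing import List

def manchester_encode(bits: List[int]) -> List[int]:
    out: List[int] = []
    for b in bits:
        out += [0, 1] if b else [1, 0]    # 1 -> low-to-high, 0 -> high-to-low
    return out

def manchester_decode(halves: List[int]) -> List[int]:
    # Pair up the half-cells and read off the direction of each transition.
    return [1 if lo == 0 and hi == 1 else 0
            for lo, hi in zip(halves[0::2], halves[1::2])]

data = [1, 0, 1, 1, 0, 0, 1]
assert manchester_decode(manchester_encode(data)) == data
```

Note that the decoder above assumes it already knows where each bit cell starts; the point of this answer is precisely that a clock-recovery preamble makes that alignment easy to find.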