How to publish scanned documents anonymously?
Publishing scans without being identified is a tough proposition. There are multiple risks of information leak, and mitigation is technically complex. However, anyone determined to do so can learn the appropriate techniques, and there is free software to accomplish the task.
Disclaimer: Although I consider myself technically knowledgeable about the mentioned issues and I've included references where they exist, some parts of this answer are speculative.
Risks:
Do scanners add any visual unique fingerprint (or even worse: information about the connected device etc.) to every scanned page?
This seems likely, considering that some printers do so. There isn't much information available on scanners, though.
Do scanners add any digital (e.g. binary) fingerprint (or even worse: information about the connected device etc.) to every scanned file?
If you're doing a scan from an attached PC (as your question implies), the answer is no, the scanner can't. Scanners attached to a PC transfer raster image data, not files, so the scanner can't add data to a file it never has access to.
However, you should consider that a digital fingerprint could be added by the scanning software on the PC.
Also, if the scanner is standalone (it saves files to a USB drive, or sends them by email), this is a definite possibility.
Do scanners have a unique 'technically unavoidable' fingerprint, so every scanner scans differently? And is this fingerprint computable or even stored somewhere? Or does the 'institution' that wants to deanonymize me have to have access to my scanner to make a comparison?
Yes. Most modern scanners use CCD sensors, and specialized software can identify an individual sensor by its noise pattern.
Other plausible visual fingerprinting targets:
- Lighting pattern. Usually the scanner sensor bar has LEDs on it to illuminate the page. The number and distribution of LEDs will differ amongst models.
- The paper fiber distribution of the scanned page
- Image distortion, caused by unique stepper motors (try scanning a piece of graph paper)
Using these kinds of fingerprinting techniques, it seems likely that the scanner model and paper type can be identified from the scans, but identifying the specific scanner and sheet of paper used would be hard (perhaps impossible) without access to them for comparison purposes.
Do PDFs 'store' any information related to the host computer in them?
Yes, there's even an NSA article about it. When dealing with scanned documents, you'll also need to be aware of image file metadata, which can be present in PNG and JPG files, for example.
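As a rough illustration of what such metadata looks like, here is a minimal sketch that searches the raw bytes of a PDF for a few common document-information keys. The function name `find_pdf_info` and the sample file contents are made up for this example. It only catches plain, uncompressed literal strings; it ignores hex-encoded or UTF-16 strings, compressed object streams, XMP metadata and encrypted files, so an empty result does not mean a file is clean.

```python
import re

# A few of the standard document-information keys a PDF may carry.
INFO_KEYS = (b"Author", b"Creator", b"Producer", b"Title", b"CreationDate", b"ModDate")

# Matches "/Key (literal string)", allowing backslash-escaped characters inside.
PATTERN = re.compile(rb"/(" + b"|".join(INFO_KEYS) + rb")\s*\(((?:\\.|[^\\)])*)\)")


def find_pdf_info(data):
    """Return (key, value) pairs found as plain literal strings in the PDF bytes."""
    return [(k.decode("ascii"), v.decode("latin-1")) for k, v in PATTERN.findall(data)]


# Hypothetical fragment of a PDF written by a scanning program.
sample = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Producer (Hypothetical Scan Driver 2.1) /Author (jdoe) "
    b"/CreationDate (D:20200101120000) >>\nendobj\n"
)
for key, value in find_pdf_info(sample):
    print(key, "=", value)
# Producer = Hypothetical Scan Driver 2.1
# Author = jdoe
# CreationDate = D:20200101120000
```

To check a real file, read it in binary mode and pass the bytes to `find_pdf_info`. Treat this as a quick look only, not as proof that nothing else is embedded.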
Another risk that you didn't mention is that the scanner itself may store a copy of your scan. Big printers do.
Of course, this isn't an exhaustive list of risks - merely what has come to my mind in the couple of minutes it has taken me to write this answer. I'm pretty sure researchers, intelligence agencies and police paid to do so can come up with better ideas!
Mitigation
The easiest, safest and most obvious mitigations are to not use a scanner that can be tied to your identity, and to destroy the scanner after the fact. Of course, this is not always attainable, so what else can you do to protect yourself?
Don't use a stand-alone scanner - especially a networked one. If you really must, convert its output to a pure image without metadata.
To mitigate (at least partially) fingerprints added by software, you'll want to use open source software, both for the OS and the scanning program. Avoid using your personal PC for scanning, or at least use a secure live OS.
For detecting deliberate visual fingerprinting, the best option would be to scan a blank page and look for obvious anomalies. These might be very small and faint, so you may want to use an image editor to crank up the contrast.
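The contrast trick can be sketched in a few lines. This is only an illustration under simplifying assumptions: the page is a grayscale image held as a nested list of 0..255 values, the function names `stretch_contrast` and `find_marks` are made up, and the tiny 4x4 "scan" is invented test data. On a real scan, sensor noise would also be amplified, so you would still have to judge by eye whether what stands out forms a regular pattern.

```python
def stretch_contrast(pixels):
    """Linearly rescale values so the darkest pixel becomes 0 and the lightest 255."""
    flat = [v for row in pixels for v in row]
    lo, hi = min(flat), max(flat)
    if lo == hi:
        # Perfectly uniform image: there is nothing to amplify.
        return [[0 for _ in row] for row in pixels]
    scale = 255 / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in pixels]


def find_marks(pixels, threshold=128):
    """Return (row, column) positions that come out dark after stretching."""
    stretched = stretch_contrast(pixels)
    return [
        (r, c)
        for r, row in enumerate(stretched)
        for c, v in enumerate(row)
        if v < threshold
    ]


# Hypothetical "blank page": almost white, with one barely darker dot.
blank_scan = [
    [250, 250, 250, 250],
    [250, 250, 247, 250],
    [250, 250, 250, 250],
    [250, 250, 250, 250],
]
marks = find_marks(blank_scan)
print(marks)  # [(1, 2)]
```

A difference of 3 out of 255 is invisible on screen, but after stretching it becomes the difference between pure black and pure white.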
For sensor, paper and visual fingerprinting in general, you want to destroy subtle scanning artifacts. Use an image editor to:
- Add noise
- Use a noise reduction filter (with aggressive reduction)
- Rotate
- Distort the image (by applying multiple camera "lens correction", for example)
- Convert the image to grayscale
- Increase the contrast (or, preferably, completely convert to black-and-white)
- Reduce resolution (preferably by a non-integer factor, so the new pixels don't line up with the original pixel grid)
- Compress the image (high JPEG compression, for example)
In general, do everything you can to obfuscate and reduce the amount of information contained in the image while keeping the document reasonably readable.
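To make the idea concrete, here is a minimal sketch of three of the steps above (add noise, reduce resolution by a non-integer factor, convert to pure black-and-white). It is an assumption-laden toy: the image is a grayscale nested list of 0..255 values, all function names and the default parameters (factor 1.37, noise amplitude 20, threshold 128) are arbitrary choices for illustration, and rotation, lens distortion and JPEG compression are left to an image editor. I can't promise that any particular combination defeats a given fingerprinting method.

```python
import random


def add_noise(pixels, amplitude, rng):
    """Add uniform random noise in [-amplitude, amplitude], clamped to 0..255."""
    return [
        [max(0, min(255, v + rng.randint(-amplitude, amplitude))) for v in row]
        for row in pixels
    ]


def downscale(pixels, factor):
    """Nearest-neighbour resample by a (non-integer) factor greater than 1."""
    height, width = len(pixels), len(pixels[0])
    new_h = max(1, int(height / factor))
    new_w = max(1, int(width / factor))
    return [
        [
            pixels[min(height - 1, int(r * factor))][min(width - 1, int(c * factor))]
            for c in range(new_w)
        ]
        for r in range(new_h)
    ]


def to_black_and_white(pixels, threshold=128):
    """Collapse every pixel to pure black (0) or pure white (255)."""
    return [[0 if v < threshold else 255 for v in row] for row in pixels]


def obfuscate(pixels, factor=1.37, amplitude=20, seed=None):
    """Noise first, then resample, then threshold away all remaining shades."""
    rng = random.Random(seed)
    noisy = add_noise(pixels, amplitude, rng)
    smaller = downscale(noisy, factor)
    return to_black_and_white(smaller)


# Hypothetical 10x10 page: light background (240) with a dark diagonal stroke (30).
page = [[30 if r == c else 240 for c in range(10)] for r in range(10)]
result = obfuscate(page)
print(len(result), len(result[0]))  # 7 7
```

The thresholding goes last on purpose: it throws away every intermediate gray level, which is where most of the subtle sensor and paper information would live, while dark text on a light page stays readable.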
Finally, after all the other steps, remove the metadata from your files. You can use specialized software to do this.
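For PNG files specifically, the principle can be shown with nothing but the standard library: a PNG is a signature followed by typed chunks, and everything except the image-critical chunks can be dropped. The sketch below is my own illustration, not a replacement for a dedicated tool. The names `strip_png_metadata`, `list_chunk_types` and `_chunk` are made up, the allowlist is deliberately strict (it also drops color-management chunks such as gamma), and the sample file with its "HypotheticalScanApp" text chunk is invented. It does nothing about JPEG or PDF metadata, or about anything hidden in the pixels themselves.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Chunks needed to decode the image; everything else is discarded.
KEEP = {b"IHDR", b"PLTE", b"tRNS", b"IDAT", b"IEND"}


def iter_chunks(data):
    """Yield (type, raw_bytes) for every chunk in a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        end = pos + 12 + length  # 4 length + 4 type + data + 4 CRC
        yield ctype, data[pos:end]
        pos = end


def strip_png_metadata(data):
    """Return the PNG with every chunk outside the allowlist removed."""
    kept = (raw for ctype, raw in iter_chunks(data) if ctype in KEEP)
    return PNG_SIGNATURE + b"".join(kept)


def list_chunk_types(data):
    return [ctype.decode("ascii") for ctype, _ in iter_chunks(data)]


def _chunk(ctype, payload):
    """Build one PNG chunk (used only to create the demo file below)."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


# Hypothetical 1x1 grayscale PNG carrying a "Software" text chunk.
sample = (
    PNG_SIGNATURE
    + _chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + _chunk(b"tEXt", b"Software\x00HypotheticalScanApp 1.0")
    + _chunk(b"IDAT", zlib.compress(b"\x00\xff"))
    + _chunk(b"IEND", b"")
)
clean = strip_png_metadata(sample)
print(list_chunk_types(sample))  # ['IHDR', 'tEXt', 'IDAT', 'IEND']
print(list_chunk_types(clean))   # ['IHDR', 'IDAT', 'IEND']
```

Listing the chunk types before and after is a cheap way to confirm that the text chunk is really gone, rather than trusting the tool.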
Buy the scanner with cash, and buy a PC from a second-hand PC shop, also with cash. Make sure you never enter your name or any other personal information into the computer. If everything is bought with cash, and you have a freshly installed OS containing only alias information about yourself, then there should be no accurate metadata to encode.
Certain programs do encode metadata, such as Microsoft Word and other Microsoft products. I think even text files have operating system metadata associated with them. I can't see any software encoding an IP address or something of that nature as metadata; that would be more invasive than normal.
It is possible to scrub metadata from files programmatically; it just requires a little bit of effort. Images almost always have some form of metadata, such as GPS coordinates if they were taken with a mobile device, but I can't see scanners having GPS chips. It would be a bit of a waste, wouldn't it?
PDFs will probably have a lot of metadata associated with them, though the software would have to get the user's information from somewhere.
Another thing that would help prevent metadata from being transferred is a lack of connection to the internet. If the programs can't phone home, then they can't initialize certain metadata such as location. I realize this talks a little less about the actual metadata than you would like; sorry about that. I am an entry-level programmer, but I have had some classes in computer forensics as well as computer programming. I hope this helps.
Don't do it.
Forget about it.
If the documents that you are trying to surreptitiously reveal are sensitive enough to demand that level of anonymity and "security", you will be found out.
Snowden revealed secret documents, but he did not hide his identity, and neither did Manning.
ALL of the "security methods" mentioned above will fail, and badly. Why?
They operate on the premise that there is this huge pool of potential leakers, of which you will be an anonymous entrant with nothing to point you out.
However: Most secure documents have a limited distribution/access list, and many are time sensitive, which fix their release to a certain point in time.
Suspicion will fall on you immediately, and there will be many indicators of your involvement right away, not least your post on this site!
You will have to prove you did not do it, rather than the other way around, and if you are physically seized, you will confess.
For secure documents and most theft cases, the suspect is picked first and then their circumstantial evidentiary trail is used to lock in their guilt!
You used Tor? Not many do. Do you use Tor all the time? Oh no? You only used it just to upload these docs? Guilty.
How about going to a public wifi spot? Is it near where you live? Did you take your cellphone with you? (cell tower access logs)
Seriously, you are not a spy, and even if you are, you will be caught.
Your only hope is if someone else stole them and you got these documents outside of their knowledge, but the arrow is already pointing to you.