Is privacy compromised when sharing SHA-1 hashed URLs?

It's better but not perfect.

While it is (currently) impossible to recover the URL from a given hash, hashing is deterministic: the same URL always produces the same hash.

So nobody can simply read off all the URLs a user browses, but an attacker can quite likely recover most of them.

While it isn’t possible to see that user A visits HASH1 and conclude that HASH1 means fancyDomainBelongingToUserA-NoOneElseVisits.com, it is, for example, possible to simply calculate the hash of CheatOnMyWife.fancytld and then see which users visit that site.
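
To illustrate, here is a minimal sketch in Python; the URL and the reported data are made up:

    import hashlib

    def url_hash(url):
        # SHA-1 is deterministic: the same URL always yields the same digest.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    # Pretend these are hashes reported by users (hypothetical data).
    reported = {url_hash("http://cheatonmywife.fancytld/"): ["user17", "user42"]}

    # The attacker never needs to invert SHA-1; he just hashes the URL he
    # cares about and looks it up.
    suspect = url_hash("http://cheatonmywife.fancytld/")
    print(reported.get(suspect, []))  # -> ['user17', 'user42']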

I wouldn’t consider that to be protecting the user’s privacy.

Also, just matching users who visit a lot of similar domains can be pretty revealing.


I think it's good that you want to protect a user's privacy, but what you're building is fundamentally at odds with that goal, so I don't think it's achievable with a simple setup (e.g. the client sending the URL, in whatever form, directly to your backend service).

As others have noted, hashing with SHA-1 is a good first step, but it only achieves privacy against a human taking a quick glance at the database. It doesn't give you much privacy against algorithms designed to analyze the database contents.

You're leaking more than the visited URL, too: if you're doing real-time checking, the user also tells you at what time he was online and looked at the given URL.

A few others have suggested solutions to mitigate the privacy issues. While they're all better than doing nothing, they don't solve the problem. For example, Google's solution of only sending 32 bits of the hash looks nice, but it still only maps all existing URLs onto a hash table with about 4 billion slots. Some of those slots may hold a large number of entries, but not all URLs are equally likely to be visited (a Facebook URL is far more likely than some primary school's homepage), and the URLs of a single domain will be spread fairly evenly over the 4 billion available slots. So, given the set of full URLs that hash to the same 32-bit prefix, it is still quite easy to guess which one was actually visited (especially for Google, who has PageRank data on a huge number of the URLs out there...).
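
A rough sketch of why the truncated hash doesn't help much against an attacker with a candidate list (the exact hash function and encoding Google uses don't matter for the argument; the URLs below are just examples):

    import hashlib

    def prefix32(url):
        # First 4 bytes (32 bits) of the hash, as reported to the service.
        return hashlib.sha1(url.encode("utf-8")).digest()[:4]

    # The attacker's candidate list: full URLs he considers plausible,
    # ideally weighted by popularity.
    candidates = [
        "https://www.facebook.com/",
        "https://some-primary-school.example/",
        "http://cheatonmywife.fancytld/",
    ]

    def plausible_urls(reported_prefix):
        # All candidates that land in the same 32-bit slot as the reported prefix.
        return [u for u in candidates if prefix32(u) == reported_prefix]

    # Usually only a handful of candidates match, and popularity makes the
    # real one easy to pick out.
    print(plausible_urls(prefix32("https://www.facebook.com/")))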

Such an attack involves someone building a rainbow table of URLs he's interested in. You could make it more difficult by

  1. Using a password hashing function instead of SHA-1, which takes much longer to compute each hash - but this will make your browser plugin seem unresponsive.
  2. Salting your hashes. Obviously you can't give every user his own salt, or hashes of the same URL provided by different users will never match, most likely making your application pointless. But the larger your userbase grows, the fewer users have to share the same salt value. You still don't protect user privacy, but you make it harder to compute rainbow tables to find out exactly which URLs were visited, and if someone does compute one for the salt of a specific user, only the privacy of the other users sharing that salt is compromised. (See the sketch after this list.)
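
A sketch combining both mitigations, assuming users are partitioned into salt groups (the iteration count, the salts and the grouping scheme are placeholders):

    import hashlib

    def slow_salted_hash(url, salt, iterations=200_000):
        # PBKDF2 as an example of a deliberately slow hash; any password
        # hashing function (bcrypt, scrypt, Argon2) would serve the same purpose.
        return hashlib.pbkdf2_hmac("sha1", url.encode("utf-8"), salt, iterations).hex()

    # Each group of users shares one salt, so hashes of the same URL still
    # match *within* a group (hypothetical assignment scheme).
    group_salts = {0: b"salt-for-group-0", 1: b"salt-for-group-1"}

    def hash_for_user(user_id, url, num_groups=2):
        salt = group_salts[user_id % num_groups]
        return slow_salted_hash(url, salt)

    # Two users in the same group produce matching hashes; a user in another
    # group does not, so an attacker needs a separate rainbow table per salt.
    print(hash_for_user(0, "https://example.com/") == hash_for_user(2, "https://example.com/"))  # True
    print(hash_for_user(0, "https://example.com/") == hash_for_user(1, "https://example.com/"))  # False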

However, this still doesn't help at all in cases where an attacker isn't interested in the whole set of hashed URLs but only wants to answer very specific questions (e.g. which users visited URLs belonging to the domains on a given "blacklist"?). Since such a query only involves a short list (maybe a few dozen up to a few hundred thousand URLs, depending on the size of the blacklist), it's trivial to hash each of them in a short amount of time, no matter what countermeasures you use to slow the hashing down.

It's worse than that, because many websites only have a few common entry points, the most likely one being just the domain followed by an empty path. Other commonly visited paths are login pages, profile pages etc., so the number of URLs you need to hash in order to determine whether someone has visited a specific domain is most likely very small. An attacker doing this will miss users who used a deep link into a website, but he'll catch most of them.
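
A sketch of that kind of targeted query, hashing only a handful of likely entry points per blacklisted domain and intersecting with the collected hashes (the domains and paths are made up):

    import hashlib

    def url_hash(url):
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    blacklist = ["cheatonmywife.fancytld", "some-other-site.example"]  # hypothetical
    common_paths = ["/", "/login", "/index.html", "/profile"]

    # A handful of hashes per domain cover the most likely entry points.
    targets = {
        url_hash(f"https://{domain}{path}"): (domain, path)
        for domain in blacklist
        for path in common_paths
    }

    # In reality these would come from the service's database.
    collected_hashes = {url_hash("https://cheatonmywife.fancytld/login")}

    for h in collected_hashes & targets.keys():
        print("hit:", targets[h])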

And it gets even worse: if an attacker manages to recover one full URL from a hash a user provided, he can very easily get the URLs for a large part of that user's browsing session. How? Well, since he has a URL, he can fetch it with his own custom spider, look at all the links in the document, hash them and look for them in your database. Then he does the same with those links, and so on.
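
A rough sketch of that spidering step, using only the standard library (no error handling or politeness; the link extraction is deliberately crude):

    import hashlib
    import re
    import urllib.request
    from urllib.parse import urljoin

    def url_hash(url):
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def expand(known_url, collected_hashes, depth=2):
        # Starting from one recovered URL, follow links and report every one
        # whose hash also appears in the collected data.
        if depth == 0:
            return
        html = urllib.request.urlopen(known_url).read().decode("utf-8", "replace")
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(known_url, href)
            if url_hash(link) in collected_hashes:
                print("user also visited:", link)
                expand(link, collected_hashes, depth - 1)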

So you can do a few things to make attacks harder, but I don't think there's a way around the user having to basically trust you with his browsing history. The only ways around that which I can see involve building a distributed system not completely under your control and using it to collect URLs, for example a kind of mix network. Another avenue might be to have clients download large parts of your database contents, thus hiding which URLs they were actually interested in, and to submit new content to your database only in large batches, which would at least hide the time component of the user's browsing.
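
Roughly what that second idea looks like: the client downloads a large chunk of the hash database and does the lookup locally, so the server never learns which URL triggered a check (the bulk-download endpoint is hypothetical):

    import hashlib
    import urllib.request

    DB_DUMP_URL = "https://backend.example/hashes.txt"  # hypothetical bulk-download endpoint

    def load_local_db():
        # Download the (large) list of known hashes once and keep it locally.
        data = urllib.request.urlopen(DB_DUMP_URL).read().decode("ascii")
        return set(data.split())

    local_db = load_local_db()

    def check(url):
        # The lookup happens entirely on the client; no per-URL request
        # leaves the machine.
        return hashlib.sha1(url.encode("utf-8")).hexdigest() in local_db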


Short answer.

While you state you are concerned about your end-user’s privacy, it’s not clear whom you intend to be “protecting” them from, or for what reason.

  • If the core functionality of your application is to—essentially—farm user data from a client, send it to a server and deliver a result, then you as the recipient of that data will always know what that data is.
  • If your goal is to protect data in transmission from the client to the server from prying third parties, then an encryption scheme can be devised to protect transmission. But that is the absolute best you can do to protect user data.

Long answer.

First you say this:

  I’m working on a small side project which involves a browser add-on and a backend service. The service is pretty simple: Give it a URL and it will check if the URL exists in a database. If the URL is found some additional information is also returned.

Then you say this:

  The browser add-on forwards any URLs that the user opens to the service and checks the response. Now, sharing every URL you’re browsing is of course a big no-no.

The problem with the scheme you describe and your concerns for privacy is that your application’s core, inherent behavior is to share information that is traditionally considered private. So at the end of the day, what level of “privacy” do you intend to protect, for whom, from what and for what reason?

If someone agrees to use your application (having some basic, rudimentary knowledge of what the application does and what information it shares), chances are good they know that your backend server will know exactly what they browse. Oh sure, you can set up any elaborate, contrived hashing scheme you can come up with to “mask” the URL, but at the end of the day your backend server will know the end user’s data. And even if you are convinced this data is somehow unknown to you, it still does not stop the perception that you would know what the data is; and honestly, I cannot conceive of a scheme where you can provide this service and not know what URLs are being browsed.

If you are concerned about user data leaking out in transmission to potential third parties of some kind, then perhaps you can come up with an encryption scheme that protects the data in transit. To me, that is doable.
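
In practice that mostly means speaking TLS between the add-on and the backend; a minimal sketch (the endpoint and payload format are made up):

    import hashlib
    import json
    import urllib.request

    def report(url):
        payload = json.dumps({"sha1": hashlib.sha1(url.encode("utf-8")).hexdigest()}).encode()
        req = urllib.request.Request(
            "https://backend.example/check",  # hypothetical endpoint; https:// gives TLS in transit
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:  # certificate verification is on by default
            return json.load(resp)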

But if your overall desire is to collect private data of some kind, analyze it and then deliver a result, the overall concept of you (and your system) somehow not knowing specifics about that data is flawed. You control the backend of a process like this, and you have complete access to the data whether you like it or not.