Is it possible to sniff HTTPS URLs?
TL;DR An attacker cannot see anything past the domain.
Structure of an HTTP request
An HTTP request consists of a request line (the method, the path, and the protocol version) followed by the headers. The most common methods are GET, POST, and HEAD, which retrieve a page, transfer data, or request only the response headers, respectively. TLS encrypts the entirety of HTTP traffic, including the headers and the method. In HTTP, the path in the URL is sent in the request line, together with the headers. Take this example, with wget loading the page foo.example.com/some/page.html. This text, as ASCII, is sent to the server:
    GET /some/page.html HTTP/1.1
    User-Agent: Wget/1.19.1 (linux-gnu)
    Accept: */*
    Accept-Encoding: identity
    Host: foo.example.com
The server will then respond with an HTTP status code, some headers of its own, and optionally some data (such as HTML). An example, giving a 301 redirect and some plain text as a response, may be:
    HTTP/1.1 301 Moved Permanently
    Date: Wed, 27 Dec 2017 04:42:54 GMT
    Server: Apache
    Location: https://bar.example.com/new/location.html
    Content-Length: 56
    Content-Type: text/plain

    Thank you Mario, but our princess is in another castle!
This tells the client that the correct location is elsewhere.
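To make the request/response structure concrete, here is a minimal sketch that splits the example response above into status, headers, and body. The function name `parse_response` is my own, and the parsing is deliberately naive (no folded headers, no chunked encoding); it is an illustration, not a real HTTP parser.

```python
# The example response from above, as the bytes a server would send.
RAW_RESPONSE = (
    b"HTTP/1.1 301 Moved Permanently\r\n"
    b"Date: Wed, 27 Dec 2017 04:42:54 GMT\r\n"
    b"Server: Apache\r\n"
    b"Location: https://bar.example.com/new/location.html\r\n"
    b"Content-Length: 56\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"Thank you Mario, but our princess is in another castle!\n"
)


def parse_response(raw: bytes):
    """Split a raw HTTP/1.1 response into (status, reason, headers, body)."""
    # Headers and body are separated by one empty line.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    _version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status), reason, headers, body


status, reason, headers, body = parse_response(RAW_RESPONSE)
print(status, reason)
print("redirect target:", headers["location"])
```

Over HTTPS, every byte of `RAW_RESPONSE` travels inside the encrypted TLS stream, so none of these fields is readable on the wire.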
These are the headers as they would be sent to the site directly over TCP. TLS works at a different layer, beneath HTTP, so all of this is encrypted. This includes the page you are accessing with the GET method. Note that, although the Host header is sent along with the other headers and is thus encrypted, the host can still be obtained through an rDNS lookup on the IP address, or by checking SNI, which transmits the domain in plaintext.
Structure of a URL
    https://foo.example.com/some/page.html#some-fragment
    |proto ||   domain    ||    path     ||  fragment  |
- proto - There are only two protocols in common use, HTTP and HTTPS.
- domain - The domain, here example.com and its subdomains (*.example.com), is detectable with rDNS or SNI.
- path - The path is completely encrypted and can only be read by the target server.
- fragment - The fragment is visible only to the web browser and is not transmitted.
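The breakdown above can be sketched with Python's standard `urllib.parse.urlsplit`. The function name `eavesdropper_view` and the visibility labels are my own shorthand for the list above, not an established API.

```python
from urllib.parse import urlsplit


def eavesdropper_view(url: str) -> dict:
    """Map each URL component to (value, who can see it), per the list above."""
    parts = urlsplit(url)
    return {
        "proto": (parts.scheme, "inferable from the destination port"),
        "domain": (parts.hostname, "visible via SNI or rDNS"),
        "path": (parts.path, "encrypted inside TLS"),
        "fragment": (parts.fragment, "never sent; stays in the browser"),
    }


url = "https://foo.example.com/some/page.html#some-fragment"
for name, (value, visibility) in eavesdropper_view(url).items():
    print(f"{name:9} {value:18} {visibility}")
```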
What an attacker can see
So what can an attacker see if you make a request over HTTPS? Let's take the previous hypothetical request from the perspective of a passive eavesdropper on the network. If I wanted to know what you are accessing, I have only limited options:
- I see you making a web request encrypted with HTTPS going to 203.0.113.98.
- I see that the destination port is 443, which I know is used for HTTPS.
- I do an rDNS lookup and see that IP is used for example.com and example.org.
- I look at the SNI record and see you are connecting to foo.example.com.
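To show how little effort the SNI step takes, here is a rough sketch that builds a stripped-down TLS ClientHello and then reads the server name back out of the raw bytes, with no keys involved. Both helper names are hypothetical, the hello is a toy (zeroed random, one cipher suite, only the SNI extension), and the parser assumes a well-formed hello that fits in one record. It also assumes the client does not use Encrypted Client Hello, which hides this field.

```python
import struct


def build_client_hello(hostname: str) -> bytes:
    """Build a toy ClientHello record carrying only an SNI extension."""
    name = hostname.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name  # name_type 0 = host_name
    sni_data = struct.pack("!H", len(entry)) + entry        # server_name_list
    sni_ext = struct.pack("!HH", 0, len(sni_data)) + sni_data  # extension type 0
    extensions = struct.pack("!H", len(sni_ext)) + sni_ext
    body = (
        b"\x03\x03"                            # legacy_version
        + bytes(32)                            # random (zeroed for the demo)
        + b"\x00"                              # empty session id
        + struct.pack("!H", 2) + b"\x13\x01"   # one cipher suite
        + b"\x01\x00"                          # one compression method: null
        + extensions
    )
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body  # type 1 = ClientHello
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


def extract_sni(record: bytes):
    """Return the plaintext server name from a ClientHello record, or None."""
    if record[0] != 0x16 or record[5] != 0x01:
        return None                      # not a handshake record / not a ClientHello
    pos = 5 + 4                          # record header + handshake header
    pos += 2 + 32                        # version + random
    pos += 1 + record[pos]               # session id
    (suites_len,) = struct.unpack_from("!H", record, pos)
    pos += 2 + suites_len                # cipher suites
    pos += 1 + record[pos]               # compression methods
    (ext_total,) = struct.unpack_from("!H", record, pos)
    pos += 2
    end = pos + ext_total
    while pos + 4 <= end:
        ext_type, ext_len = struct.unpack_from("!HH", record, pos)
        pos += 4
        if ext_type == 0:                # server_name extension
            # Skip list length (2 bytes) and name type (1 byte).
            (name_len,) = struct.unpack_from("!H", record, pos + 3)
            return record[pos + 5 : pos + 5 + name_len].decode("ascii")
        pos += ext_len
    return None


print(extract_sni(build_client_hello("foo.example.com")))
```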
This is all I could do. I would not be able to see the path you are requesting, or even what method you are using, short of heuristic analysis based on the sizes of the data being sent and received, called traffic analysis attacks. For a large service like Wikipedia, I would have no idea what article you are viewing based on analysis of the unencrypted data alone.
An important note about referers on older browsers
Even though HTTPS encrypts the path you are accessing, if you click a hyperlink within that site which goes to an unencrypted page, the full path may be leaked in the Referer header. This is no longer the case for many newer browsers, but older or non-compliant browsers may still have this behavior, as will websites which set the HTML5 referrer meta tag to always send the information. An example request sent unencrypted by a client going from https://example.com/private/details.html to http://example.org/public/page.html in such a case would be:
    GET /public/page.html HTTP/1.1
    Referer: https://example.com/private/details.html
    User-Agent: Wget/1.19.1 (linux-gnu)
    Accept: */*
    Accept-Encoding: identity
    Host: example.org
As such, navigating from an HTTPS page to an HTTP page may leak the full URL (excluding the fragment) of the previous page, so keep that in mind.
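The difference between the two behaviors can be sketched as a small decision function. This models only the "omit the Referer on an HTTPS to HTTP downgrade" rule; real browsers implement richer referrer policies, and the function name and the `downgrade_protection` flag are my own inventions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def referer_for(source_url: str, target_url: str, downgrade_protection: bool = True):
    """Return the Referer value a client would send, or None if it is omitted."""
    src = urlsplit(source_url)
    dst = urlsplit(target_url)
    # Protected clients send nothing when leaving HTTPS for plain HTTP.
    if downgrade_protection and src.scheme == "https" and dst.scheme != "https":
        return None
    # The fragment is never included in a Referer header.
    return urlunsplit((src.scheme, src.netloc, src.path, src.query, ""))


source = "https://example.com/private/details.html#notes"
target = "http://example.org/public/page.html"
print(referer_for(source, target))                              # protected client
print(referer_for(source, target, downgrade_protection=False))  # leaky client
```

With protection disabled, the full private URL (minus the fragment) ends up in a plaintext request that any eavesdropper can read.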
The naive answer is no: the URL is encrypted in the TLS stream. But that answer ignores a great deal of relevant information.
Suppose it's Wikipedia. How long is an HTTP GET request for https://en.wikipedia.org/wiki/Cryptography versus https://en.wikipedia.org/wiki/Information_security, assuming all the header fields are the same? If you can measure the length of a request, which will likely be submitted in a single TLS record, then you can probably tell these apart.
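As a rough illustration, here is a sketch that builds both requests with identical headers and compares their lengths. The header set is just an assumption borrowed from a typical wget request; a real browser sends more headers, but the difference in length still comes only from the path.

```python
from urllib.parse import urlsplit


def get_request(url: str) -> bytes:
    """Build a minimal GET request for url with a fixed, assumed header set."""
    parts = urlsplit(url)
    lines = [
        f"GET {parts.path or '/'} HTTP/1.1",
        "User-Agent: Wget/1.19.1 (linux-gnu)",
        "Accept: */*",
        "Accept-Encoding: identity",
        f"Host: {parts.hostname}",
        "",
        "",  # the two empty strings produce the terminating blank line
    ]
    return "\r\n".join(lines).encode("ascii")


crypto = get_request("https://en.wikipedia.org/wiki/Cryptography")
infosec = get_request("https://en.wikipedia.org/wiki/Information_security")
# The two paths differ by 8 characters, and so do the requests.
print(len(crypto), len(infosec), len(infosec) - len(crypto))
```

Encryption changes the bytes but, absent padding, adds a predictable overhead per record, so the difference in plaintext length carries over to what is observed on the wire.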
That doesn't help you to distinguish a request for the article on cryptography from the article on choreography, of course. It also doesn't help if the TLS client cleverly adds some padding, ignored by the server, to the TLS record to round it to a multiple of some block size. But English Wikipedia has a much longer article on cryptography than on choreography. So even if the endpoints pad their TLS records to the maximum 16384 bytes, you can probably distinguish the article on cryptography from the article on choreography.
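A small sketch of why padding helps with requests but not with whole pages. The request lengths and article sizes below are made-up numbers chosen for illustration, not measurements of Wikipedia, and both helper names are hypothetical.

```python
MAX_RECORD = 16384  # maximum TLS record plaintext size, as mentioned above


def padded_len(n: int, block: int) -> int:
    """Round n up to the next multiple of block."""
    return -(-n // block) * block


def records_needed(size: int, record_size: int = MAX_RECORD) -> int:
    """Number of full-size records needed to carry size bytes."""
    return -(-size // record_size)


# Hypothetical request lengths: a small difference vanishes after padding.
print(padded_len(150, 64), padded_len(158, 64))

# Hypothetical response sizes: a long and a short article still differ,
# because the long one needs more records even if each is padded to the maximum.
print(records_needed(100_000), records_needed(20_000))
```

Padding hides differences smaller than the block size, but the total amount of data still leaks through the number of records.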
There's a complication from your perspective as the attacker: the client may use the same TLS stream for many requests and many responses. But they will likely all be timed in a burst as the victim loads a single page with embedded CSS, images, JavaScript, etc., and then go silent as the victim reads the page. The timing and number of these requests provides another variable on which you can discriminate what page they were looking for.
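The burst structure can be recovered from packet timestamps alone. Here is a minimal sketch that groups observed request times into bursts; the timestamps and the one-second gap threshold are invented for the example, and `group_bursts` is a hypothetical helper.

```python
def group_bursts(timestamps, max_gap=1.0):
    """Group event times (in seconds) into bursts separated by more than max_gap."""
    bursts = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)   # close to the previous event: same burst
        else:
            bursts.append([t])     # long silence: a new page load begins
    return bursts


# Hypothetical capture: four requests for one page, then two for the next.
observed = [0.00, 0.05, 0.12, 0.30, 45.0, 45.1]
for burst in group_bursts(observed):
    print(f"burst at t={burst[0]:.2f}s with {len(burst)} requests")
```

The number of requests per burst, together with their sizes, gives an attacker one more feature to match against known pages.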
All these variables can be fed into a probabilistic model of pages—here's one example, lifted from the anonymity bibliography. Defeating that one example doesn't mean that the distribution of data an attacker on the network learns for one page is indistinguishable from another page, just that that particular distinguisher isn't as effective.
So, are you, as the eavesdropper, guaranteed to be able to read the URL off the wire? No: it is encrypted in the TLS stream (unless the NULL cipher is chosen!), so at best you can infer it from other observable variables with probabilistic dependencies on it.
On the other hand, is the victim guaranteed that their URL is concealed from an eavesdropper? No: there are many variables dependent on the URL that an attacker may be able to infer juicy information about, like which sexually transmitted disease you're reading about at the Mayo Clinic.
(Note that anything in the fragment of a URL, the part after the # mark in https://en.wikipedia.org/wiki/Cryptography#Terminology, is not transmitted in the HTTP GET request at all, unless there is some script on the page that triggers different network traffic dependent on the URL fragment.)