Should I worry about remote timing attacks on string comparison?
There are published successes with remote timing attacks.
From the document: "... we can reliably distinguish remote timing differences as low as 20µs." So yes, you should be worried about the underlying implementation of .equals() (spoiler: it is not constant-time). Implement .equals() by XOR-ing corresponding characters and OR-ing the results together, so the comparison takes the same time no matter where (or whether) the strings differ.
Here's a Python implementation as an example of a timing-independent comparison of bytes.
def equals(bytes1, bytes2):
    # Refuse to compare inputs of different lengths up front.
    if len(bytes1) != len(bytes2):
        return False
    # Accumulate the XOR of every byte pair. The loop always walks the
    # full length, so the running time does not depend on where (or
    # whether) the inputs differ.
    differences = 0
    for a, b in zip(bytes1, bytes2):
        differences |= a ^ b
    return differences == 0
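For what it's worth, Python's standard library already provides a constant-time comparison of this kind, hmac.compare_digest, which is generally preferable to a hand-rolled loop if it is available to you. A minimal usage sketch (the two byte strings below are just placeholders):

import hmac

stored = b"s3cr3t-token"      # placeholder for the secret held by the server
presented = b"s3cr3t-tokem"   # placeholder for what the client sent

# compare_digest returns True/False in time that does not depend on
# where the inputs differ.
print(hmac.compare_digest(stored, presented))  # False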
In theory, this is a possible exploit, and if you are in super-paranoia mode, you should assume the answer is "Yes". In every other case, the answer is "No".
Although there are published papers (one is linked in the answer by @Oasiscircle) claiming successful timing attacks, one has to read the preconditions carefully, too.
These published "practical" attacks work on some algorithms on a LAN with one, at most two, switches in between. Which implies an almost perfectly reliable, constant round trip time. For that scenario, it is indeed practical to attack certain algorithms via timing, but this is meaningless in the context of the question.
In fact, I consider these remote attacks as "cheating". The fact that an attack is remote is irrelevant if you carefully design the experiment so the delay is nevertheless almost exactly predictable.
When attacking any server on the internet, this precondition does not hold (not even remotely, pun intended), even on a server that is geographically and topologically near.
Also, attacking a string comparison via timing is not at all the same as attacking an RSA calculation. It is much more difficult because both the entire operation and the measurable difference are a lot smaller.
A string comparison of a password (assuming your passwords are "reasonably" sized) takes a few hundred cycles or less, of which the possible initial cache/TLB miss is by far the biggest, dominating factor, followed by the terminal mispredicted branch (which happens for both a match and a non-match). The difference between a match and a non-match is maybe one or two dozen nanoseconds.
A context switch takes several hundreds of nanoseconds, as does a cache miss. Schedulers typically operate at a micro- or millisecond resolution and do some very non-trivial work (in the hundreds/thousands of nanoseconds) in between at times that are hard to predict to say the least.
Reliably measuring differences on the nanosecond scale at all is not entirely trivial, either. Ordinary programmable timers do not nearly have the required resolution. HPET on commodity hardware is guaranteed to deliver 100ns resolution (per specification) and in practice goes down to 1ns on many implementations. However, it works by generating an interrupt. This means you can schedule a timer to some point in time precise to the nanosecond, but you cannot really use it to measure single nanoseconds. Also, the interrupt adds an overhead and uncertainty of some dozen nanoseconds (... to some dozen nanoseconds that you want to measure!). Cycle counters need to be serialized to be accurate. Which, too, renders them rather useless for precisely measuring an external event at nanosecond resolution since their accuracy depends on what the pipeline looked like.
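If you want a rough, admittedly crude feel for the measurement side of this on your own machine, you can look at the smallest non-zero gap between consecutive readings of the highest-resolution clock Python exposes. This only illustrates clock granularity plus call overhead, not the attack itself:

import time

# Take many consecutive readings of perf_counter_ns() and record the
# smallest non-zero gap between them: a crude lower bound on how finely
# this clock (plus the call overhead) can resolve events.
prev = time.perf_counter_ns()
deltas = []
for _ in range(100_000):
    now = time.perf_counter_ns()
    if now != prev:
        deltas.append(now - prev)
    prev = now
print("smallest observed gap:", min(deltas), "ns")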
There are more things to consider which add unpredictable noise, such as legitimate users (yes, those exist, too!) and interrupt coalescing.
Trying to divine something-nano from samples that include several something-different-nano as well as something-micro and several something-milli is a Herculean task. That's noise from several independent sources on every scale.
Finally, consider the mention of "Java", which means that e.g. a garbage collector may be executing at an unpredictable time (in any case, unpredictable for a remote attacker), causing unpredictable jitter on an unknown (micro, milli?) scale.
In theory, you could of course collect a large number of samples, even at lower resolution, say microsecond scale, and statistically eliminate the various sources of noise. You would never be able to tell for absolutely certain whether a password is correct, but you will eventually be able to tell with a sufficiently high probability (say 85% or 90%, or even 99%), and you can then manually verify those few candidates. That's good enough!
This is possible, at least in theory, but it would take a huge number of samples even for divining a single password. And saying "huge" is really an understatement of galactic proportions. The number of samples needed practically implies that you must parallelize the attack, or it will take forever.
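To put a very rough number on "huge": suppose, purely for illustration, that the per-request difference you want to detect is around 20 ns and the combined network/server jitter has a standard deviation around 1 ms. The standard error of a mean only shrinks as 1/sqrt(n), so a back-of-envelope estimate of the samples needed per candidate guess looks like this:

import math

# Illustrative assumptions, not measurements:
signal_ns = 20.0           # timing difference we hope to detect, in ns
noise_sd_ns = 1_000_000.0  # per-sample jitter: ~1 ms, expressed in ns
z = 3.0                    # demand the signal stand ~3 standard errors above the noise

# Standard error of the mean is noise_sd / sqrt(n);
# solve z * noise_sd / sqrt(n) <= signal for n.
n = math.ceil((z * noise_sd_ns / signal_ns) ** 2)
print(f"samples needed per candidate guess: {n:,}")  # roughly 2.25e10

And that is per candidate guess, under the generous assumption that the noise is well-behaved and independent, which it is not.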
Now, parallelizing such a timing attack to any serious extent is not easily possible because you are subject to the observer effect (in the same sense as in quantum mechanics).
Doing a couple of probes (maybe 5-8) in parallel should work, presuming that the server has enough idle cores, but as you scale up, eventually one probe will inevitably affect another probe's outcome in an unpredictable and disproportionate way. There is nothing you can do to prevent that from happening, so parallelizing doesn't really work well (I am not even taking into account the fact that interrupts usually go over a single core and that there is only a single physical copper wire which data must go through, so even if the server still has idle cores remaining, it may quite possibly be the case that one probe affects another).
On the other hand, running a not-massively-parallel attack is bound to fail because you will die of old age before you find a single password.
Store a good cryptographic hash of the secret on the server (i.e. treat it like a password). Your comparison would then be to hash the string the client sends you and compare the two hashes.
If the secret has high enough entropy, this should eliminate timing attacks and prevent leaking of the real secret string, since it should be practically impossible to recover the secret from the hash.
On the other hand if the amount of entropy in the secret is not enough to prevent dictionary attacks, this alone isn't enough. An early-exit comparison can still allow the attacker to learn the first few bytes of the hash; then a subsequent dictionary attack might be able to recover the secret from its hash. (See also Timing attacks on password hashes for more discussion of the possibility of such timing attacks.) This can be prevented by comparing the two hashes using a constant-time comparison method.
So, the most robust solution would be to store a hash of the secret, hash the string the client sends you, and compare the two hashes using a secure constant-time comparison method. Using a salted hash wouldn't hurt, either.
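As a concrete sketch of that last paragraph (the helper names are mine, and plain salted SHA-256 is only appropriate if the secret is high-entropy; a low-entropy password would want a proper slow password hash instead):

import hashlib
import hmac
import os

def hash_secret(secret: bytes, salt: bytes) -> bytes:
    # Salted SHA-256 of the secret; the server stores only (salt, digest).
    return hashlib.sha256(salt + secret).digest()

# Provisioning: done once, when the secret is set up.
salt = os.urandom(16)
stored_digest = hash_secret(b"correct horse battery staple", salt)

def verify(client_supplied: bytes) -> bool:
    # Hash whatever the client sent with the same salt, then compare the
    # two fixed-length digests in constant time.
    candidate = hash_secret(client_supplied, salt)
    return hmac.compare_digest(stored_digest, candidate)

print(verify(b"correct horse battery staple"))  # True
print(verify(b"wrong guess"))                   # False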