JDBC Connection Pool test query "SELECT 1" does not catch AWS RDS Writer/Reader failover
I've been giving this a lot of thought in the two months since my original reply...
How Aurora endpoints work
When you start up an Aurora cluster you get multiple hostnames to access the cluster. For the purposes of this answer, the only two that we care about are the "cluster endpoint," which is read-write, and the "read-only endpoint," which is (you guessed it) read-only. You also have an endpoint for each node within the cluster, but accessing nodes directly defeats the purpose of using Aurora, so I won't mention them again.
For example, if I create a cluster named "example", I'll get the following endpoints:
- Cluster endpoint:
example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
- Read-only endpoint:
example.cluster-ro-x91qlr44xxxz.us-east-1.rds.amazonaws.com
You might think that these endpoints would refer to something like an Elastic Load Balancer, which would be smart enough to redirect traffic on failover, but you'd be wrong. In fact, they're simply DNS CNAME entries with a really short time-to-live:
dig example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40120
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. IN A
;; ANSWER SECTION:
example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. 5 IN CNAME example.x91qlr44xxxz.us-east-1.rds.amazonaws.com.
example.x91qlr44xxxz.us-east-1.rds.amazonaws.com. 4 IN CNAME ec2-18-209-198-76.compute-1.amazonaws.com.
ec2-18-209-198-76.compute-1.amazonaws.com. 7199 IN A 18.209.198.76
;; Query time: 54 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Dec 14 18:12:08 EST 2018
;; MSG SIZE rcvd: 178
When a failover happens, the CNAMEs are updated (from example
to example-us-east-1a
):
; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27191
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. IN A
;; ANSWER SECTION:
example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. 5 IN CNAME example-us-east-1a.x91qlr44xxxz.us-east-1.rds.amazonaws.com.
example-us-east-1a.x91qlr44xxxz.us-east-1.rds.amazonaws.com. 4 IN CNAME ec2-3-81-195-23.compute-1.amazonaws.com.
ec2-3-81-195-23.compute-1.amazonaws.com. 7199 IN A 3.81.195.23
;; Query time: 158 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Dec 14 18:15:33 EST 2018
;; MSG SIZE rcvd: 187
The other thing that happens during a failover is that all of the connections to the "cluster" endpoint get closed, which will fail any in-process transactions (assuming that you've set reasonable query timeouts).
The connections to the "read-only" endpoint don't get closed, which means that whatever node gets promoted will get read-write traffic in addition to read-only traffic (assuming, of course, that your application doesn't just send all requests to the cluster endpoint). Since read-only connections are typically used for relatively expensive queries (eg, reporting), this may cause performance problems for your read-write operations.
The Problem: DNS Caching
When failover happens, all in-process transactions will fail (again, assuming that you've set query timeouts). There will be a short amount of time that any new connections will also fail, as the connection pool attempts to connect to the same host before it's done with recovery. In my experience, failover takes around 15 seconds, during which time your application shouldn't expect to get a connection.
After that 15 seconds (or so), everything should return to normal: your connection pool attempts to connect to the cluster endpoint, it resolves to the IP address of the new read-write node, and all is well. But if anything prevents resolving that chain of CNAMEs, you may find that your connection pool makes connections to a read-only endpoint, which will fail as soon as you try an update operation.
In the case of the OP, he had his own CNAME with a longer timeout. So rather than connect to the cluster endpoint directly, he would connect to something like database.example.com
. This is a useful technique in a world where you would manually fail-over to a replica database; I suspect it's less useful with Aurora. Regardless, if you use your own CNAMEs to refer to database endpoints, you need them to have short time-to-live values (certainly no more than 5 seconds).
In my original answer, I also pointed out that Java caches DNS lookups, in some cases forever. The behavior of this cache depends on (I believe) the version of Java, and also whether you're running with a security manager installed. With OpenJDK 8 running as an application, it appears that the JVM will delegate all naming lookups and not cache anything itself. However, you should be familiar with the networkaddress.cache.ttl
system property, as described in this Oracle doc and this SO question.
However, even after you've eliminated any unexpected caches, there may still be times where the cluster endpoint is resolved to a read-only node. That leaves the question of how you handle this situation.
Not-so-good solution: use a read-only test on checkout
The OP was hoping to use a database connection test to verify that his application was running on a read-only node. This is surprisingly hard to do: most connection pools (including HikariCP, which is what the OP is using) simply verify that the test query executes successfully; there's no ability to look at what it returns. This means that any test query has to throw an exception to fail.
I haven't been able to come up with a way to make MySQL throw an exception with just a stand-alone query. The best I've come up with is to create a function:
DELIMITER EOF
CREATE FUNCTION throwIfReadOnly() RETURNS INTEGER
BEGIN
IF @@innodb_read_only THEN
SIGNAL SQLSTATE 'ERR0R' SET MESSAGE_TEXT = 'database is read_only';
END IF;
RETURN 0;
END;
EOF
DELIMITER ;
Then you call that function in your test query:
select throwIfReadOnly()
This works, mostly. When running my test program I could see a series of "failed to validate connection" messages, but then, inexplicably, the update query would run with a read-only connection. Hikari doesn't have a debug message to indicate which connection it hands out, so I couldn't identify whether it had allegedly passed validation.
But aside from that possible problem, there's a deeper issue with this implementation: it hides the fact that there's a problem. A user makes a request, and maybe waits for 30 seconds to get a response. There's nothing in the log (unless you enable Hikari's debug logging) to give a reason for this delay.
Moreover, while the database is inaccessible Hikari is furiously trying to make connections: in my single-threaded test, it would attempt a new connection every 100 milliseconds. And these are real connections, they simply go to the wrong host. Throw in an app-server with a few dozen or hundred threads, and that could cause a significant ripple effect on the database.
Better solution: use a read-only test on checkout, via a wrapper Datasource
Rather than let Hikari silently retry connections, you could wrap the HikariDataSource
in your own DataSource
implementation and test/retry yourself. This has the benefit that you can actually look at the results of the test query, which means that you can use a self-contained query rather than calling a separately-installed function. It also lets you log the problem using your preferred log levels, lets you pause between attempts, and gives you a chance to change pool configuration.
private static class WrappedDataSource
implements DataSource
{
private HikariDataSource delegate;
public WrappedDataSource(HikariDataSource delegate) {
this.delegate = delegate;
}
@Override
public Connection getConnection() throws SQLException {
while (true) {
Connection cxt = delegate.getConnection();
try (Statement stmt = cxt.createStatement()) {
try (ResultSet rslt = stmt.executeQuery("select @@innodb_read_only")) {
if (rslt.next() && ! rslt.getBoolean(1)) {
return cxt;
}
}
}
// evict connection so that we won't get it again
// should also log here
delegate.evictConnection(cxt);
try {
Thread.sleep(1000);
}
catch (InterruptedException ignored) {
// if we're interrupted we just retry
}
}
}
// all other methods can just delegate to HikariDataSource
This solution still suffers from the problem that it introduces a delay into user requests. True, you know that it's happening (which you didn't with the on-checkout test), and you could introduce a timeout (limit the number of times through the loop). But it still represents a bad user experience.
The best (imo) solution: switch into "maintenance mode"
Users are incredibly impatient: if it takes more than a few seconds to get a response back, they'll probably try to reload the page, or submit the form again, or do something that doesn't help and may hurt.
So I think the best solution is to fail quickly and let them know that somethng's wrong. Somewhere near the top of the call stack you should already have some code that responds to exceptions. Maybe you just return a generic 500 page now, but you can do a little better: look at the exception, and return a "sorry, temporarily unavailable, try again in a few minutes" page if it's a read-only database exception.
At the same time, you should send a notification to you ops staff: this may be a normal maintance window failover, or it may be something more serious (but don't wake them up unless you have some way of knowing that it's more serious).
set connection pool idle connection timeout in your java code datasource. set around 1000ms