What could cause socket ConnectException: Connection timed out?
We have come across these in a similar case to yours. Usually at high load and not easy to reproduce on test. Have not fixed it yet but this is the steps we went through.
If it's a firewall issue, we would get a Connection Refused or the SocketTimeout exception.
1) Are you able to track these requests in the access log on the server - do they show an HTTP status 200 or 404 or something else? In our case, the server (IIS in this case) logs showed the client closed the connection and not the server. So that was a mystery.
Update: If the client always gets a 200, then the server has actually sent back some response but I suspect the response byte-size (if this is recorded in the access logs) will show a different value from that of the normal response size for that request.
If it shows the same size of response, then you have a (may be not plausible) condition that the server actually responded correctly but the client did not get the response back because the connection terminated somewhere in between.
2) The network admin teams looked at the TCP/IP traffic to determine which end (or intermediate router) is terminating the HTTP / TCP-IP conversation. And once we understand which end is terminating the connection is to look at why. Someone knowledgable enough could run snoop
3) Is there a max number of requests configured/restricted on the server - and is that throttling your connections?
4) Are there any intermediate load balancers at which requests could be dropped?
Update: One more thing we wanted to, but did not complete is to create a static route between client and server to reduce the number of hops in between and ensure no network related connection drops. See http://en.wikipedia.org/wiki/Static_routing
5) Another suggestion is setting the ConnectTimeout too to see if these work with a higher value. Update: You might want to try conn.getErrorStream()
Returns the error stream if the connection failed but the server sent useful data nonetheless. If the connection was not connected, or if the server did not have an error while connecting or if the server had an error but no error data was sent, this method will return null.
6) Could also try taking a set of thread dumps on the server 5 seconds apart, to see if any thread shows these incoming requests on the server.
Update: As of today we learnt to live with this problem, because we totalled the failure rate to be 200-300 out of 400,000 requests per day which is 0.00075 %