Application network connection drop-outs

Overview

There is a network-related issue that I come across time and time again in Data Centre migration and transformation projects. It is the phenomenon of migrated applications appearing to work OK but then having their network connections drop out after a while. Typically, this is far more apparent during the testing phase than in an actual full production migration, but it can and does happen in both scenarios.

Setting the scene

To illustrate what I mean, here is a recent scenario I encountered. A banking application was being migrated to a new Data Centre. The application was a fairly typical 3-tier design. At the front end there were several Internet-facing Web servers; in the middle there was an application server processing business logic, which in turn talked to a database server at the back end. As the diagram below shows, in the legacy Data Centre the Web servers were separated from the application and database servers by a firewall. However, the application and database servers were effectively in the same zone, with no firewall between them.

Legacy Firewall configuration

Modern best practice often involves creating many security zones within an overall network. Doing so allows much finer-grained access control to applications and data. So the same application infrastructure, reworked to fit the new Data Centre using multiple security zones, may look something like the set-up shown below. In particular, a firewall now sits between the application server and the database server.

Reworked network zone infrastructure

The problem emerged during pre-migration acceptance testing. All the application’s servers were started up and the test team carried out some basic functionality testing. Then they stopped overnight with a view to resuming testing in the morning. When they tried to resume the next day, the application did not work. Investigation eventually revealed that the application server’s database connections had all dropped. (Best practice would dictate that the application server logic be coded to re-establish the connections, but for various reasons that was not the case.) Restarting everything fixed the problem, but clearly that would be a poor solution.

The Cause

Many firewalls impose a time limit on a connection that has no traffic flowing through it. A value of 90 minutes seems to be common for this setting, but check with your firewall administrators. When a TCP connection is first established the firewall starts an idle timer for it, and the timer is restarted whenever traffic passes over the connection.

Firewall connection timer starts

If this period elapses with no traffic passing, the firewall will silently tear down the connection and will not allow any more data for it to pass.

Firewall connection timer expires

As the firewall is transparent to the applications at both ends of the connection, it sends neither of them any message about the “tear down”. The applications will only discover that this has happened the next time they try to send data over the connection, when they will get an error (or a timeout) returned from their socket layer API calls.
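The idle-timer behaviour described above can be modelled in a few lines of Python. This is only an illustrative sketch: the class name `IdleTimeoutFirewall` and its methods are hypothetical, the 90-minute value is simply the figure quoted above rather than a property of any particular firewall, and time is passed in explicitly so that nothing actually sleeps.

```python
class IdleTimeoutFirewall:
    """Toy model of a stateful firewall's per-connection idle timer (hypothetical)."""

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s
        self.last_seen = {}  # connection id -> time of last packet, in seconds

    def open(self, conn, now):
        # Connection established: start the idle timer.
        self.last_seen[conn] = now

    def packet(self, conn, now):
        """Return True if the packet is passed, False if it is silently dropped."""
        last = self.last_seen.get(conn)
        if last is None:
            return False  # the firewall holds no state for this connection
        if now - last > self.idle_timeout_s:
            del self.last_seen[conn]  # entry expired: torn down, nobody is told
            return False
        self.last_seen[conn] = now  # any traffic restarts the timer
        return True


fw = IdleTimeoutFirewall(idle_timeout_s=90 * 60)
fw.open("app->db", now=0)
print(fw.packet("app->db", now=60 * 60))             # True: idle for 60 minutes
print(fw.packet("app->db", now=60 * 60 + 8 * 3600))  # False: idle overnight
```

Note that in the model, as in real life, the sender only learns about the teardown when it next tries to use the connection.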

Trying to keep inactive connection alive

So one way of avoiding this problem would be to build “keepalive” functionality into both the server side and the client side of an application. How would that work? At the application level, if one side has not communicated with the other for some pre-set period, it sends an “I am still here” message. The application is coded so that this message does not affect functionality and is effectively thrown away, but it keeps traffic passing through the firewall. Some applications do this, but not many.

Rather, well-coded applications make use of an option on their TCP socket connections known as “SO_KEEPALIVE”. This is an option that an application developer can set on a socket when a TCP network connection is established. It hands responsibility for keeping the connection alive from the application to the operating system’s TCP stack.

Well, actually I am lying: the original intention of this feature was not to keep long-running connections alive but rather to detect (and kill) failed connections. However, the feature now does dual duty. In these days of 24×7 Web applications we expect our Web servers to remain connected to app servers or back-end database servers for weeks, if not months, at a time. It so happens that we can use this feature (originally designed to detect drop-outs of things such as remote terminal sessions) to force firewalls to keep the connections we need open during periods of low (or even no) transaction volume.
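For readers who have not seen it in code, this is roughly what enabling the option looks like. The sketch uses Python's standard `socket` module purely for illustration; the helper name `make_keepalive_socket` is my own, and real applications will do the equivalent in whatever language they are written in.

```python
import socket


def make_keepalive_socket():
    """Create a TCP socket with SO_KEEPALIVE switched on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the OS TCP stack to probe the peer when the connection goes idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


sock = make_keepalive_socket()
# getsockopt returns a non-zero value when the option is enabled.
print(bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
sock.close()
```

The option only switches the mechanism on for that socket. How long the OS waits before probing is governed by the system-wide parameters discussed later.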

Does your application use TCP Keepalive?

For the fix that I will detail in the next section to work, you need to know whether the application is actually coded to use the SO_KEEPALIVE option on its network connections. Obviously, most of the time you won’t have the source code of the application, so you won’t be able to inspect that (even assuming you are used to reading large pieces of source code!). So how can you tell if your application uses it? It turns out to be surprisingly difficult on most operating systems, as many, including RHEL variants of Linux such as OL and CentOS, don’t make that information available via the “/proc” interface (for an overview of /proc see https://www.tldp.org/LDP/sag/html/proc-fs.html).

It is also important to remember that the keepalive functionality is not actually part of the TCP specification [Ref: TCP/IP Illustrated, Volume 1, second edition, Stevens & Fall, 2011]. As it is not part of the protocol specification, no flags are set in the TCP packets that are exchanged to start a connection. Therefore you cannot simply use a tool such as Wireshark to see whether a particular connection has the feature enabled at set-up (although the method I discuss later does, in fact, make use of Wireshark).

You could take it on trust that most major database products, including Oracle and SQL Server, have this enabled. However, if you are curious like me you will want a way to prove it. The method that I have for doing that is a bit clunky, and it also depends on you understanding what is involved in the fix, which is putting the cart before the horse. So I will detail my test as an appendix after I have explained the fix (see “Does your application support TCP Keep-alive?” near the end of this article).

The Fix

Of course, you could either extend the timeout in the firewall or turn the mechanism off. If you do extend the timeout, what do you set it to? Whatever value you choose, there may be some genuine connection that remains inactive for longer. As for turning the feature off, the Infosec group in many organisations is likely to object to that course of action.

The fix that I always go with is changing some OS parameters so that the network connection is kept active by the OS itself, even if the application is not doing anything. A caveat is that the TCP keepalive mechanism is not enabled by default: applications need to have enabled it on their network connections (sockets). Having said that, pretty much all the major software that I have come across enables TCP keepalive on its network connections.

What the fix does is get the TCP layer of the OS to exchange a probe with the other end of the connection after every “n” seconds of inactivity. Of course, we need to set “n” to a value that is lower than the firewall timeout. The exchange is purely TCP-level and is transparent to the application at either end of the connection. In fact, the host exercising keepalive sends a segment carrying no new data, with a sequence number one lower than the peer expects next (that is, for data the peer has already acknowledged), and the peer answers with a TCP “ACK” (acknowledgement). You can set this up at both ends of a connection (client and server). However, typically it is only set on the server end, and this is sufficient.

There are three parameters that control the TCP keepalive mechanism: the idle time before the first probe is sent, the interval between subsequent probes when no reply is received, and the number of unanswered probes after which the connection is declared dead. These are listed in the table below. Note that different operating systems have slightly different names for these parameters. The ones listed below are “generic”.
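The arithmetic behind choosing “n” can be sketched as follows. The function name `keepalive_plan` is hypothetical, the 90-minute firewall timeout is the figure quoted earlier, and the values 7200, 75 and 9 are the usual Linux defaults for idle time, probe interval and probe count (an assumption at this point in the article; check your own system). The “dead after” figure is approximate.

```python
def keepalive_plan(idle_s, interval_s, probes, firewall_timeout_s):
    """Return (keeps_firewall_open, approx_seconds_to_declare_peer_dead)."""
    # The first probe must go out before the firewall's idle timer expires.
    keeps_open = idle_s < firewall_timeout_s
    # A dead peer is detected after the idle time plus all unanswered probes.
    dead_after = idle_s + interval_s * probes
    return keeps_open, dead_after


FIREWALL_TIMEOUT_S = 90 * 60  # 5400 seconds

print(keepalive_plan(7200, 75, 9, FIREWALL_TIMEOUT_S))  # (False, 7875)
print(keepalive_plan(30, 75, 9, FIREWALL_TIMEOUT_S))    # (True, 705)
```

With the defaults, the first probe would be sent 30 minutes after the firewall had already dropped the connection, which is exactly the failure described earlier.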

Configuring RHEL/CentOS/OL Linux systems

If we look at the default settings on one of my Red Hat 7.3 servers, the three parameters (tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes) are as shown in the table below.

The current values of these can be displayed using the commands shown in the example below. This is done by listing the contents of pseudo-files under /proc/sys/net/ipv4 in the /proc file system.

Keepalive defaults RHEL 7
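The same values can also be read programmatically. The sketch below assumes the standard Linux location /proc/sys/net/ipv4 and the three standard file names; the function name `read_keepalive_settings` is hypothetical, and on a system without those files it simply returns an empty dictionary.

```python
from pathlib import Path

KEEPALIVE_PARAMS = (
    "tcp_keepalive_time",    # idle seconds before the first probe
    "tcp_keepalive_intvl",   # seconds between unanswered probes
    "tcp_keepalive_probes",  # unanswered probes before giving up
)


def read_keepalive_settings(base="/proc/sys/net/ipv4"):
    """Return the keepalive parameters found under `base` as a dict of ints."""
    settings = {}
    for name in KEEPALIVE_PARAMS:
        path = Path(base) / name
        if path.exists():  # absent on non-Linux systems
            settings[name] = int(path.read_text().strip())
    return settings


print(read_keepalive_settings())
```

Reading these files does not need root access; only changing them does.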

To change the values you can simply use the “echo” command and redirect its output to the appropriate /proc pseudo-file as shown below. (You need to be root or use sudo to do this.)

Set keepalive time to 30 seconds & verify

The method shown above makes the change in the running system and will affect all TCP connections made after the change. However, the settings will disappear on reboot. To make them persist you need to place the settings in the /etc/sysctl.conf file as shown below.
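Entries in /etc/sysctl.conf use the standard `name = value` syntax, with the /proc/sys path turned into a dotted name. As a sketch, the small helper below (the name `sysctl_conf_lines` is hypothetical, and the values 30, 75 and 9 are examples only) prints the three lines you would add.

```python
def sysctl_conf_lines(time_s, intvl_s, probes):
    """Render keepalive settings as lines suitable for /etc/sysctl.conf."""
    values = {
        "net.ipv4.tcp_keepalive_time": time_s,
        "net.ipv4.tcp_keepalive_intvl": intvl_s,
        "net.ipv4.tcp_keepalive_probes": probes,
    }
    return [f"{key} = {value}" for key, value in values.items()]


print("\n".join(sysctl_conf_lines(30, 75, 9)))
```

After editing the file, running `sysctl -p` as root loads the settings without a reboot.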

 

 

Configuring Windows systems

Windows, unlike Linux, does not expose the TCP keepalive parameters as files. As with many aspects of Windows, these settings are controlled by the Registry. The place where the keepalive keys sit is:

Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

And the value we are particularly interested in is KeepAliveTime. Note that Windows uses time units of milliseconds. However, by default that value is not created in the Registry. Rather, the Windows kernel uses the default of 2 hours (7,200,000 milliseconds).

KeepAliveTime not visible in registry by default

Other Operating systems

Pretty much every operating system or appliance supports these parameters. From personal experience I know you can change them on AIX, Solaris and HP-UX. I have also changed them on appliance boxes such as HSMs used for credit card payments. All the same principles as outlined above apply. Watch out for the time units any particular OS uses (Linux uses seconds whereas Windows uses milliseconds). Do a bit of googling to find the recipe for your OS.
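Because the units differ, it is easy to be out by a factor of 1000. A trivial sketch of the conversion is shown below; the table `UNITS_PER_SECOND` and the function `keepalive_time_for` are hypothetical names, and only the two operating systems whose units are stated above are covered.

```python
# Units each OS expects for its keepalive idle time, expressed per second.
UNITS_PER_SECOND = {"linux": 1, "windows": 1000}  # seconds vs milliseconds


def keepalive_time_for(os_name, seconds):
    """Express a keepalive idle time in the units the given OS expects."""
    return seconds * UNITS_PER_SECOND[os_name]


print(keepalive_time_for("linux", 7200))    # 7200
print(keepalive_time_for("windows", 7200))  # 7200000
print(keepalive_time_for("windows", 30))    # 30000
```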

Does your application support TCP Keep-alive?

As promised earlier in this article, I will now outline a way to test whether a given application supports the TCP keepalive mechanism. Remember we said previously that this is not easy, as many operating systems do not provide utilities to show what socket options a network connection has been set up with. Also, the keepalive mechanism is not part of the TCP protocol, so there are no packets going out at connection set-up with an “I am using Keep-alive” flag set. For this example I am going to use the Oracle SQLplus client on a Windows 10 PC connecting to an Oracle 12cR2 database running on an Oracle Linux 7.6 server. The key to the test method is to have some sort of client utility that can connect to the application server and then sit there doing nothing. Any interactive SQL client is a good candidate for most databases. The other component we will use at the client end of the connection is the Wireshark network traffic analyser, which is available free for many operating systems.

Step 1 Set up a Wireshark capture

The Oracle 12c database server is running on an Oracle Linux server at IP address 10.123.219.193. So we set up a capture on the Windows 10 box on which we are going to run the SQLplus client.

Set up the capture

Initially there is no traffic going to that database server from my Windows desktop so the Wireshark  capture window looks like this.

No network traffic going to database server

Now we run the SQLplus client on Windows (I had already configured tnsnames.ora which provides the IP address of the database server). I let it connect to the database server but I don’t do anything. I just let it sit there at the SQL> prompt. This will mean there is no network traffic between the server and the client once the initial connection has been established.

Fire up SQLplus on the client

The Wireshark trace of the initial connection looks like this.

Trace of initial SQL plus connection

After waiting 2 minutes (a length of time I picked arbitrarily) I type “exit” in the SQLplus client which will disconnect me from the database.

Exit from SQLplus on the client

We can see that between the time of initially establishing the connection (green box) and me disconnecting approximately 2 minutes (120 seconds) later no traffic has passed over the connection.

Trace of SQLplus disconnecting

Now we do exactly the same thing again, except that before we do, we change the timeout value on the Linux OS. The defaults are:

Keepalive defaults

We now change the tcp_keepalive_time parameter from the default 7200 seconds (2 hours) to just 30 seconds. So if the Oracle database software is using SO_KEEPALIVE on its network connections we should see some keepalive traffic after 30 seconds of inactivity.

Set keepalive time to 30 seconds & verify

Again I connect using the SQLplus client and do nothing. Now we can see that after 30 seconds the network layer sends a keepalive, which is acknowledged (the lines with a black background on the Wireshark trace).

Keep alive packets are sent after 30 seconds

So you will usually be able to use a similar technique to test if a software package uses keepalive. All you need is client software that is happy to sit there doing nothing once it has established a connection to the server.
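If no suitable client utility is to hand, a do-nothing client can be sketched in a few lines of Python. This is an assumption-laden illustration rather than a tested recipe: the function name `hold_idle_connection` is my own, and it only helps with servers that tolerate a client which connects and then stays silent (many database listeners expect a protocol handshake, in which case a real client such as SQLplus remains the better choice).

```python
import socket
import time


def hold_idle_connection(host, port, idle_seconds):
    """Connect to host:port, send nothing for idle_seconds, then disconnect."""
    with socket.create_connection((host, port)) as sock:
        started = time.monotonic()
        time.sleep(idle_seconds)  # deliberately generate no application traffic
        return sock.getpeername(), time.monotonic() - started
```

For example, `hold_idle_connection("10.123.219.193", 1521, 120)` would hold a connection to the database server used above for two minutes, assuming its listener is on port 1521 (the usual Oracle default, not something stated in this article). Any keepalive probes seen in Wireshark during that time come from whichever end has SO_KEEPALIVE enabled, which is the point of the test.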

 

Summary

This phenomenon is usually caused by a change in the logical and/or physical network topology in the new Data Centre that the application is being migrated to... but not always.

In the old Data Centre, your live applications may be very active: so active that there is never a quiet period long enough to trigger a firewall into killing what it deems to be an inactive TCP connection. In this case your live legacy set-up may be configured in a similar way to the new target Data Centre. Indeed, it may have had the same disconnect problems years ago before it went live. However, that has now all been forgotten, and the live systems appear not to have the disconnect problem.