Net8 Transparent Application Failover (TAF)
Transparent Application Failover (TAF) is a Net8 feature designed to enable applications running in an OPS environment to gracefully recover from an instance failure by failing over to another instance accessing the same database. While designed for use with OPS, TAF can be used in non-OPS environments as well.
Unlike connect-time failover, TAF comes into play after an application has connected to an instance. If the connection to the instance is lost while the application is running, Net8 will transparently reconnect the application to another instance accessing the same database. The word "transparent" is best thought of in terms of the application user. In order for an application to take advantage of TAF, it must use failover-aware API calls that are built into the Oracle Call Interface (OCI). There are also TAF-related callbacks that can be used to make an application failover-aware.
SQL*Plus was one of the first applications to support TAF. Since then, Oracle has been working to add TAF capabilities to the following products:
TAF failover types
TAF supports two different types of failover: SESSION and SELECT. SESSION is the simplest type. When the connection to an instance is lost, SESSION failover results only in the establishment of a new connection to a backup instance. Any work in progress is lost.
SELECT failover is a bit more complex and enables some types of read-only applications to fail over without losing any work. When SELECT failover is used, Net8 keeps track of any SELECT statements issued in the current transaction. Net8 also keeps track of how many rows have been fetched back to the client for each cursor associated with a SELECT statement. If the connection to the instance is lost, Net8 establishes a connection to a backup instance, reexecutes the SELECT statements, and positions the cursors so the client can continue fetching rows as if nothing had happened.
SELECT failover can be useful for reporting applications, but that's as sophisticated as TAF gets. There's no automatic recovery mechanism built into TAF to handle DML statements, such as INSERTs and UPDATEs, that are in progress when a failover occurs. TAF has other inherent limitations as well.
TAF failover methods
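TAF also supports two failover methods: BASIC and PRECONNECT. In both cases, you specify a net service name to use for the backup connection in case the primary connection fails. The difference between the two methods lies in when the connection to the backup instance is made. With BASIC, Net8 connects to the backup instance only when a failover actually occurs. With PRECONNECT, Net8 connects to the backup instance at the same time that it connects to the primary instance, which makes failover faster at the cost of holding an idle connection open on the backup instance.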
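As a rough illustration of where the method is chosen, the fragment below sketches a FAILOVER_MODE entry that requests PRECONNECT. The net service name bkup is a placeholder for whatever backup entry you define, and the choice of SESSION for the type is an assumption made only for this sketch.

```text
# Sketch only: "bkup" is a placeholder net service name.
(FAILOVER_MODE =
  (BACKUP = bkup)          # net service name used for the backup connection
  (TYPE = SESSION)         # SESSION or SELECT
  (METHOD = PRECONNECT)    # BASIC or PRECONNECT
)
```

Changing METHOD to BASIC would defer the backup connection until a failover occurs.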
As good as it sounds, TAF has a number of limitations. Regardless of the failover type or the failover method, the following will be true when a failover occurs:
In addition to what gets preserved and lost during a failover, there are some connectivity issues to worry about. If a node goes down, your application may not notice it, and TAF may not be triggered until your application attempts to execute another SQL statement. A hung listener might cause a client to hang during a connection attempt. In that case, the client will never get a chance to attempt a connection to the backup instance. If the primary instance is up, but in an indeterminate state (such as during a startup or shutdown), client connections will fail, but not in a way that causes TAF to be triggered. Using ALTER SYSTEM KILL SESSION to kill a client connection will also prevent TAF from being triggered.
The bottom line is that while Net8's TAF features represent a valuable piece of the puzzle when it comes to implementing robust applications that can fail over when necessary, you can't just slap TAF into place and expect all your applications to magically be capable of failover. As it stands now, TAF is most useful for read-only applications such as those for reporting or decision support.
Configuring TAF to connect to a backup instance
TAF is configured by adding a FAILOVER_MODE parameter to the CONNECT_DATA parameter for a net service name. TAF cannot be configured using Net8 Assistant. You have to manually edit tnsnames.ora or use the Oracle Names REGISTER command. If you are going to specify a backup instance, then you'll need two net service names: one to connect to the primary instance and one to connect to the backup instance.
Consider the situation shown in the next figure, where a client connects to an instance named ARK1. Under some circumstances, if the ARK1 instance fails, Net8 can automatically connect the client to the backup instance named DIA3.
Dynamically registering global database names with your listeners
One important issue to be aware of is that connect-time failover only works if you are dynamically registering global database names with your listeners. If you are statically configuring global database names, then connect-time failover will not work in a consistent manner:
If you want to use Net8's connect-time failover feature, you need to delete the GLOBAL_DBNAME parameter and allow the database to register itself with the listener automatically. You can list the database in your SID_LIST; you just can't include the GLOBAL_DBNAME parameter.
Listener.ora on ARK1
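A minimal sketch of what the static portion of such a file might look like on the ARK1 node, assuming a listener with the default name LISTENER. The ORACLE_HOME path is hypothetical. The point of the sketch is only that the SID_DESC entry carries no GLOBAL_DBNAME parameter, so the database is left to register its global name with the listener on its own.

```text
# Sketch only: the listener name and ORACLE_HOME path are assumptions.
SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      # No GLOBAL_DBNAME parameter here; the instance registers itself.
      (ORACLE_HOME = /u01/app/oracle/product/8.1.6)
      (SID_NAME = ARK1)
    )
  )
```

A listener on another node would presumably follow the same pattern with its own SID_NAME.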
Listener.ora on DIA3
Failover Configuration in Tnsnames.ora on Net8 Client
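The sketch below shows one way the two entries could be written, under stated assumptions: the host names, port number, and service names are invented for illustration, and the SELECT type and BASIC method are arbitrary choices. Only the net service names prod and bkup and the use of the BACKUP attribute come from the surrounding text.

```text
# Sketch only: hosts, port, and service names are hypothetical.
prod =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = ark1-host.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = ark.example.com)
      (FAILOVER_MODE =
        (BACKUP = bkup)      # net service name to use on failover
        (TYPE = SELECT)
        (METHOD = BASIC)
      )
    )
  )

# Backup entry: a different host and database, with no FAILOVER_MODE of its own.
bkup =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dia3-host.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = dia.example.com)
    )
  )
```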
The definition for prod contains a FAILOVER_MODE entry as part of its connection data. The BACKUP attribute for that entry specifies the net service name to which a connection should be made when a failover occurs. In this case, bkup represents the backup connection.
In this example, the backup connection is not to another OPS instance accessing the same database, but to an entirely different database. TAF can work either way, but if you fail over to a completely different database, you need some mechanism in place to keep it in sync with your primary database.
No FAILOVER_MODE entry has been placed in the definition for bkup. TAF doesn't support cascading failover. You can't fail over to a backup instance, and then fail over to yet another backup instance in the event that the first backup instance also fails.
Retries and delays
By default, when a TAF-initiated failover occurs, Net8 will make only one attempt to connect to the backup instance. Using the RETRIES and DELAY parameters, you can change that behavior so that Net8 makes multiple attempts to connect to the backup database. The FAILOVER_MODE specification in the following example calls for 20 retries at 30-second intervals:
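A minimal sketch of such an entry follows. The host name, port, service name, failover type, and failover method are assumptions made for illustration; the RETRIES and DELAY values and the prod and bkup names are the ones discussed in the text.

```text
# Sketch only: host, port, service name, TYPE, and METHOD are hypothetical.
prod =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = ark1-host.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = ark.example.com)
      (FAILOVER_MODE =
        (BACKUP = bkup)
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 20)       # number of connection attempts after a failure
        (DELAY = 30)         # seconds to wait between attempts
      )
    )
  )
```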
In this case, if the connection to prod is lost, Net8 will make 20 attempts over a period of 10 minutes (20 × 30 seconds = 10 minutes) to connect to the backup instance through the net service name bkup. This can be useful if you are using Oracle's standby database feature. If your primary database fails, it might take a few minutes for the standby to be brought up to date and opened. Using the RETRIES and DELAY attributes, you can accommodate that delay.
TAF status from V$SESSION
The V$SESSION view provides some information about TAF settings for the sessions currently connected to the instance. The information in V$SESSION applies only to TAF, and not to connect-time failover. The three columns to look at are FAILOVER_TYPE, FAILOVER_METHOD, and FAILED_OVER.
The following example shows a query against V$SESSION that displays the available TAF information:
SQL> SELECT username, sid, serial#,
            failover_type, failover_method, failed_over
     FROM v$session;

USERNAME SID SERIAL# FAILOVER_TYPE FAILOVER_M FAI
In this example, the session for the user SYSTEM has failed over to the backup database. Note the value YES in the FAILED_OVER column. Prior to failover, that column would contain the value NO.