High Availability - Operation and Recovery

Introduction

The behavior of the UCX components operating in a High Availability cluster configuration depends heavily on the network topology and on the capabilities of the individual devices (particularly third-party devices) connected to it. This document describes the various failure modes and the expected recovery times of the system components.

Failure Modes

Once a node in the High Availability cluster has been assigned the role of the Active node, both nodes monitor the network for communication with the other node. When the Standby node no longer detects that connection, it initiates an internal process to change roles and become the Active node. Once it becomes the Active node, it changes back to the Standby role only if it is instructed to do so through the High Availability Console interface.
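The role rules described above can be sketched as a small state model. This is only an illustration of the stated behavior, not the UCX implementation: the class and method names (Node, on_peer_status, on_console_switch_to_standby) are hypothetical.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


class Node:
    """Hypothetical model of one cluster node's role logic."""

    def __init__(self, name: str, role: Role) -> None:
        self.name = name
        self.role = role

    def on_peer_status(self, peer_reachable: bool) -> None:
        # A Standby node that loses contact with its peer promotes itself.
        # An Active node never demotes itself based on peer status alone.
        if self.role is Role.STANDBY and not peer_reachable:
            self.role = Role.ACTIVE

    def on_console_switch_to_standby(self) -> None:
        # Only an explicit instruction (modelled here as a Console action)
        # returns an Active node to the Standby role.
        self.role = Role.STANDBY


secondary = Node("secondary", Role.STANDBY)
secondary.on_peer_status(peer_reachable=True)   # peer visible: stays Standby
secondary.on_peer_status(peer_reachable=False)  # peer lost: becomes Active
secondary.on_peer_status(peer_reachable=True)   # peer returns: stays Active
print(secondary.name, secondary.role.value)
```

The point of the model is the asymmetry: promotion is automatic, while demotion requires an explicit instruction.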

The operation of the High Availability cluster is therefore dependent on three factors:

  1. Communication between the Active and Standby nodes
  2. Network Connectivity of the Standby node
  3. Network Connectivity of the Active node

Communication between the Active and Standby nodes

Losing communication between the Active and Standby nodes while both nodes are still operational can result in both nodes operating in Active mode. The resulting behavior is unpredictable because it depends heavily on the entire network topology. This possibility can be practically eliminated by connecting both nodes to the same Layer 2 (Ethernet) switch.
 

Note: To ensure predictable behavior of the High Availability cluster, E-MetroTel recommends connecting both the Active and Standby nodes to the same Layer 2 (Ethernet) switch at the same location.

Network Connectivity of the Standby node

The Standby node can lose connectivity to the Active node in a number of ways, such as losing power, being gracefully shut down or restarted through the UCx Web-based Configuration Utility, or having the network cable disconnected at the UCX or the switch. In all cases, the current operation of the High Availability cluster is not impacted, as the Active node continues to handle all system behaviors. Once the Active node has detected the loss of communication, the cluster changes the status of the Standby node on the Console page of the High Availability tab, highlighting the Secondary node in red, with no information shown for Status, Inter-node Link, or HA Resources.

It is also important to note that until the Standby node has recovered and is detected by the High Availability cluster, the Action buttons are not available, because the redundancy capabilities are temporarily unavailable. Once the Secondary node has recovered and is communicating with the High Availability cluster, the Console will reflect this as follows:
PrimaryActive.png

Network Connectivity of the Active node

The Active node can lose connectivity to the Standby node in a number of ways, such as losing power, being gracefully shut down or restarted through the UCx Web-based Configuration Utility, or having the network cable disconnected at the UCX or the switch. In all cases, the Standby node will detect the loss of communication and will initiate the process of changing its role to that of Active node. In that role, it will begin to communicate using the IP address of the High Availability cluster. The process to detect the loss and switch roles takes approximately 15 to 60 seconds. Once the role change has been completed, this is reflected in the Console page as follows:
SecondaryActive.png
As is the case above for loss of Standby node communication, it is important to note that until the former Active node (which was configured as Primary in this example) has recovered and is detected by the High Availability cluster, the Action buttons are not available, because the redundancy capabilities are temporarily unavailable. Once the former Active node has recovered and is communicating with the High Availability cluster, it will remain in the Standby role until it is forced to switch roles or until the currently Active system goes off-line. The Console will reflect this as follows:
PrimaryRecovery.png
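
The detect-and-switch window of approximately 15 to 60 seconds can be observed from a client's point of view by probing the cluster IP address during a planned failover test. The sketch below is a generic TCP reachability probe, not an E-MetroTel tool. The function names are hypothetical, and the address and port in the usage comment are placeholders to be replaced with the cluster IP address and a service port known to be open on it.

```python
import socket
import time
from typing import Callable, Optional


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def measure_outage(
    probe: Callable[[], bool],
    interval: float = 1.0,
    max_wait: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Wait for the probe to fail, then return how long it stays failed.

    Returns None if the observation window ends before an outage has
    both started and ended. The result is only as precise as `interval`.
    """
    start = clock()
    down_since = None
    while clock() - start < max_wait:
        ok = probe()
        now = clock()
        if not ok and down_since is None:
            down_since = now              # first failed probe: outage begins
        elif ok and down_since is not None:
            return now - down_since       # first successful probe after outage
        sleep(interval)
    return None


# Example usage (placeholder address and port, not taken from this document):
# outage = measure_outage(lambda: tcp_probe("192.0.2.10", 443))
# print("cluster IP unreachable for about", outage, "seconds")
```

The probe, clock, and sleep functions are passed in as parameters so the timing logic can be exercised without a live cluster.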

High Availability Cluster Recovery

Since all communication from peripheral devices and supporting equipment has been previously configured to use the cluster IP address, the system will be able to communicate with each network device without any user intervention, although different components will recover at different rates. These are described below.

E-MetroTel XStim and Nortel/Avaya Unistim IP Phones

Existing calls will be dropped. If the phones are on the same LAN infrastructure as the HA cluster, they should be fully operational for initiating and receiving new calls within 30 seconds of the new Active node becoming operational. If the phones are remotely connected, the reconnection process will be automatically initiated by the phones within 90 to 120 seconds. Manual intervention at the phone, using the "Retry Now" softkey when it is displayed, can shorten the recovery time.

Nortel/Avaya Digital and Analog Phones connected via an MGC

Existing calls will be dropped. The MGC will typically recover communication with the cluster within about 60 seconds. However, in some scenarios this process can take up to 11 minutes, because certain timeout parameters on the MGC must expire before it detects the loss and restarts communication. If access to the MGC is available, power cycling the MGC may shorten the overall recovery process.

InfinityOne Desktop Clients

Existing calls will be dropped.  The client application will also be logged out and the user will need to log back in.  Once logged in, the softphone will be immediately available for making and receiving calls.

SIP Phones

Existing calls will be dropped. The phones will be able to make new calls as soon as the Secondary node becomes Active. The phones will need to re-register as part of the standard SIP protocol process before being able to receive new incoming calls. The re-registration process is controlled by the SIP phone configuration, and happens on an interval specified by the phone. Some manufacturers have extremely long default values for this interval (as much as one or two hours).

Note: E-MetroTel recommends setting the re-registration interval to a value on the order of 60 to 120 seconds.
The method for setting this parameter can vary for each manufacturer and phone type. Please consult the phone's configuration documentation for specific instructions.
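
The effect of the re-registration interval on incoming calls can be illustrated with simple arithmetic. The sketch below rests on two assumptions that are not stated in this document: the phone re-registers only on its fixed interval (it does not notice the failover and re-register early), and the failover is equally likely to happen at any point within that interval. The function name is hypothetical.

```python
from typing import Tuple


def inbound_gap_seconds(reregister_interval_s: float) -> Tuple[float, float]:
    """Return (average, worst_case) seconds a phone may miss incoming calls.

    Assumes the phone re-registers only on its fixed interval and that
    the failover lands at a uniformly random point within that interval.
    """
    if reregister_interval_s <= 0:
        raise ValueError("interval must be positive")
    worst_case = reregister_interval_s      # failover just after a registration
    average = reregister_interval_s / 2     # mean of a uniform distribution
    return average, worst_case


# Compare the recommended range with long manufacturer defaults.
for interval in (60, 120, 3600, 7200):
    average, worst = inbound_gap_seconds(interval)
    print(f"interval {interval:>5} s: "
          f"average {average / 60:.1f} min, worst case {worst / 60:.1f} min")
```

Under these assumptions, a phone left at a two-hour default could miss incoming calls for up to two hours after a failover, while a 120 second interval bounds the gap at two minutes.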

SIP Trunks

Existing calls will be dropped. The trunks will recover based on the timeout of the Registration settings in the UCX and at the Trunk Provider. UCX Registration timers are set on the SIP Settings menu item of the PBX Configuration page of the PBX tab in the UCx Web-based Configuration Utility. The default Registration Timer expiry on the UCX is 120 seconds.

Hospitality and HOBIC interfaces

The protocols used for these interfaces have their own mechanism for automatic reconnection after a temporary loss of communication. Since the protocols also include an acknowledgement mechanism, any messages sent during the loss of communication will be resent after communication is restored. The UCX High Availability cluster must first enable the Hospitality service as part of the role change for the soon-to-be Active node. This process can take between 60 and 120 seconds, but the end-to-end recovery time will also depend on the reconnect timer settings on the PMS platform.
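
The general idea of acknowledge-and-resend can be sketched as follows. This is a generic illustration only: it does not model the actual Hospitality or HOBIC wire protocols, and the class name, the transmit callback, and the message strings are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List


class AckedSender:
    """Hypothetical sender that keeps each message until it is acknowledged."""

    def __init__(self, transmit: Callable[[str], bool]) -> None:
        # transmit(msg) returns True only if the far end acknowledged msg.
        self._transmit = transmit
        self._pending: Deque[str] = deque()

    def send(self, msg: str) -> None:
        self._pending.append(msg)
        self.flush()

    def flush(self) -> None:
        # Deliver in order and stop at the first unacknowledged message,
        # so nothing is dropped or reordered while the link is down.
        while self._pending:
            if not self._transmit(self._pending[0]):
                break
            self._pending.popleft()

    @property
    def pending(self) -> List[str]:
        return list(self._pending)


delivered: List[str] = []
link = {"up": True}


def transmit(msg: str) -> bool:
    """Simulated link: acknowledges only while it is up."""
    if not link["up"]:
        return False
    delivered.append(msg)
    return True


sender = AckedSender(transmit)
sender.send("check-in room 101")
link["up"] = False                  # simulated loss of communication
sender.send("check-out room 202")
sender.send("wake-up room 303")
link["up"] = True                   # communication restored
sender.flush()                      # queued messages are resent in order
print(delivered)
```

Because unacknowledged messages stay queued, the outage delays delivery but does not lose messages, which matches the behavior described above.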
 
