Solution Brief

Stratus everRun® SplitSite®

Metro-wide availability protection

Disasters, whether caused by nature or human error, can result in the total loss of a physical data center, potentially leaving your business unable to function for days or even weeks. In regulated industries, a site-wide problem can lead to data loss that risks compliance, adding significantly to your downtime costs. That’s why businesses in regulated industries such as pharmaceuticals, manufacturing, and financial services use everRun SplitSite protection to ensure that all data is safely replicated and remains available at all times. Although many organizations continue to put off implementing a recovery solution for fear of high costs and resource demands, there’s no need to endure the risk any longer.

everRun SplitSite extends the protection of your business from localized power failures and building-wide problems to physical machines located in different buildings or data centers. With everRun SplitSite, if disaster strikes in one location, applications and data are immediately available, up to date, and fully operational at the other location, with no need for IT staff at the second location. A SplitSite configuration connects two physical machines (PMs) in two geographically separated sites and provides application availability using synchronous replication. everRun’s SplitSite capability lets customers run their applications efficiently, albeit with higher latencies, when everRun servers are separated geographically or by network switches. Both high availability (HA) and fault tolerant (FT) protection levels can be used, with no change in features or availability. As in a single-site configuration, everRun automatically detects disk and network failures and configures around them. And for virtual machines (VMs) with FT protection, everRun keeps VMs running with no downtime, even through a PM or site failure. When a failed site or PM is returned to service, everRun automatically resynchronizes the disk drives and VM memory.

everRun’s SplitSite supports disaster-tolerant deployments that maintain hardware redundancy, as well as redundancy of physical computer rooms and the buildings containing them. By supporting geographical separation, this powerful disaster-tolerant solution further safeguards your business from major downtime due to potentially catastrophic events such as flooding and power outages. everRun SplitSite eliminates the cost and chaos associated with typical reactive recovery products. Stratus often sees SplitSite used in larger campus or metropolitan settings as a real-time alternative to multi-site disaster recovery.
SplitSite Requirements and Licensing

There is no universal distance limitation for SplitSite, as a number of factors come into play. Any intervening network switches add latency and increase the possibility of losing the connection between the nodes, resulting in “split brain”: a situation where neither server can verify that the other is still running, so two copies of the same VM end up running independently. For all SplitSite configurations, Stratus requires that you also use the quorum service, because a SplitSite configuration increases the likelihood of additional split-brain failure scenarios.

SplitSite configurations are subject to maximum latency specifications: no more than 10 ms round-trip A-Link latency for HA VMs, and no more than 2 ms round-trip A-Link latency for FT VMs. Separation of up to 10 km (using 1 Gbps fiber) is a common A-Link network topology that can meet these latency requirements. Even within these specifications, performance depends on the specific application.
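
As a quick illustration of how these figures combine, the sketch below is a minimal latency-budget check, not a Stratus tool. It applies the rule of thumb from the footnote at the end of this brief (roughly 1 ms per 100 miles of fiber), treats that figure as the round-trip fiber contribution (an assumption), adds an assumed per-switch latency that you would replace with measured values, and compares the total against the 2 ms (FT) and 10 ms (HA) round-trip limits.

```python
# Minimal latency-budget sketch for a SplitSite A-Link (illustrative only,
# not a Stratus tool). Treats the footnote's rule of thumb (~1 ms per
# 100 miles of fiber) as the round-trip fiber contribution; the per-switch
# figure is an assumption to be replaced with measured values.

KM_PER_MILE = 1.609

def alink_round_trip_ms(fiber_km: float, switch_count: int,
                        per_switch_ms: float = 0.05) -> float:
    """Estimate round-trip A-Link latency for a fiber run plus switches."""
    fiber_ms = (fiber_km / KM_PER_MILE) / 100.0   # footnote rule of thumb
    return fiber_ms + switch_count * per_switch_ms

def verdict(rtt_ms: float) -> str:
    if rtt_ms <= 2.0:
        return "within the FT (2 ms) and HA (10 ms) limits"
    if rtt_ms <= 10.0:
        return "within the HA (10 ms) limit only"
    return "exceeds both limits"

for km in (10, 50, 100):
    rtt = alink_round_trip_ms(fiber_km=km, switch_count=2)
    print(f"{km:>3} km, 2 switches: ~{rtt:.2f} ms round trip -> {verdict(rtt)}")
```

Even at 50 km, the fiber run itself contributes only a fraction of the FT budget; in practice, intervening switches and converters, and the behavior of the application itself, determine whether the budget holds.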

Only a license, use of quorum, and compliance with the latency requirements are necessary for support from Stratus; otherwise, any network equipment and topology are accommodated. In a typical mainstream network, a good distance between the servers is 5 km to 10 km. However, Stratus does have customers successfully using SplitSite today with PMs 50 km or more apart.

A SplitSite configuration requires careful planning of component placement to minimize or eliminate failures that would require VMs to be shut down. Specific training or professional services assistance is very likely necessary to deploy SplitSite correctly. A customer who uses SplitSite is obligated to purchase a license; however, Stratus does not enforce SplitSite through feature activation. A SplitSite license is required to receive technical support on a SplitSite configuration. Stratus has chosen 10 m of physical separation as a reasonable demarcation point at which to require SplitSite licensing.

SplitSite and Quorum Servers

Use of quorum is required for SplitSite configurations, both to protect against data loss (due to split-brain) and to safely enable VMs to start up automatically if a second everRun PM or site has failed. In a SplitSite configuration, you will use at least one, and optimally two, quorum servers. These servers protect against network failures that might cause the two everRun nodes to lose communication with each other and operate in a split-brain state. Quorum availability is improved, and mandatory VM shutdown scenarios are minimized, if quorum is placed at a third location and an appropriate quorum networking design is implemented.

If no quorum servers are configured, a network failure could cause the two everRun servers to lose all communication with each other. In the same situation with quorum servers configured, the VM instances running redundantly on the two nodes ask the quorum server for the status of their peer and take the appropriate action based on the response. If the quorum server fails to respond, an isolated VM shuts itself down; as long as the peer VM on the other server remains in contact with the quorum server, it continues to run. Both VM instances agree on which quorum server is being used (elected) in advance of any failure. If the preferred quorum server fails, the nodes agree to elect the alternate quorum server until the preferred one returns to service. While actively managing a failure, the nodes do not switch quorum servers.
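
The behavior above amounts to a small decision rule, sketched below as an illustrative model only; the function names are hypothetical and do not reflect the everRun implementation or API. The sketch also collapses the "ask quorum about the peer" exchange into a simple reachability check, which is an assumption made for brevity.

```python
# Illustrative model of the quorum arbitration described above (hypothetical
# function names; not the everRun implementation or API).

def elect_quorum(preferred_reachable: bool, alternate_reachable: bool):
    """Both nodes agree on an elected quorum server before any failure occurs."""
    if preferred_reachable:
        return "preferred"
    if alternate_reachable:
        return "alternate"
    return None

def vm_action_on_peer_loss(elected_quorum_reachable: bool) -> str:
    """What a VM instance does when it loses contact with its peer node."""
    if not elected_quorum_reachable:
        # Isolated instance: it cannot prove the peer is down, so it shuts
        # itself down rather than risk two divergent copies (split-brain).
        return "shut down"
    # The instance still in contact with the elected quorum server keeps
    # running; its isolated peer will have shut itself down.
    return "continue running"

# Example: the preferred quorum server is down, so the alternate is elected,
# and a network fault then isolates node B from its peer and from quorum.
print("elected:", elect_quorum(preferred_reachable=False, alternate_reachable=True))
print("node A:", vm_action_on_peer_loss(elected_quorum_reachable=True))
print("node B:", vm_action_on_peer_loss(elected_quorum_reachable=False))
```

This is also why the placement guidance below pushes quorum to a third site: if the elected quorum server fails together with a PM, the surviving instance has no arbiter and must shut itself down.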

Quorum servers are particularly important in SplitSite configurations. Best practice for SplitSite is to place a preferred quorum server in a third facility and an alternate quorum server in a fourth facility. However, you can also co-locate the alternate quorum server with the preferred quorum server and still obtain satisfactory service. Quorum servers ensure the integrity of VMs against split-brain and provide for unattended startup of VMs after specific failures. Quorum server communication occurs over the management network.

Quorum servers don’t require dedicated hardware, nor do they have any specific network latency requirements. The quorum service runs as a Windows service that can be installed on almost any Windows workstation or server used for other purposes, as long as the computer is left running 24 hours a day. However, never run the quorum service in a VM on the same everRun system that uses it.

More on Quorum Servers

A quorum service is a Windows operating system-based service deployed on a Windows machine distinct from the everRun system. Quorum servers provide data integrity assurances and automatic restart capabilities for specific failures in an everRun environment. You can configure an everRun PM pair with 0, 1, or 2 quorum servers. Stratus strongly recommends configuring two quorum servers, a preferred quorum server and an alternate, especially for SplitSite operation. If only two sites are available, quorum can be placed at one of the sites without risk of split-brain. However, if one PM goes down and the surviving PM is unable to communicate with the quorum server (for example, because the quorum server is at the same site as the down PM and is therefore unreachable), the VMs at the surviving site are automatically shut down to avoid a possible split-brain scenario.

In a SplitSite configuration, best practices for quorum deployment include:

  • A preferred quorum server located in a third facility, with an alternate located in a fourth site (or carefully placed in the third)
  • Quorum servers should be as isolated as possible. If both must be placed in a common (third) site, make sure that they do not depend on common power sources or network switches
  • Physical connectivity between an everRun PM and the quorum servers should not route through the other PM’s site
  • Placing a quorum server in the same site as one of the everRun PMs ensures data integrity. However, failure of that site requires that the VMs be shut down (to guard against split-brain) until manually recovered
  • The management network physically connects the PMs and the quorum servers. Configure each everRun PM to use a different gateway to reach the quorum servers for best availability of VMs. If the two PMs use the same gateway to reach the quorum servers, some site failures will cause the gateway to fail and require the VMs to shut themselves down until manually recovered

Quorum Server Considerations

  • Quorum service software can be installed on any general-purpose computer or laptop running Windows Server 2016, Server 2012, Server 2008, Windows 10, or Windows 7 that is always powered on, has at least 100 MB of free disk space, and has a network interface card with connectivity to the everRun configuration via the management network
  • Quorum servers should not reside on the same site as either PM when deploying a SplitSite configuration. If both the preferred and alternate quorum servers fail for a common reason, VMs gracefully downgrade redundancy and then continue to operate using one PM, pending recovery of a quorum server. If a PM and the elected quorum server fail for a common reason, the VM instances running on the surviving PM must shut themselves down
  • If the preferred and alternate quorum servers must reside on a common site, power them from separate AC power sources (phases) or configure them on separate UPS devices, and minimize common networking required for the everRun system to access them

A-Link Network Requirements

  • NICs must be at least 1 Gb and full-duplex; use 10 Gb if possible
  • Switches and/or fiber-to-copper converters connected to the private network must be non-routed, non-blocking and support IPv6
  • For systems running FT-protected VMs, A-Links require (see the sizing sketch after this list):
    • A minimum bandwidth of 1 Gbps per VM
    • A maximum inter-site latency* of 2 ms, round-trip time
  • For systems running only HA-protected VMs, A-Links require:
    • A minimum bandwidth of 155 Mbps per VM
    • A maximum inter-site latency* of 10 ms, round-trip time
  • Do not use a common card (multiport NIC) for multiple A-Links
  • A-Links can be dedicated point-to-point fiber connections or can run on a VLAN. VLANs used to connect the A-Link ports must not filter any communications between the two everRun nodes
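
As a rough illustration of the per-VM bandwidth figures above, the sketch below simply sums them for a mixed workload. The helper is hypothetical, not a Stratus sizing tool; it ignores headroom, resynchronization traffic, and how the load is spread across multiple A-Links.

```python
# Rough A-Link bandwidth sizing from the per-VM figures above
# (1 Gbps per FT-protected VM, 155 Mbps per HA-protected VM).
# Hypothetical helper, not a Stratus sizing tool; no headroom included.

def required_alink_bandwidth_mbps(ft_vms: int, ha_vms: int) -> int:
    return ft_vms * 1000 + ha_vms * 155

total = required_alink_bandwidth_mbps(ft_vms=2, ha_vms=4)
print(f"2 FT VMs + 4 HA VMs: {total} Mbps (~{total / 1000:.1f} Gbps)")
```

Aggregates like this are one reason the guidance above suggests 10 Gb NICs where possible.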

Private Network Requirements

  • NICs must be at least 1 Gb and full-duplex
  • The private network must not be shared with an A-Link when deploying a SplitSite configuration
  • The private network can be a dedicated point-to-point fiber connection. If it is not, it must be configured on a private VLAN. VLANs used to connect the private-network ports must support IPv6 and not filter any communications between the two everRun nodes

Business Network Requirements

  • An everRun system requires at least one business network. Configure the business network for both nodes on the same VLAN
  • The nodes must be in the same layer-2 multicast domain
  • Connect the business networks on each node to a switch separate from the other node’s switch. VLANs used to connect the business network ports must support IPv6 and not filter any communications between the two everRun nodes

Management Network Requirements

  • By default, the management network is shared with a business network. If not shared, all requirements for a business network continue to apply
  • Configure gateways to a business LAN for remote management

* Calculate latency at 1 ms for each 100 miles of fiber, plus any latency added by non-routed, non-blocking switches or fiber converters.
