Enabling High Availability for CloudStack Virtual Routers

CloudStack Virtual Router redundancy is the primary mechanism for ensuring high availability within isolated networks and Virtual Private Cloud (VPC) network tiers. In a standard deployment, the Virtual Router (VR) is the single gateway for all ingress and egress traffic and provides critical services such as DHCP, DNS, NAT, and Site-to-Site VPN. Without redundancy, the failure of a single VR instance cuts off connectivity for every guest instance in that tier. This manual covers the architecture, deployment, and auditing of CloudStack Virtual Router redundancy, which uses the Virtual Router Redundancy Protocol (VRRP) to keep the gateway available. By deploying a Master-Backup pair, administrators can mitigate the risks of hardware failure, hypervisor crashes, and kernel panics. Firewall and NAT rules are kept synchronized between the two routers, which minimizes packet loss and shortens gateway failover during critical infrastructure events.

TECHNICAL SPECIFICATIONS

THE CONFIGURATION PROTOCOL

Environment Prerequisites:

Successful implementation of CloudStack Virtual Router redundancy requires specific infrastructure conditions. First, the CloudStack Management Server must be running version 4.11 or higher to support enhanced HA features for VPCs. The underlying hypervisors (KVM, XenServer, or VMware ESXi) must have hardware virtualization enabled and must be part of a cluster with shared storage to allow rapid instance recovery if necessary. On the network side, the physical switches must pass IP protocol 112 (VRRP) traffic on the guest network VLANs. If a firewall sits between hypervisors, it must permit the VRRP advertisements. Users need Root or Domain Admin privileges in the CloudStack UI or API to modify Network Offerings and VPC settings. Finally, keep NTP synchronized across all hosts so that log timestamps from both routers and their hypervisors can be correlated when diagnosing failover or split-brain events.

Section A: Implementation Logic:

The design of CloudStack Virtual Router redundancy relies on a dual-instance deployment. When a redundant network offering is selected, CloudStack creates two identical Virtual Routers: one designated as Master and the other as Backup. The failover logic is handled by the keepalived service running inside the Debian-based VR system VM. The Master VR periodically sends VRRP advertisements to the Backup VR over the guest network interface; these act as a heartbeat. If the Backup does not receive an advertisement within the defined threshold (usually about three seconds), it assumes the Master has failed and promotes itself to the Master role by taking over the Virtual IP (VIP) addresses for all gateways. The promotion is automatic: the IP stack is reconfigured without manual intervention. The goal is to keep overhead low while completing the transition before upper-layer TCP connections time out.
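The "about three seconds" threshold can be reproduced from the VRRPv2 timer definitions in RFC 3768, where a Backup waits for three advertisement intervals plus a priority-dependent skew. The following is a minimal sketch of that arithmetic; the advertisement interval of 1 second and the priorities used in the example are illustrative assumptions, not values read from a CloudStack router.

```python
def skew_time(priority: int) -> float:
    """Skew time in seconds as defined by RFC 3768: (256 - priority) / 256."""
    if not 1 <= priority <= 254:
        raise ValueError("a Backup router's priority must be between 1 and 254")
    return (256 - priority) / 256


def master_down_interval(advert_int: float, priority: int) -> float:
    """Seconds a Backup waits without advertisements before promoting itself."""
    return 3 * advert_int + skew_time(priority)


if __name__ == "__main__":
    # Assumed values: 1-second advertisements, two example priorities.
    for priority in (100, 50):
        print(priority, master_down_interval(1.0, priority))
```

A higher-priority Backup has a smaller skew and therefore takes over slightly sooner than a lower-priority one.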

Step-By-Step Execution

1. Modify the Network Offering

Navigate to Service Offerings and select Network Offerings. Create a new offering or edit an existing one to enable the Redundant router capability checkbox.
System Note: This change instructs the CloudStack orchestration engine to append the redundant_router=true flag to the deployment XML. This directly affects the cloud-set-guest-network script on the VR, which initializes the keepalived.conf file during the first boot. Use cmk (CloudMonkey, the CloudStack CLI) to verify the flag via the API: list networkofferings id=[UUID].
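If the API response is saved as JSON, the check can be scripted. The sketch below assumes the response lists each service with a "capability" array and that redundancy appears as a capability named "RedundantRouter"; that shape, the sample data, and the helper name is_redundant are assumptions for illustration and should be compared against the actual output of your CloudStack version.

```python
import json

# Hypothetical sample shaped like a listNetworkOfferings response (assumption).
SAMPLE = """
{"networkoffering": [{"id": "hypothetical-uuid", "name": "RedundantOffering",
  "service": [
    {"name": "Dhcp", "capability": []},
    {"name": "SourceNat", "capability": [
      {"name": "RedundantRouter", "value": "true"}]}]}]}
"""


def is_redundant(offering: dict) -> bool:
    """Return True if any service advertises RedundantRouter=true."""
    for service in offering.get("service", []):
        for capability in service.get("capability", []):
            if capability.get("name") == "RedundantRouter":
                return str(capability.get("value", "")).lower() == "true"
    return False


if __name__ == "__main__":
    offering = json.loads(SAMPLE)["networkoffering"][0]
    print(offering["name"], is_redundant(offering))
```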

2. Configure Global Settings for Heartbeat Intervals

Search for the global setting router.redundancy.check.interval and set it to the desired frequency in seconds.
System Note: This value controls how often the management server polls the VR status. Increasing the interval reduces overhead on the management server but may lengthen the time before a “Fault” state is reported in the UI. It does not change the VRRP advertisement interval itself, only how quickly a failure becomes visible at the administrative level.

3. Deploy the Redundant VPC or Isolated Network

Create a new VPC using the redundant-enabled offering. Add tiers as required.
System Note: Upon creation, CloudStack will spin up two Virtual Routers (e.g., r-4-VM and r-5-VM). The management server uses the systemctl start cloud-vrasthosts service to manage the lifecycle. You can monitor the deployment progress by tailing /var/log/cloudstack/management/management-server.log.
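When the management log is busy, it can help to filter it down to lines that mention a router instance. The sketch below matches names of the form r-4-VM, as used above; the sample log lines are invented for illustration and do not reproduce real management-server output.

```python
import re

# Matches router instance names such as r-4-VM or r-5-VM.
ROUTER_NAME = re.compile(r"\br-\d+-VM\b")


def router_events(lines):
    """Yield (router_name, line) for each log line that mentions a router."""
    for line in lines:
        match = ROUTER_NAME.search(line)
        if match:
            yield match.group(0), line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical lines; in practice iterate over the open log file instead.
    sample = [
        "INFO  starting deployment of r-4-VM",
        "DEBUG unrelated message",
        "INFO  starting deployment of r-5-VM",
    ]
    for name, line in router_events(sample):
        print(name, "|", line)
```

To run it against the real log, pass an open file handle for /var/log/cloudstack/management/management-server.log in place of the sample list.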

4. Verify Router Status via Console

Access the shell of the Master VR and execute ip addr show.
System Note: The Master VR displays the guest gateway IPs on its interfaces. The Backup VR has the same interfaces but without the gateway IP addresses assigned (the interfaces may appear in a “tentative” or “down” state). Run tcpdump -i eth2 proto 112 to watch the VRRP advertisements on the guest interface. Seeing them confirms that the heartbeat is traversing the virtual switch.
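The same check can be automated by parsing the JSON form of the command (ip -j addr show), assuming the iproute2 build on the router supports the -j flag. In the sketch below, the interface name eth2, the gateway address 10.1.1.1, and both sample outputs are hypothetical.

```python
import json
from typing import List

# Hypothetical, trimmed `ip -j addr show` output from each router.
MASTER_SAMPLE = '[{"ifname": "eth2", "addr_info": [{"family": "inet", "local": "10.1.1.1", "prefixlen": 24}]}]'
BACKUP_SAMPLE = '[{"ifname": "eth2", "addr_info": []}]'


def interface_addresses(ip_json: str, ifname: str) -> List[str]:
    """Return the IPv4 addresses bound to ifname."""
    for iface in json.loads(ip_json):
        if iface.get("ifname") == ifname:
            return [addr["local"] for addr in iface.get("addr_info", [])
                    if addr.get("family") == "inet"]
    return []


def holds_gateway(ip_json: str, ifname: str, gateway_ip: str) -> bool:
    """True if the router currently owns the gateway IP on ifname."""
    return gateway_ip in interface_addresses(ip_json, ifname)


if __name__ == "__main__":
    print("master:", holds_gateway(MASTER_SAMPLE, "eth2", "10.1.1.1"))
    print("backup:", holds_gateway(BACKUP_SAMPLE, "eth2", "10.1.1.1"))
```

In a healthy pair exactly one router should report that it holds the gateway address.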

5. Validate State Synchronization

Check the sync status of NAT rules by running cat /etc/cloudstack/redundant_router/conf/rules.json on both routers.
System Note: CloudStack ensures the two routers share the same firewall and NAT rules. This synchronization is handled by the CsRedundant Python modules located in /opt/cloud/bin/. A sync script runs every minute and is idempotent, so any change made to the Master via the API is reflected on the Backup without side effects from repeated runs.
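Comparing the two files by eye is error-prone once the rule set grows. The sketch below reports which top-level keys differ between two copies of the file; it assumes only that the file is valid JSON, and the sample rule contents are invented for illustration.

```python
import json


def rules_differences(master_text: str, backup_text: str) -> dict:
    """Map each differing top-level key to its (master, backup) values."""
    master = json.loads(master_text)
    backup = json.loads(backup_text)
    if not (isinstance(master, dict) and isinstance(backup, dict)):
        # Not JSON objects: fall back to comparing the whole documents.
        return {} if master == backup else {"<root>": (master, backup)}
    diffs = {}
    for key in sorted(set(master) | set(backup)):
        if master.get(key) != backup.get(key):
            diffs[key] = (master.get(key), backup.get(key))
    return diffs


if __name__ == "__main__":
    # Hypothetical contents copied from each router.
    master = '{"nat": ["rule-a", "rule-b"], "firewall": ["allow-22"]}'
    backup = '{"nat": ["rule-a"], "firewall": ["allow-22"]}'
    print(rules_differences(master, backup))
```

An empty result means the two routers hold the same rules at the time the files were copied.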

Section B: Dependency Fault-Lines:

The most frequent point of failure in VR redundancy is the “Split-Brain” scenario. This occurs when the Backup VR stops receiving advertisements from the Master while the Master is still functional. Both routers then claim the Master role and attempt to bind the same VIPs, causing heavy packet loss and MAC address flapping on the physical switches. It is often caused by MTU mismatches, where the encapsulation overhead of a VXLAN or GRE tunnel pushes frames beyond the physical MTU and traffic is dropped. Another bottleneck is resource contention on the hypervisor: if the Backup VR is starved of CPU or suffers high storage latency, the keepalived process may lag, miss heartbeats, and trigger a false failover.
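The MTU condition is simple arithmetic and worth checking before deployment. The sketch below uses the common 50-byte figure for VXLAN over IPv4 (inner Ethernet 14 + VXLAN 8 + UDP 8 + outer IPv4 20); GRE overhead differs and depends on the options in use, so pass the correct value for your tunnel type. The MTU values in the example are assumptions.

```python
# VXLAN over IPv4: inner Ethernet 14 + VXLAN 8 + UDP 8 + outer IPv4 20 bytes.
VXLAN_OVERHEAD = 50


def required_underlay_mtu(guest_mtu: int, overhead: int = VXLAN_OVERHEAD) -> int:
    """Smallest physical MTU that carries a full-size guest packet."""
    return guest_mtu + overhead


def fits(guest_mtu: int, physical_mtu: int, overhead: int = VXLAN_OVERHEAD) -> bool:
    """True if encapsulated guest packets fit without being dropped."""
    return required_underlay_mtu(guest_mtu, overhead) <= physical_mtu


if __name__ == "__main__":
    # Assumed example values.
    print(fits(1500, 1500))   # guest MTU equal to physical MTU
    print(fits(1450, 1500))   # guest MTU lowered to leave room for VXLAN
```

Either raise the physical MTU or lower the guest MTU until the check passes.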

THE TROUBLESHOOTING MATRIX

Section C: Logs & Debugging:

When failover does not occur, or the routers keep switching roles, the primary log to inspect is /var/log/cloud.log on the Virtual Router instance. Look for the string Keepalived_vrrp[PID]: VRRP_Instance(VI_1) Entering MASTER state. If this string appears repeatedly every few minutes, the routers are flapping.

1. Path-Specific Debugging: Inspect /var/log/keepalived.log for protocol errors. If you see “Permission denied” errors, check the ownership and permissions of /etc/keepalived/keepalived.conf.
2. Physical Link Faults: If the hypervisor reports “vnet_link_down”, suspect a disconnected or faulty physical NIC (pNIC), cable, or switch port. Check the link state of the pNIC (for example with ethtool) and review the sensors output on the physical host to rule out hardware failure.
3. Network Latency Check: Run ping -I eth2 [Backup_VR_IP] from the Master. If round-trip latency exceeds 100 ms or packets are lost, VRRP advertisements may arrive late or not at all, leading to instability.
4. Management Server Analysis: Check the CloudStack Management Server log for “Unable to start Virtual Router” errors. These usually indicate resource exhaustion on the hypervisor, which cannot satisfy the CPU/RAM requirements of the second VR instance.
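Flapping can be detected mechanically by counting “Entering MASTER state” messages within a time window. The sketch below assumes each log line starts with a 15-character syslog timestamp such as "Mar  3 10:15:02"; the 10-minute window and the threshold of three transitions are arbitrary illustrative choices, and the sample lines are invented.

```python
from datetime import datetime, timedelta

MARKER = "Entering MASTER state"


def master_transitions(lines):
    """Return timestamps of every promotion to MASTER found in the log."""
    times = []
    for line in lines:
        if MARKER in line:
            # Syslog stamps carry no year; prefix a fixed leap year to parse.
            stamp = "2000 " + line[:15]
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def is_flapping(times, window=timedelta(minutes=10), threshold=3):
    """True if `threshold` or more promotions fall inside any one window."""
    times = sorted(times)
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            return True
    return False


if __name__ == "__main__":
    sample = [
        "Mar  3 10:15:02 r-4-VM Keepalived_vrrp[1234]: VRRP_Instance(VI_1) Entering MASTER state",
        "Mar  3 10:18:40 r-4-VM Keepalived_vrrp[1234]: VRRP_Instance(VI_1) Entering MASTER state",
        "Mar  3 10:22:10 r-4-VM Keepalived_vrrp[1234]: VRRP_Instance(VI_1) Entering MASTER state",
    ]
    print(is_flapping(master_transitions(sample)))
```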

OPTIMIZATION & HARDENING

For performance tuning, adjust the vrrp_preempt setting within the configuration templates (in keepalived itself, preemption is disabled with the nopreempt keyword). Disabling preemption prevents a recovered Master from immediately reclaiming its role, which avoids a second network “hiccup” once the original failure is resolved. For high concurrency and throughput, give the VR at least 2 vCPUs if the tier handles more than 500 Mbps of traffic; this keeps the keepalived process from being starved of CPU cycles during heavy NAT translation loads.

Security hardening is critical. Restrict VRRP traffic to the guest network only, and make sure the firewall prevents external hosts from injecting spoofed VRRP advertisements. Note that VRRP is its own IP protocol (112), not ICMP or UDP, so rules must match on the protocol number. The command iptables -A INPUT -p 112 -i eth2 -j ACCEPT explicitly allows heartbeats on the guest interface; on its own it does not block anything, so pair it with matching DROP rules for protocol 112 on the public (eth1) and management (eth0) interfaces.
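The accept-then-drop pairing can be generated consistently for any interface layout. The sketch below only builds the command strings and does not execute them; the helper name vrrp_rules is hypothetical, and because CloudStack manages the router's firewall itself, manually added rules may not persist across a router reconfiguration.

```python
def vrrp_rules(guest_iface, blocked_ifaces):
    """Build iptables commands: accept VRRP on the guest side, drop elsewhere."""
    rules = [f"iptables -A INPUT -p 112 -i {guest_iface} -j ACCEPT"]
    for iface in blocked_ifaces:
        rules.append(f"iptables -A INPUT -p 112 -i {iface} -j DROP")
    return rules


if __name__ == "__main__":
    # Interface roles as described above: eth2 guest, eth1 public, eth0 management.
    for rule in vrrp_rules("eth2", ["eth1", "eth0"]):
        print(rule)
```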

For scaling, note that the overhead of managing multiple VRRP instances grows with the number of tiers in a VPC. In large-scale cloud environments, it is recommended to use “Redundant VPC Routers” rather than many redundant “Isolated Networks”, which consolidates management traffic and reduces the number of system VMs the hypervisor must sustain.

THE ADMIN DESK

Q: Why are both Virtual Routers showing as Master in the UI?
This indicates a communication break between the routers: each one has stopped receiving the other's advertisements. Check that the guest network VLAN is trunked correctly to both hypervisors. Then run iptables -L on each VR to confirm that its firewall is not blocking IP protocol 112.

Q: Does failover disconnect existing VPN sessions?
Yes. While NAT and firewall rules are synchronized, IPsec tunnels and VPN sessions typically have to be renegotiated after failover. How long re-establishment takes depends mainly on the remote gateway’s timeout settings and on how quickly the handshake completes.

Q: Can I upgrade a non-redundant network to a redundant one?
Yes, by changing the Network Offering of the existing network to a redundant one. CloudStack will then deploy a second router. Expect a brief period of packet loss during the final transition as the IP state is migrated to the new redundant pair.

Q: How do I manually force a failover for maintenance?
Execute systemctl stop keepalived on the current Master VR. The Backup will detect the loss of heartbeat and promote itself within seconds. This is the safest way to perform kernel updates or resource scaling on the Master instance.