A basic tenet of infrastructure deployment for service providers and operators is to avoid introducing any platform, system or software that could destabilize network operation. To keep the network running consistently and smoothly, service providers demand platforms that offer 99.999 percent availability, which allows a little over five minutes of downtime per year. It has been demonstrated that network outages lasting from 10 minutes to several hours have a direct negative impact on a service provider’s business. The cost of long downtime can be quantified through SLA penalty clauses, and it also carries an inherent opportunity cost in the form of a higher customer churn rate and a poorer image in the industry.
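The downtime figure follows directly from the availability percentage. A minimal sketch of the arithmetic, assuming an average year of 365.25 days (the function name is illustrative only):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap years included


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


# Compare three, four and five nines.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes/year")
```

Under this assumption, 5 9s works out to roughly 5.26 minutes per year, so a single 10-minute outage already consumes about two years of budget.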
NFV and virtualized network functions (VNFs) have complicated this issue further. While the promise of a lower TCO is naturally tempting, service providers’ fundamental requirements do not change. VNF or not, they demand carrier-grade, highly available (5 9s or better) systems to ensure that mission-critical applications are protected.
To ensure high availability, there should be redundancy at every level: the network (a shadow network), the system (for example, a backup router), the hardware (for example, a backup control plane card), and the processors or other chips. For an NFV-based solution, any virtualized function that performs network- or application-critical functions must also offer 5 9s availability. Such functions include:
1. Network protocols that handle the control planes (routing, signaling)
2. Network services (for example, application delivery controllers, DPI, CDN, firewalls, load balancers)
3. Packet core (SGSN-MME, S/P gateways)
4. Subscriber/Business connectivity (PPP, DHCP, GTP connections and tunnels)
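To see why redundancy is the usual route to 5 9s for critical functions like these, the standard parallel-availability formula can be worked through. This is a sketch under idealized assumptions: replicas fail independently and failover is instantaneous and always succeeds. Real deployments fall short of these numbers because of common-mode failures and switchover time.

```python
def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` independent replicas when one survivor suffices.

    The group is down only if every replica is down at the same time,
    which under independence has probability (1 - a) ** units.
    """
    return 1.0 - (1.0 - unit_availability) ** units


single = 0.999  # one unit at three nines (illustrative value)
pair = parallel_availability(single, 2)
print(f"single unit: {single:.6f}, 1+1 pair: {pair:.6f}")
```

Under these assumptions, two three-nines units in a 1+1 arrangement reach six nines on paper, which is why a pair of modest components can outperform a single, far more expensive one.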
The advantage of SDN/VNF-based software is its ability to scale out programmatically based on an a priori set of rules. However, to ensure that a connection is not lost, or that the network does not have to go through a major re-convergence of resources such as routes, scale-out must complete on the order of milliseconds. This could be challenging to achieve through scale-out alone. It is better to assign virtual machines that back up critical parts of the network operation. The VMs must reside on different boards, and preferably on different servers, to protect the network from software crashes that could bring a board or the entire system down. Naturally, the active VM and the stateful backup VM will communicate via some sort of “hello” protocol to stay aware of each other’s state and to share an updated database of resources, for example, routing tables. The backup VM could be a standby or, preferably, an active one used for load balancing. Of course, an efficient design would give a separate backup VM only to those software entities that need protection. For example, the control plane of a router needs 1+1 backup, whereas the forwarding plane can make do with an N+1 backup scheme.
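The hello exchange between an active VM and its stateful backup can be sketched as follows. This is a simplified illustration, not any vendor’s protocol: the class name, the 10 ms hello interval, and the three-hello dead multiplier are all assumptions, and timestamps are passed in explicitly so the logic stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class StandbyVM:
    """Hypothetical standby that tracks hellos from the active VM."""
    hello_interval_ms: int = 10   # assumed hello period
    dead_multiplier: int = 3      # assumed number of missed hellos tolerated
    last_hello_ms: int = 0
    role: str = "standby"
    routing_table: dict = field(default_factory=dict)

    def on_hello(self, now_ms: int, routes: dict) -> None:
        """Record a hello and replace the local copy of the active VM's routes."""
        self.last_hello_ms = now_ms
        self.routing_table = dict(routes)

    def tick(self, now_ms: int) -> str:
        """Promote to active once the dead interval passes without a hello."""
        dead_interval_ms = self.hello_interval_ms * self.dead_multiplier
        if self.role == "standby" and now_ms - self.last_hello_ms > dead_interval_ms:
            self.role = "active"
        return self.role


standby = StandbyVM()
standby.on_hello(now_ms=10, routes={"10.0.0.0/8": "eth0"})
standby.on_hello(now_ms=20, routes={"10.0.0.0/8": "eth0", "192.168.0.0/16": "eth1"})
print(standby.tick(now_ms=40))  # 20 ms of silence, within the 30 ms dead interval
print(standby.tick(now_ms=51))  # 31 ms of silence, past the dead interval
```

Because the routing table is refreshed with every hello, the standby takes over with the last state it received rather than re-learning routes, which is what keeps failover within a millisecond-scale budget in this sketch.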
The ETSI NFV Expert Group on Availability and Resiliency stipulated this requirement in its specification [paraphrasing]: single points of failure for VNFs must be prevented by deploying “independent” NFVI domains. An NFV implementation should consider a geographically redundant deployment to bring high availability to VNFs.
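One way to reason about this requirement is as a placement check: an active VM and its backup should not share a failure domain. The sketch below is illustrative only; the `Placement` fields and the three domain levels are assumptions chosen for this example, not terms taken from the ETSI specification.

```python
from typing import List, NamedTuple


class Placement(NamedTuple):
    """Hypothetical record of where a VM instance runs."""
    vm: str
    server: str
    nfvi_domain: str
    site: str


def shared_failure_domains(active: Placement, backup: Placement) -> List[str]:
    """Return the failure-domain levels that both placements have in common."""
    shared = []
    for level in ("server", "nfvi_domain", "site"):
        if getattr(active, level) == getattr(backup, level):
            shared.append(level)
    return shared


active = Placement("vrouter-cp-a", "server-1", "domain-east", "site-A")
backup = Placement("vrouter-cp-b", "server-2", "domain-east", "site-A")
print(shared_failure_domains(active, backup))
```

Here the pair sits on different servers but in the same NFVI domain and site, so the check reports two shared levels; an empty list would indicate a geographically redundant placement with no common failure domain at these levels.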
Vendors have followed this directive, and there are some novel and viable approaches that can implement it. One example is Wind River’s Titanium Server, which introduces both hardware redundancy and software resiliency to the VNFs that run on it. Another novel approach has been taken by Stratus Technologies with its Software Defined Availability (SDA), which moves downtime prevention and recovery from the hardware or the OS to an “automated” software layer. When a failure occurs, a previously paired VM is brought back up, leveraging the cloud to run the application under protection. Stratus claims that with SDA “any application with any availability need can be run in the cloud with application transparency.” What makes the design novel is the company’s claim that no application code changes are required to benefit from SDA. Pairs of VMs are created across servers, and the state of the VMs is captured regularly and asynchronously, offering a stateful operational mode.
Clearly, the industry is on the right track toward protecting the VNFs that need it. Vendors can turn their approach into a competitive advantage if they can demonstrate 5 9s while using resources efficiently.