The Internet is increasingly used to transport time-critical traffic. Applications like video conferencing, television, telephony and distributed games place strict requirements on the delay and availability offered by the underlying network. At the same time, connectivity failures caused by failures in network equipment are a part of everyday operation in large communication systems. The traditional recovery mechanisms used in IP networks are not designed with real-time applications in mind. The distributed nature of popular intradomain routing protocols allows them to eventually recover from any number of failures that leaves the network connected, but this is a time-consuming process that can lead to unacceptable performance degradation for some applications.
In this work, we argue that there is a need for fast recovery mechanisms that allow packet forwarding to continue over alternate paths immediately after a failure, before the routing protocol has converged on the altered topology. To give rapid response, such mechanisms should be proactive in the sense that an alternate route is readily available when a failure is discovered, and local, so that the recovery action can be effected by the node that discovers the failure. Further, care should be taken so that the shifting of recovered traffic to an alternate route does not lead to congestion and packet loss in other parts of the network.
We present and investigate mechanisms that can respond quickly to failures or unexpected traffic shifts in the network. First, we evaluate the recovery strategy used in a network protocol called Resilient Packet Ring (RPR). The ring topology used in RPR allows the implementation of very fast protection mechanisms. We look at the performance of these mechanisms, and propose improvements that reduce packet loss and shorten the experienced disruption time after a link or node failure. Then, in the main part of this work, we focus on fast recovery in general mesh networks. We present Resilient Routing Layers (RRL) and Multiple Routing Configurations (MRC), which are methods for near-instantaneous recovery from component failures in packet networks. We discuss and evaluate our mechanisms with respect to state requirements and distribution of the recovered traffic. For MRC, we move on to present methods for reducing the chances of congestion after a recovery operation. We show that if we have knowledge about the traffic demands, we can use this information to create MRC recovery paths that avoid the most heavily used parts of the network. Finally, we show how the concepts used in RRL and MRC to recover from component failures can also be used to avoid congestion when there are sudden shifts in the traffic distribution. Our method is more flexible than traditional traffic engineering methods used in connectionless IP networks, since it does not involve changing link weights to respond to a changed traffic situation.
Fast recovery mechanisms like those proposed in this work can help improve the stability and availability of IP networks. This is an important requirement for enabling new and existing real-time applications over general-purpose Internet infrastructure.
Analysis and improved performance of RPR protection. Amund Kvalbein and Stein Gjessing. Published: 12th IEEE International Conference on Networks (ICON). Pages 119-124, vol. 1. Singapore, Nov. 16-19, 2004.
Protection of RPR strict order traffic. Amund Kvalbein and Stein Gjessing. Published: 14th IEEE Workshop on Local and Metropolitan Area Networks (LANMAN). Chania, Crete, Sept. 18-21, 2005.
Fast recovery from link failures using Resilient Routing Layers. Amund Kvalbein, Audun Fosselie Hansen, Tarik Cicic, Stein Gjessing and Olav Lysne. Published: 10th IEEE Symposium on Computers and Communications (ISCC). Pages 554-560. Cartagena, Spain, Jun. 27-30, 2005.
Fast IP Network Recovery using Multiple Routing Configurations. Amund Kvalbein, Audun Fosselie Hansen, Tarik Cicic, Stein Gjessing and Olav Lysne. Published: 25th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM). Barcelona, Spain, Apr. 23-29, 2006.
Post-Failure Routing Performance with Multiple Routing Configurations. Amund Kvalbein, Tarik Cicic and Stein Gjessing. Published: 26th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM). Anchorage, Alaska, May 6-12, 2007.
Robust Load Balancing using Multi-Topology Routing. Amund Kvalbein and Olav Lysne. Published: ACM SIGCOMM Workshop on Internet Network Management. Kyoto, Japan, Aug. 27-31, 2007.