PAPER Special Section on Next Generation Network Management Experience with Restoration of Asia Pacific Network Failures from

We explain how network failures were caused by a natural disaster, describe the restoration steps that were taken, and present lessons learned from the recovery. At 21:26 on December 26th (UTC+9), 2006, there was a serious undersea earthquake off the coast of Taiwan, which measured 7.1 on the Richter scale. This earthquake caused significant damage to submarine cable systems. The resulting fiber cable failures shut down communications in several countries in the Asia Pacific networks. In the first post-earthquake recovery step, BGP routers detoured traffic along redundant backup paths, which provided poor quality connection. Subsequently, operators engineered traffic to improve the quality of recovered communication. To avoid filling narrow-bandwidth links with detoured traffic, the operators had to change the BGP routing policy. Despite the routing-level first aid, a few institutions could not be directly connected to the R&E network community because they had only a single link to the network. For these single-link networks, the commodity link was temporarily used for connectivity. Then, cable connection configurations at the switches were changed to provide high bandwidth and next-generation Internet service. From the whole restoration procedure, we learned that redundant BGP routing information is useful for recovering connectivity but not for providing available bandwidth for the re-routed traffic load and that collaboration between operators is valuable in solving traffic engineering issues such as poor-quality re-routing and lost connections of single-link networks.


Introduction
As the Internet grows, networks become larger and more complex, and the number of components, such as routers, switches, and fiber cables, increases. In complicated network systems, it is difficult to implement global network management across several Internet service providers (ISPs) that use many network components in a large-scale network topology. Fault management is a particularly important network management issue in complex network systems because the Internet has become essential to business and research. However, we are only beginning to learn how to deal with global network failures in large networks.
Failures have been reported [1], [2] in the Sprint Internet protocol (IP) backbone, which shows that failures can be observed in everyday operation. However, the network failures observed by Iannaccone et al. and Markopoulou et al. [1], [2] were short-lived and small scale, and their impacts were analyzed only in the context of a single ISP. Most network backup or fault restoration methods have been studied and proposed for the various layers such as wavelength division multiplexing (WDM), multi-protocol label switching (MPLS), or IP [3]-[7]. Yet, the proposed backup and restoration methods have not been fully implemented and deployed in real networks. Since real networks are more complicated than theoretical ones, the impacts of network failures on users and ISPs cannot be completely predicted and analyzed. Significant network failures due to natural disasters such as earthquakes, floods, or fires could have a particularly wide impact on several ISPs.
We discuss the results of the critical network failures that occurred after the Taiwan earthquake in Dec. 2006, which cut fibers and caused network failures. We also explain how restoration methods such as automatic border gateway protocol (BGP) [8] re-routing, BGP policy change, and switch reconfiguration were conducted. We hope that the experience and knowledge we gained during the process of recovering from this huge natural disaster, which affected the global Internet, can be shared and can contribute to future Internet network management research. To the best of our knowledge, this is the first detailed study of network restoration after global network failures due to a natural disaster.
Although many natural disasters have occurred in the 21st century, until recently there had been no simultaneous outage of the global Internet backbone. However, the earthquake that occurred around Taiwan in 2006 made several Asia Pacific Research and Education (R&E) networks unreachable. At 21:26 on December 26th (UTC+9), 2006, there was a big undersea earthquake off the coast of Taiwan, which measured 7.1 on the Richter scale. This earthquake caused significant damage to the undersea fiber cable systems in that area. Several ISPs were affected because each cable system is shared by multiple ISPs. This earthquake had the effect of dividing the Asia Pacific R&E networks into an eastern and a western group. The Asia Pacific R&E networks were, in particular, seriously damaged and were fully restored only after several restoration steps, including automatic BGP re-routing, BGP policy changes, and switch port reconfigurations, were taken.
The first step in recovery after the earthquake was taken automatically by BGP routers, which detoured traffic along redundant routes. In BGP routing, there are usually multiple redundant AS paths. Redundant BGP routes served as backup paths but provided poor-quality connectivity, i.e., a long round trip time (RTT). Because of the congestion on the narrow-bandwidth link that was subsequently reported, operators took manual control of traffic to improve communication quality. The second step was a traffic engineering process intended to prevent narrow-bandwidth links from filling up with detoured traffic. The operators changed the BGP routing policy related to the congested ASs. In spite of the routing-level restoration, a few institutions were still not directly connected to the R&E network community because they had only a single link to the network. For these single-link networks, the commodity link was used temporarily for connectivity. However, the commodity link was neither stable nor sufficient to carry a huge amount of traffic or to provide next-generation Internet service. To restore the single-link networks, cable connection configurations at the switches were changed.
Copyright © 2007 The Institute of Electronics, Information and Communication Engineers
The fiber break caused by the Taiwan earthquake raised restoration issues related to BGP re-routing. In such an emergency, the backup routes should be chosen based on available bandwidth and RTT. Since the fiber break required an urgent network recovery process, network operators configured re-routing based on their experience with bandwidth and RTT.
From this experience, we have learned that redundant physical backup links and routes are important for providing bandwidth and connectivity and that the Quality-of-Service (QoS) after recovery is also important. From the viewpoint of restoration after network failures, there are still challenges that cannot be automatically overcome by network management systems. A systematic risk management plan that includes collaboration among operators of the next-generation Internet is needed.
The remainder of this paper is organized as follows. In Sect. 2, the Asia Pacific R&E networks that were damaged by the earthquake or related events are introduced. Section 3 is a detailed report of the network failures that were observed after the earthquake. Section 4 describes the processes for restoring the disrupted communications in the area. Section 5 discusses what we have learned from the observation of the network failures and recovery processes. Finally, we conclude the paper in Sect. 6.

Asia Pacific R&E Networks
2.1 Asian Internet Interconnection Initiative (AIII) [9]
AIII, which is the first next-generation R&E network project in Asia, was started in 1996. The basic idea of AIII is to provide Internet service via satellites to countries that do not have a wired infrastructure. AIII members are located in TH, MY, HK, ID, and SG†. Recently, NP joined the project. However, AIII was not very successful because of its limited bandwidth (from 1.5 Mbps to 8 Mbps), which is not wide enough to support high-bandwidth research activities. These days, AIII concentrates on developing and deploying advanced network technologies such as IPv6 unicast/multicast, uni-directional link routing (UDLR), and advanced TCP.

2.2 Asia-Pacific Advanced Network (APAN) [11]
APAN, which was started in 1997, is a research consortium of advanced Asian networks. High bandwidth and a collaborative NOC are provided to interconnect the advanced Asian networks, which include JGNII, SINET, KOREN, SingAREN, CERNET, CSTNET, AARNET, TransPAC2, and others. The basic operating policy of APAN is to support high-performance data transfer service for research and educational communities in the Asia-Pacific area.

Trans-Eurasian Information Network 2 (TEIN2) [12]
The Trans-Eurasian Information Network, TEIN or TEIN1, is the EU project that connects Europe to Korea. TEIN started with a bandwidth of 10 Mbps. TEIN2, with an upgraded bandwidth of 45 Mbps to 1 Gbps, interconnects Southeast Asian and European networks.

Fiber Breaks
On December 26th, 2006 (UTC+9) there were two huge earthquakes near Taiwan. The first earthquake happened at 20:26 (UTC+8) [13], and the second one at 20:34 (UTC+8) [14]. Fortunately, the earthquakes took place under the sea and the cities in TW were not heavily damaged, as they had been in 1999. However, these two earthquakes did cause landslides over a wide area on the seabed near Taiwan. At 04:00 (UTC+9) on December 27th, 2006, that is, after the second earthquake, the R&E networks in the Asian area were shut down. The cable companies investigated the reason for the lost connection and found that the earthquake had caused damage to the cable systems.
The circle in Fig. 1 shows the area where the cable systems were cut off. Most of the fiber cables in the eastern Asia area went through southwestern Taiwan. These cables were generally bought and shared by different telecom companies.

Internet Disconnections and Lost BGP Peerings
After the earthquake, both commodity Internet traffic and R&E traffic were cut off. For instance, the JP-PH, JP-CN, JP-SG, CN-US, HK-KR, TW-(HK+CN), and TW-SG connections were lost. That is, the R&E network communities were divided into two groups. One was the group that consisted of JP, KR, TW, and US, and the other consisted of CN, HK, VN, MY, TH, SG, ID, and PH (Fig. 2).
The Internet disconnection occurred in the following order.
1. Link-layer disconnection because of the fiber cut
2. Lost primary BGP peerings
3. Automatic BGP re-routing along the alternative peer, if any
Figure 3 shows the Internet traffic weather map at APAN Tokyo XP [16], which displays connectivity and link utilization in real time. In Fig. 3, it can be observed that the JP-HK-CN, JP-TH, JP-SG, and JP-PH communications were lost and that there was 0% link utilization except for a 100 Mbps load between JP and KR.
The link-layer disconnection caused BGP sessions to expire. BGP peerings from JP to HK+CN, TH, SG, and PH were lost and automatically diverted to detour routes.
In general, when traffic is transferred to the detour AS routes, the traffic will flow along the longer AS path rather than the usual one, because the shortest AS path will be selected as the primary AS path according to BGP policy.
Figure 4 shows the AS path changes of each IP prefix observed just after the earthquake from the QGPOP BGP router in JP. It can be seen in Fig. 4 that more than 1,000 IP prefixes experienced AS path changes after the earthquake.
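The kind of per-prefix comparison behind such a count can be sketched as follows. This is only an illustration, not the tooling used at the QGPOP router: the snapshot format (a mapping from prefix to AS path), the function name `diff_as_paths`, and all prefixes and AS numbers (taken from documentation ranges) are assumptions.

```python
from typing import Dict, List, Tuple

# A routing-table snapshot: IP prefix -> AS path (tuple of AS numbers).
Rib = Dict[str, Tuple[int, ...]]


def diff_as_paths(before: Rib, after: Rib) -> Dict[str, List[str]]:
    """Classify prefixes by comparing two routing-table snapshots."""
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    withdrawn = sorted(p for p in before if p not in after)
    announced = sorted(p for p in after if p not in before)
    return {"changed": changed, "withdrawn": withdrawn, "announced": announced}


# Hypothetical snapshots taken before and after a failure.
before = {
    "192.0.2.0/24": (64500, 64501),
    "198.51.100.0/24": (64500, 64502),
    "203.0.113.0/24": (64500, 64503),
}
after = {
    "192.0.2.0/24": (64500, 64510, 64511, 64501),  # detoured along a longer path
    "198.51.100.0/24": (64500, 64502),             # unchanged
}

result = diff_as_paths(before, after)
print(len(result["changed"]), "changed;", len(result["withdrawn"]), "withdrawn")
```

Counting the entries in `changed` over real snapshots would give a figure comparable to the number of prefixes with AS path changes reported above.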

Traffic Load Changes
Fortunately, despite the earthquake, the CN-KR and KR-JP cables were unbroken. Therefore, we were able to observe the detour traffic along these links due to BGP re-routing.
Figure 5 shows the traffic between JP and KR on Dec. 27th, 2006. At about 04:30 (UTC+9) the inbound traffic pattern between JP and KR changed dramatically. At the same time, as can be seen in Fig. 6, the traffic between JP and PH disappeared.
The reason for the traffic change could be inferred from the routing policy of APAN Tokyo XP. The route from CN to JP through KR was one of the lowest-priority routes, but after the earthquake, it was chosen because there were no available BGP routes with higher priorities.

Changes of BGP Routing Tables
Table 1 shows the registered routing policy table of the Asia Pacific R&E networks. It can be observed that ASTI (PH) lost connectivity to the R&E networks, that CSTNET (CN) lost eastbound routes, and that APAN-JP (JP) lost connectivity to TEIN2.
CERNET (CN) had a direct connection to TW, but after the earthquake its connection was expected to be changed to the path through the US, as shown in Fig. 7. However, the routing policy arrangements differed between JP and KR. The forward path between CN and TW chose the route through JP, but the return path chose the route through the US, as shown in Fig. 8.
In addition, routing from JP to the west Asian networks was connected through the US. The direct link between CSTNET (CN) and the US was also damaged.
Both CERNET and APAN Tokyo XP expected that the detour for JP-CN traffic should be through the US, not through KR. Figure 7 shows the expected BGP route and the actual route between APAN-TW and CN. Since only APAN Tokyo XP implemented the strict routing policy, the CN traffic chose the shortest AS path. However, the traffic from JP to CN followed a routing policy that does not choose the route through KR.
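The asymmetry between the forward and return paths of the CN-TW traffic can be checked mechanically once both paths are known, for example from traceroute or BGP data in each direction. The sketch below is a minimal illustration: the helper names are hypothetical, and the hop lists are simplified to the regions named in the text, whereas real paths contain more ASs.

```python
from typing import List, Sequence


def is_symmetric(forward: Sequence[str], ret: Sequence[str]) -> bool:
    """True if the return path retraces the forward path hop by hop."""
    return list(forward) == list(reversed(ret))


def asymmetric_hops(forward: Sequence[str], ret: Sequence[str]) -> List[str]:
    """Intermediate hops that appear in only one direction."""
    return sorted(set(forward[1:-1]) ^ set(ret[1:-1]))


# Simplified hop lists for the CN-TW case (assumed, not measured).
forward_path = ["CN", "JP", "TW"]  # CN -> TW via JP
return_path = ["TW", "US", "CN"]   # TW -> CN via US

print(is_symmetric(forward_path, return_path))
print(asymmetric_hops(forward_path, return_path))
```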

AS-Level Topology Changes
To investigate the BGP route changes in detail, we used an AS-topology visualization tool called ABEL2 [18] that utilizes BGP routing tables that are stored every 10 minutes. As shown in Fig. 10, the route from APAN-KR to GÉANT was switched through CERNET at 20:30 on Dec. 27th (UTC+9), 2006. At 20:50 on Dec. 17th, 2006 (UTC+9), the TEIN2-SG NOC announced the route to the EU, too. Since the link bandwidth between SG and KR is larger than that between CN and KR, the operator configured the BGP routers to choose the AS path with the KR-SG link (Fig. 11).

Delay Changes
Figure 12 shows the RTT between SG and JP just after the earthquake and the recovery process.
On December 27th, after the fiber cut, automatic BGP re-routing was carried out between SG and JP. The route from SG to JP became SG-AU-Hawaii-JP instead of the direct link, and its RTT increased to 426 ms, while its normal RTT is around 88 ms.
When the link between SG and KR was temporarily recovered with the backup fiber, the RTT between SG and JP via KR was 240 ms on December 28th.
Finally, on January 12th, when the link between JP and SG was recovered with the direct fiber, the RTT was reduced to 113 ms, which is slightly higher than the usual value.
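To put these RTT values in proportion, the short calculation below expresses each stage relative to the normal RTT. It uses only the figures quoted above; the stage labels and the helper `rtt_inflation` are illustrative, and 88 ms is treated as an exact baseline even though the text gives it as approximate.

```python
NORMAL_RTT_MS = 88  # "around 88 ms" in the text, used here as the baseline

# (stage label, measured RTT in ms) as reported for SG-JP
stages = [
    ("Dec. 27, SG-AU-Hawaii-JP detour", 426),
    ("Dec. 28, via KR on backup fiber", 240),
    ("Jan. 12, direct fiber restored", 113),
]


def rtt_inflation(rtt_ms: float, baseline_ms: float = NORMAL_RTT_MS) -> float:
    """Ratio of a measured RTT to the normal RTT."""
    return rtt_ms / baseline_ms


for label, rtt in stages:
    extra = rtt - NORMAL_RTT_MS
    print(f"{label}: {rtt} ms = {rtt_inflation(rtt):.2f}x normal (+{extra} ms)")
```

The automatic detour thus ran at roughly 4.8 times the normal RTT, the backup fiber via KR at about 2.7 times, and the restored direct fiber at about 1.3 times.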

Network Failure Restoration Methods
After the network failures caused by the earthquake, several restoration steps were taken to restore communication. In this section, we discuss these steps.

Automatic BGP Re-Routing
Usually, the full BGP routing table includes a few useless routes (Table 2). By useless we mean that the route itself provides only connectivity, with a long RTT and insufficient bandwidth. Therefore, the network operators filter out such useless routes by setting the local preference to ignore them. However, after the earthquake, these useless BGP routes worked automatically as backup paths. In the Asia Pacific R&E networks, the routes became very complicated after TEIN2 started because TEIN2 provided a few unexpected routes around the world. Because there were backup AS paths, automatic BGP re-routing could be used as first aid to provide connectivity to the ASs that lost their primary paths. However, automatic BGP re-routing did not consider traffic engineering parameters such as the available bandwidth and the backup traffic load.
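How a deprioritized route takes over when the primary disappears can be illustrated with a much-simplified model of BGP route selection. The sketch below considers only local preference and AS path length, which are the first steps of the usual decision process; the remaining tie-breakers are omitted. The `Route` class, the preference values, the prefix, and the AS numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: Tuple[int, ...]
    local_pref: int = 100  # higher is preferred


def best_route(candidates: List[Route]) -> Optional[Route]:
    """Highest local preference wins; ties go to the shortest AS path."""
    if not candidates:
        return None
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))


# Hypothetical routes to one prefix: a direct primary and a long detour
# that the operator has deprioritized with a low local preference.
primary = Route("192.0.2.0/24", (64501,), local_pref=100)
detour = Route("192.0.2.0/24", (64510, 64511, 64501), local_pref=50)

print(best_route([primary, detour]).as_path)  # primary while it is available
print(best_route([detour]).as_path)           # detour once the primary is withdrawn
```

Nothing in this selection looks at link capacity or load, which is the limitation noted above: the detour is chosen simply because it is the only route left.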

Traffic Engineering with BGP Policy Change
BGP by itself does not provide any information regarding link capacity or available bandwidth. Moreover, due to recent VLAN [19] technology, the distance between two ASs has no relation to physical distance. Thus, QoS information of the detour routes must be examined by the operators. This makes systems reliant on human knowledge of traffic engineering. To remove the congestion due to the long detour AS path, we changed the BGP routing policy as shown in Fig. 10 and Fig. 11.
The members of TEIN2 (VN, MY, SG, ID, PH) lost their connections to APAN Tokyo XP because the fiber broke. AARNET NOC proposed backup routes for accessing APAN Tokyo XP through AU and Hawaii. However, this solution caused congestion on both the CN-KR and Hawaii-JP links. Besides, CN traffic took an asymmetrical path.
Therefore, to solve the traffic engineering issue, Tokyo XP made a decision to divide CN traffic by announcing CN IP prefixes through KR NOC and grouping CN prefixes at Tokyo XP. The results were monitored by Cisco NetFlow [20]. The operators found out that half of the KR traffic was from CN. After a careful examination, it was discovered that a part of the CN traffic was from CERNET but the other part was from TEIN2.
Figure 13 shows the traffic load for each source AS. Although the total traffic was about 0.4 Gbps, the real KR traffic was about 0.2 Gbps. Of the remainder, 0.1 Gbps was CERNET traffic and the rest was TEIN2 traffic.
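A per-source-AS breakdown of this kind amounts to grouping flow records by source AS and summing their rates. The sketch below is not the NetFlow tooling used at Tokyo XP: the record format (source label, rate in Mbps) and the individual flows are invented, chosen so that the totals match the approximate figures quoted above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def load_by_source_as(flows: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the traffic rate (Mbps) of flow records per source AS."""
    totals: Dict[str, int] = defaultdict(int)
    for src_as, mbps in flows:
        totals[src_as] += mbps
    return dict(totals)


# Hypothetical flow records seen on the JP-KR link: (source AS, Mbps).
flows = [
    ("KR", 120), ("CERNET", 60), ("KR", 80), ("TEIN2", 100), ("CERNET", 40),
]

totals = load_by_source_as(flows)
link_total = sum(totals.values())
for src_as, mbps in sorted(totals.items()):
    print(f"{src_as}: {mbps} Mbps ({100 * mbps / link_total:.0f}% of link load)")
```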

Port Reconfiguration
In spite of the recovery steps, a few sites that had only single-link connections to the Internet could not directly reach the R&E networks. Therefore, after the fiber was fixed, a few single-link sites were connected via the commodity Internet service. To fix this problem, the operators had to change the port configuration at switches that were able to provide high-performance connections.
Two days after the earthquake, the link between KR and SG was restored and SG started making routing announcements to TEIN2 members (ID, SG, TH, MY). However, the link between SG and KR was not the original fiber, and its RTT increased greatly because a detour route was assigned. Similarly, the direct link between PH and JP was replaced with a detour route with a long RTT. When an online demonstration was being prepared for the 2007 APAN meeting held in PH between January 22nd and 26th, 2007, the restored JP-PH link had a very long RTT because it went through mainland China, as shown in Fig. 14.
During the 2007 APAN meeting demonstration, the Prince of Wales Hospital at the Chinese University of Hong Kong (CUHK) was expected to join. However, its IP prefix was announced only by CSTNET. Thus, it was only reachable over a commodity Internet link with a small bandwidth. To solve this problem, APAN Tokyo XP changed the port configuration of fibers. That is, the CSTNET fiber for accessing APAN Tokyo XP was plugged into the TEIN2-HK router. As a result, CUHK was directly connected to TEIN2. With this solution, CUHK and CSTNET were supplied with high-bandwidth, short-RTT connections to the R&E network.

Lessons
From the process of recovering from network failures across several ISPs in Asia Pacific R&E networks, we encountered several network management challenges, especially regarding fault management. In this section, we describe the lessons learned during the recovery operations.

Fault-Tolerant Fiber Topology Design
Neither the customers nor the NOC engineers consider fiber topology. However, a fault-tolerant physical fiber topology is important for providing the backup routes in case of link failures. Generally, the engineers design multiple fiber connections for backup. Our experience showed that multiple fiber cores should be prepared in separate conduits in different areas, because multiple fibers could break in the same region.
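Whether a backup link is really independent of the primary can be checked by comparing the regions or conduits that each link traverses. The sketch below is a minimal shared-risk check under assumed data: the link names, the region labels, and the function `shared_risks` are hypothetical and do not describe the actual cable systems.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def shared_risks(link_regions: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return every pair of links that traverses at least one common region."""
    conflicts: Dict[Tuple[str, str], Set[str]] = {}
    for a, b in combinations(sorted(link_regions), 2):
        common = link_regions[a] & link_regions[b]
        if common:
            conflicts[(a, b)] = common
    return conflicts


# Hypothetical inventory: link name -> regions (or conduits) it passes through.
links = {
    "primary": {"region-A", "region-B"},
    "backup-1": {"region-B", "region-C"},
    "backup-2": {"region-D"},
}

conflicts = shared_risks(links)
for pair, regions in conflicts.items():
    print(pair, "share", sorted(regions))
```

In this invented inventory, `backup-1` would fail together with `primary` if `region-B` were hit, while `backup-2` is region-disjoint from both.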

Traffic Engineering-Aware BGP Routing against Link Failures
BGP routing policies are usually made to avoid asymmetric or useless AS paths by setting the appropriate local preference values. However, these alternative AS paths worked as backup paths. Before the network failures caused by the earthquake, Asia Pacific R&E network operators thought that removing the useless routes was urgent, because routing became too complicated after TEIN2 started. However, this complicated routing was able to provide valuable connections during network failures. This shows that maintaining full-mesh style routing information is very important for fault-tolerant routing.
Though BGP re-routing over the redundant AS paths was successful as the first step in restoration, it was not sufficient to provide full backup service without congestion, because it did not consider the traffic load. Since BGP routing does not carry QoS information, such as link capacity, link utilization, or available bandwidth, traffic re-routed to the backup AS path experienced poor QoS, such as long delays. Therefore, QoS-aware BGP routing or traffic engineering-aware BGP routing is necessary.

Integrated Network Management
During the restoration process, we used various network monitoring tools such as MRTG, a network weather map, a BGP routing table visualizer, and a flow monitor. At first, the link outage was noticed on the network weather map, and the abrupt change of traffic load was noticed on MRTG. However, a fast fault detection method that encompasses the physical, link, routing, and application layers is necessary because it would be able to identify the exact failure points and visualize their impacts on the network. In addition, a simulator or emulator that could show the results with the network topology and the traffic load before and after failures would be very useful in predicting the effects of fault-management decisions. While we took the various restoration steps, we had to process the information collected by each different network-monitoring tool. In the end, the operators interpreted the situation and implemented recovery decisions manually. If iperf [21] or bwctl [22] were available throughout the network, the end-to-end available bandwidth between ASs could be easily estimated. For example, to access Sydney from Tokyo, there are two possible routes. One is Tokyo-Seattle-Sydney, and the other is Tokyo-Honolulu-Sydney. The former provides 10 Gbps but has a long RTT. The latter route includes a bottleneck along the 155 Mbps path but has a short RTT. In addition, to make the final decision, we had to check the flow data, because neither MRTG [23] nor RRDTool [24] breaks traffic down by source/destination AS. When the traffic from KR increased suddenly, the operators could not understand why. This shows that integrated network monitoring or management systems would be very useful for collecting information from several independent monitoring systems and for providing the correct information in an integrated wide view in case of significant network failures.
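The Tokyo-Sydney example shows the kind of decision that an integrated system could support: given the bottleneck bandwidth and RTT of each candidate route, pick the shortest-RTT route that can still carry the expected load. The sketch below is one possible rule, not the procedure the operators followed. Only the 10 Gbps and 155 Mbps figures come from the text; the other link capacity, both RTT values, and the demand figures are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    link_capacities_mbps: List[float]  # capacity of each link along the route
    rtt_ms: float

    @property
    def bottleneck_mbps(self) -> float:
        """The slowest link bounds the bandwidth of the whole route."""
        return min(self.link_capacities_mbps)


def pick_route(candidates: List[Candidate], demand_mbps: float) -> Optional[Candidate]:
    """Among routes whose bottleneck can carry the demand, take the lowest RTT."""
    feasible = [c for c in candidates if c.bottleneck_mbps >= demand_mbps]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.rtt_ms)


routes = [
    # RTT values and the 622 Mbps link are placeholders, not measurements.
    Candidate("Tokyo-Seattle-Sydney", [10_000, 10_000], rtt_ms=250),
    Candidate("Tokyo-Honolulu-Sydney", [622, 155], rtt_ms=150),
]

for demand in (100, 400):
    chosen = pick_route(routes, demand)
    print(f"{demand} Mbps demand -> {chosen.name if chosen else 'no feasible route'}")
```

With these placeholder values, a 100 Mbps demand fits the short-RTT Honolulu route, while a 400 Mbps demand exceeds its 155 Mbps bottleneck and falls back to the Seattle route. In practice the capacities would come from measurements such as iperf or bwctl runs and would reflect available rather than nominal bandwidth.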

Emergency Communication between Operators
After the earthquake, communication among NOCs was difficult because the fiber break disrupted VoIP and legacy telephone service. Moreover, the earthquake happened on December 26th, 2006, overlapping with the Christmas holiday. Thus, all communication relied on instant messaging and e-mail, which were routed over the detoured network even though it provided poor-quality service. It became obvious that emergency communication should be guaranteed in case of failures so that the recovery process can be started quickly.

Conclusion
Since the Internet continues to grow globally and becomes ever more important in daily life, business, and research, the need for fault-tolerant service in network management becomes more urgent. However, during the network failures caused by the 2006 earthquake, it was shown that there are still many challenges in fault-tolerant network management research.
Even though multiple fiber cores are installed together to provide backup service, they may be useless during severe natural disasters. Therefore, a full-mesh or fiber-disjoint physical network topology should be designed for use during failures. On the available topology, it was seen that BGP routing provided backup AS paths, which was useful for the first step in restoration. However, the traffic engineering issues during restoration were difficult to solve because all the information, such as link capacity, available bandwidth, link delay, traffic load, and routing policy, had to be collected, interpreted, and acted on by human operators. In spite of BGP re-routing, we had to deal with a few single-link ASs to establish direct connections to the R&E networks.
From this experience of network recovery during a significant natural disaster affecting several different countries and ISPs, we were able to gather valuable information on network management during emergencies. Therefore, in the Internet of the future, designers should focus on fault-tolerant network management study including robust physical topology, cross-layer restoration, traffic engineering combined with BGP routing, and simulation of failures in the network.
Tokyo XP both in their capacity as paid staff and as volunteers. Yutaka Watanabe, OTCs director, took the MRTG snapshot of Figs.

Figures 9, 10, and 11 show the changes between APAN-KR (AS9270) and GÉANT (AS20965) by the number of IP prefixes. It can be observed in Fig. 9 that just after the earthquake, at 19:30 on December 27th (UTC+9), 2006, the route from APAN-KR to GÉANT was diverted to the long AS path APAN-KR - APAN-JP - TransPAC (US) - Abilene (US) - GÉANT because of the TEIN2 link outage. Therefore, to connect APAN-KR to GÉANT with a shorter AS path, the operator configured the BGP routing policy to make CERNET (CN) announce GÉANT prefixes.

Fig. 12 RTT between SG and JP during the restoration process.
5 and 6. AARNET NOC offered us the backup routes to access TEIN2. KOREN NOC worked very hard to keep communication open between CERNET and ASNET even during the holiday. Hawaii University NOC worked very hard to keep control of this complicated routing. KDDI investigated the reason for the communication failures in the APAN area and gave useful advice to APAN Tokyo XP. The staff of Genkai XP NOC accepted CERNET's traffic from KOREN by upgrading the JP-KR link bandwidth. Due to these great efforts and the collaboration among the network engineers, we were able to quickly restore the Asia Pacific R&E networks. This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment) (IITA-2005-(C1090-0502-0020)).

Table 2 Examples of the "useless routes."