Abstract

Network-on-Chip (NoC) is a scalable and flexible communication medium for the design of multi-core based Systemon- Chip (SoC). Communication performance of NoC depends heavily on efficient routing algorithms. Dynamic routing is desirable because of its substantial improvement in communication bandwidth and intelligent adaptation to faulty links and congested traffic. In this paper, the authors propose a fault tolerant Dynamic Adaptive Routing in Networks-on- Chip (NoCs). This algorithm can be implemented on routing to detect transient and permanent faults in the network. That means the packet is able to move around the faults to their destination with a non-minimum path. In addition, the multilevel congestion control mechanism gives the ability to distribute the load and to avoid the faults. Because this procedure is based on adaptive routing, the hardware overhead and computational times are minimal. Experimental results based on an actual Verilog implementation demonstrate that the proposed dynamic adaptive routing algorithm improves the network throughput significantly compared to traditional algorithms.

Network–on-Chip (NoC) is a communication subsystem on an integrated circuit. It is typically a chip between IP cores in a System on Chip (SoC). NoC technology applies networking theory and methods to on - chip communication and brings notable improvements over conventional bus and crossbar interconnections [1]. Two kinds of faults (permanent and transient) need to be addressed in NoC architectures. There are two methods to cope with transient and permanent faults in NoC. Flow-control- based methods (e.g., [2] and [3] ), combine the error control code with the retransmission mechanism to tolerate transient faults occurring during transmission. The other is fault tolerant routing algorithm to route packets around permanently faulty routers or links to achieve fault- tolerance.

A good fault-tolerant routing algorithm should ensure “0 lost packets" whatever fault patterns might be, as long as a path exists. However, many of them (e.g., [4] and [5]) can only improve the successful arrival rate of the packet. Recently, bufferless router has been studied in NoC to achieve higher speed and lower cost than a wormhole or virtual channel router [6]–[8]. Excepting that there is one input register for each input port, there are no other buffers in the bufferless router. The fault diagnosis mechanism uses the Single Error Correcting and Double Error Detecting (SECDED) Hamming code [9] to detect both transient and permanent link faults. A hybrid Automatic Repeat Request (ARQ) and Forward Error Correction (FEC) link-level error control scheme using retransmission is proposed to handle transient faults.

In reference [8], a reconfigurable fault-tolerant deflection routing algorithm has been proposed to detect both transient and permanent fault. The disadvantage of this algorithm is that the faulty link has to be shut down bidirectionally, while transient faults last for only one cycle. In this paper, the authors propose a Fault-tolerant Dynamic Adaptive Routing Algorithm to detect transient and permanent faults in the network. Because this procedure is based on the adaptive routing, the hardware overhead and computational times are minimal. Using this algorithm the packets are able to move around the faults, and are able to reach the destination with non-minimum path.

The rest of this paper is organized as follows. Related works are reviewed in Section 1. Section 2 describes the fault model and fault detection schemes. The authors presented the experimental results in Section 3 which is followed by conclusion.

Several previous works were focused upon to detect fault Model in NoC. Y.H. Kang, T.J. Kwon and J. Draper [2] proposed a fault-tolerant flow control scheme for soft error handling in on-chip interconnection networks. The importance of this study is that it presents and evaluates a method for protecting packets at link-level for both data path and control planes with little hardware overhead, rather than relying on end-to-end recovery.

T. Moscibroda and O. Mutlu [6] proposed the elimination of buffers in the design of on chip cache-to-cache and cache-to-memory networks to improve both energy and area efficiency, as well as reduce network design complexity and router latency. The basic idea of “bufferless routing” is to always route a packet or flit to an output port regardless of whether or not that output port results in the lowest distance to the destination of the packet.

C. Feng, Z. Lu, A. Jantsch, J. Li, and M. Zhang [8] proposed a fault tolerant deflection routing algorithm to detect both transient and permanent fault in NoC and also reduce the area overhead of the router.

H. Zimmer and A. Jantsch [9] proposed a fault model notation and error-control scheme for switch-switch buses in Network-on-Chip that employs switching in a n×m mesh of switches. These switches must be as small and efficient as possible in order to limit the overhead interms of area ,power and delay in NoC.

For detecting permanent errors in NoC, Lehtonen et al. [12] proposed an in-line test method to test each adjacent pair of wires and a syndrome storing-based error detection method based on evaluation of consecutive code syndromes at the receiver.

Q.Yu and P. Ampadu [13] proposed transient and permanent error co-management method for NoC links. However, the Error Control Coding (ECC) scheme which is selected based on the noise condition can only detect transient errors.

Transient faults in NoC can be handled at both link-level and transport level by three schemes: ARQ, FEC, and hybrid ARQ/FEC. A simple and cost-effective end-to-end packet-based communication protocol to address transient faults for NoC has been proposed by M. Ali, M. Welzl, and S. Hessler [14]. Go-back-n retransmission policy is used, which means the receiver generates a single acknowledgement for a pre-defined number of packets that it receives. A link-level ARQ scheme, which is called hop-by-hop [12],utilizes a three-flit-deep retransmission buffer per virtual channel to handle transient errors in the virtual channel router. In [4], the authors have compared the end-to-end flow control with the link-level flow control in terms of latency and power consumption.

Kohler et al. [10] have proposed a link-level ARQ mechanism that uses only one additional retransmission register to handle both router-internal transient faults and transient link faults in deflection router. However, because of only one retransmission buffer, packets will be lost if the second transient link fault occurs one cycle after the first one, or the transient link fault lasts for two cycles.

Two kinds of fault-tolerant routing, which are known as stochastic and deterministic, have been proposed for NoC to handle permanent faults. Stochastic communication transfers redundant packets through different paths to avoid faults. Inorder to limit the number of packet copies, redundant random walk [15], [16] only replicates packets at the source node. Although the stochastic communication can be highly fault-tolerant, it wastes much bandwidth to transfer the redundant packets, which reduces the throughput of the network significantly. Depending on the shape of the fault region, deterministic fault-tolerant routing algorithms can be categorized into two classes: one that can handle regular fault regions (e.g., convex and concave shapes) and the other, which is also known as topology-agnostic, can handle irregular fault regions. A reconfigurable routing algorithm is proposed to route packets surrounding a faulty router in a 2-D mesh NoC [17]. But it can be only used in one faulty router topology and extended to only one faulty region topology. A resilient routing algorithm for fault-tolerant NoC based on turn model is described in [18] . However ,the routing table of the router is reconfigured to avoid faulty components in an offline process. Although the regular nature of a faulty region can be facilitated to route packets around it, it is not practical to assume the shape of the faulty region since the occurrence of the fault is not predictable.

Both Transient and Permanent fault models in NoC are detected by using following steps as shown in Figure1. Here dynamic adaptive routing algorithm is used to detect the fault and improve the network throughput.

The NoC architecture is based on a 2-D mesh topology, Nostrum NoC. Each processing element is attached to a router (R). The boundary output is connected to the input of the same router. All incoming packets are prioritized according to their hop counts, which record the number of hops that the packet has been routed. The router makes routing decision for each arriving packet from the highest priority to the lowest. If a desired output port has already been occupied by a higher priority packet, a free port with the smallest stress value will be chosen, which means the packet has to be deflected. The stress value, which can be used to balance traffic load, is the number of packets processed by neighboring routers in the last four cycles.

The FEC is used to perform error control to tolerate transient faults during packet transmission. In the case of a single-bit error in any part of the packet, the error can be corrected after the packet has been decoded. If any part of the packet contains a two-bit error for one cycle, the router which receives the packet will require the router, which sends the packet, to retransmit the packet.

The ARQ is proposed to handle Transient Faults. Go-back-n retransmission policy is used, which means the receiver generates a single acknowledgement for a predefined number of packets that it receives. Due to the lack of buffers in the deflection router, implementing link-level error control scheme on it is different from the wormhole/virtual channel router with more input buffers. A link-level ARQ mechanism that uses only one additional retransmission register should be used to handle both router-internal transient faults and transient link faults in deflection router. However, because of only one retransmission buffer, packets will be lost if the second transient link fault occurs one cycle after the first one or the transient link fault lasts for two cycles.

SECDED Hamming code, which can correct single error and detect double errors, is used to encode the packet to perform fault diagnosis. To make a compromise among performance, area and power consumption, we compare two ECC strategies:

For the first encoding strategy, eight parity bits are used to encode the 114 bits packet into 122 bits. It can only correct the one-bit error and detect the two-bit error in the packet. The second strategy divides the packet into seven parts: two for head and five for payload, encoded with Hamming (23, 17) and Hamming (22, 16) codes, respectively. The encoded packet length has 156 bits with 42 parity bits. It can correct seven simultaneous single-bit errors in each part and detect at most 14 error bits in the case of a two-bit error in each part. If any of the seven parts contains a two-bit error for one cycle, it will lead to a retransmission. If the retransmitted packet has the same two-bit error in any part, the link will be tested to check if it is considered as a permanent faulty link. The input register is followed by the decoder, so the ECC mechanism can detect and correct both link and input register errors.

This shows the latency, area, and power comparison of encoders and decoders for Hamming (122, 114) and (22, 16) codes, respectively (developed in VHDL and synthesized with TSMC 65-nm technology). The results for Hamming (23, 17) code are similar as Hamming (22, 16) code. As the table illustrates, the latency, area, and power consumption of the encoder and decoder for Hamming (122, 114) code are much larger than those of Hamming (22, 16) code. Partitioning the packet into smaller blocks and encoding separately is clearly a better strategy for improving performance and lowering area and power consumption. From the synthesis results, the number of logic levels for the encoding and decoding circuits of H(122, 144) code are 14 and 16 respectively, while the number of logic levels for the encoding and decoding circuits of H(22, 16) are 7 and 10, respectively. Using smaller blocks can reduce the number of logic levels for the encoder and decoder. In addition, different parts of packet in the ECC strategy 2) are encoded with interleaving, which can also protect against burst errors. To correct multiple error bits, the Bose, Chaudhuri and Hocquenghem (BCH) code be constructed. However, as the number of bits to be corrected increases, the hardware complexity and decoding time will increase significantly. As illustrated, the hardware overhead for BCH code is much larger than that of the Hamming code with interleaving.

Dynamic adaptive Routing describes the capability of a system, through which routes are characterized by their destination, to alter the path that the route takes through the system in response to a change in conditions. The adaptation is intended to allow as many routes as possible to remain valid (that is, have destinations that can be reached) in response to the change.

Consider a node–table-routing architecture in which the routing table is stored at each router. The destination of the header flit will be checked, and it will decide the routing direction based on the routing-table entries. In contrast to the table-based routing in which a routing algorithm computes the route or next hop of a packet at runtime, algorithmic routing is more restrictive to simple routing algorithms and can only be applied on regular topologies, such as a mesh topology. The routing-table approach enables the use of per-hop network state information, such as queue lengths, to select among several possible next hops at each stage of the route. At each node unit, there are k inputs from the k neighbor nodes for the expected costs. The output of the unit at node ni is the updated expected cost V (i, j) and is sent to all adjacent nodes. For each destination j and direction k, the expected cost will be computed, and the minimum cost will be selected. The optimal direction for routing is selected and used to update the routing table, as stated. Although the algorithm consists of two for loops, this can be realized in a hardware with a parallel architecture, and the computational-delay complexity can be reduced to linear. In adaptive routing, the routing table is updated dynamically or periodically, such that the communication traffic can be altered subject to the choice of switching mechanisms.

In performance analysis, Dynamic router under transient faults of the proposed method is compared with the existing method.

In this paper, the performance of the Fault-tolerant Dynamic adaptive routing algorithm is illustrated in 8×8 2-D mesh simulator in VHDL. The routers are compared in terms of throughput and average hop count in the presence of both permanent and transient faults. The result of the proposed technique is shown Figure 2 and 3.

The energy level and the throughput level for transient and permanent fault model are improved significantly using dynamic adaptive routing algorithm.

SECDED test can detect both transient and permanent fault and corrects the faulty bits depending upon the fault over the link. The faulty nodes get corrected and the packets are transmitted from source to destination across the network. The Figure 4 shows the output for SECDED Test.

This paper presents a fault tolerant dynamic adaptive routing algorithm as used in buffer less NoC to protect it from transient and permanent fault on the links. In addition, the multi-level congestion control mechanism gives the ability to distribute the load and to avoid the faults. Because this procedure is based on the adaptive routing, the hardware overhead and computational times are minimal. Implementation demonstrate that the proposed dynamic adaptive routing algorithm improves significantly the network throughput compared to traditional algorithms.

[4]. S. Pasricha, Y. Zou, D. Connors, and H. J. Siegel, (2010). “OE+IOE: A novel turn Model Based Fault Tolerant Routing Scheme For Networks-on-Chip,” In Proceedings of 8^th IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign System Synthesis, pp. 85–94.

[5]. A. Patooghy and S. G. Miremadi, (2009). “XYX: A Power and Performance Efficient Fault-Tolerant Routing Algorithm for Network on Chip,” In Proceedings of 17^th Euromicro International Parallel, Distribution on Network - Based Process Conference, pp. 245–251.

[8]. C. Feng, Z. Lu, A. Jantsch, J. Li, and M. Zhang, (2010). “A Reconfigurable Fault Tolerant Deflection Routing Algorithm Based on Reinforcement Learning for Networkrd on-Chip,” In Proceedings of 3 international Workshop Network on Chip Architecture, pp.11–16.

[9]. H. Zimmer and A. Jantsch, (2003). “A Fault Model Notation and Error-Control Scheme for Switch-to-Switch st Buses in a Network-on-Chip In Proceedings of 1 IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign System Synthesis, pp. 188–193.

[10]. A. Kohler, G. Schley, and M. Radetzki, (2010). “Fault Tolerant Network on Chip Switching with Graceful Performance Degradation,” IEEE Transaction on Computer Aided Design Integration on Circuits Systems, Vol.29(6), pp. 883–896.

[11]. C. Grecu, A. Ivanov, R. Saleh, E. S. Sogomonyan, and P. P. Pande, (2006). “Online Fault Detection and th Location for NoC Interconnects,” In Proceedings of 12 IEEE International Conference Online Test Symposiam, pp. 145–150.

[12]. T. Lehtonen, D. Wolpert, P. Liljeberg, J. Plosila, and P. Ampadu, (2010). “Self Adaptive System for Addressing Permanent Errors in on-Chip Interconnects,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol.18( 4), pp.527–540.

[16]. M. Pirretti, G. M. Link, R. R. Brooks, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, (2004). “Fault Tolerant Algorithms for Network-on-Chip Interconnect,” In Proceedings of IEEE Computer Socitey Annual Symposiyam on VLSI, pp. 46–51.

Fault-Tolerant Dynamic Adaptive Routing in NoC

Abstract

Keywords :

Introduction

1. Related Works

1.1 Fault Detection Mechanisms for NoC

1.2 Techniques to Handle Transient and Permanent Faults for NoC

2. Methodology

2.1 NoC Model

2.2 Forward Error Correction

2.3 Automatic Repeat Request

2.4 Link Level Fault Detection

2.5 Dynamic Adaptive Routing

2.6 Performance Analysis

3. Experimental Results

Conclusion

References