Fault Tolerance and Reliability of Mesh Multi-computer Networks – A Review

Mostafa Abd-el-barr*
Professor, Department of Information Science, CFW, Kuwait University.
Periodicity:January - March'2010
DOI : https://doi.org/10.26634/jse.4.3.1112

Abstract

Multi-computer systems (MCSs) are efficient in solving computing-intensive problems. A MCS consists of a number of processing elements (PEs) and an interconnection network (IN). A fault in any PE and/or the IN can lead to losses in sensitive data and/or the overall throughput. In this paper, we provide a review of the fault tolerance and reliability assessing techniques of mesh MCSs. A number of fault-tolerant routing techniques are analyzed including the dimension ordering, the turn model, and the block-fault model. In reliability analysis, we consider the sub-mesh reliability exact and approximate models and the task-based reliability computation.  It is expected that some of the techniques and algorithms covered in this paper can have applications in the domain of wireless mesh networks.

Keywords

Mesh Multi-Computer Networks, Fault Tolerance, Fault-Tolerant Routing, Broadcasting in Mesh Networks, Communication in Faulty Mesh Networks, Reliability Analysis of Mesh Networks.

How to Cite this Article?

Mostafa Abd-el-barr (2010). Fault Tolerance and Reliability of Mesh Multi-computer Networks – A Review. i-manager’s Journal on Software Engineering, 4(3),7-17. https://doi.org/10.26634/jse.4.3.1112

References

[1]. Allen, F., Almasi, G., Andreoni, W., et al. (2001). Blue Gene: a vision for protein science using a Petaflop Supercomputer. IBM Systems Journal, Vol.40, No. 1, pp. 310-337.
[2]. Almohammad, B., and Bose, B. (1999). Fault-tolerant communication algorithms in toroidal networks. IEEE Transactions on Parallel and Distributed Systems, Vol. 10, No. 10, pp. 876-983.
[3]. Al-Tawil, K., Abd-El-Barr, M., and Ashraf, F. (1997). A Survey and Comparison of Wormhole Routing Techniques in Mesh Networks. IEEE Network, March/April, pp. 38-45.
[4]. Ashraf, F., Abd-El-Barr, M., and Al-Tawil, K. (1998). Introduction to Routing in Multicomputer Networks. ACM SIGARCH, Vol. 26, No. 5, pp. 14-21.
[5]. Bataineh, S., Odet-Allah, A., and Al-Omari, R. (1998). Reliability of mesh and torus topologies in the presence of faults. Journal of Telecommunication Systems, 10(3-4), pp. 389-408.
[6]. Boppana, R. and Chalasani, S. (1995). Fault-tolerant wormhole routing algorithms for mesh networks. IEEE Transactions on Computers, Vol. 44, No. 7, July 1995, pp. 848-864.
[7]. Chalasani, S., and Boppana, R. (1997). Communication in multicomputers with nonconvex faults. IEEE Transactions on Computers, Vol. 46, No. 5, May, pp. 616-622.
[8]. Chang, C.-Y., and Mohapatra, P. (1998). An Efficient Method for Approximating Submesh Reliability of Two-Dimensional Meshes. IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 11, November 1998, pp. 1115-1124.
[9].Chen, C-L., Chiu, G-M. (2001). A fault-tolerant routing scheme for meshes with nonconvex faults. IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 5, May 2001, pp. 467-475.
[10]. Chen, J., and Wang, T. (2007). Probabilistic Analysis on Mesh Network Fault Tolerance. Journal of Parallel and Distributed Computing, Vol. 67, No. 1, January 2007, pp. 100-110.
[11]. Chirivlla, V., Alcover, R., and Duato, J. (2001). Accurate reliability and availability models for direct interconnection networks. Proceedings International Conference on Parallel Processing, September 2001, pp. 517-524.
[12]. Chiu, G-M. (2000). The odd-even turn model for adaptive routing. IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 7, July 2000, pp. 729-738.
[13]. Cray Research Inc., The Cray T3E scalable parallel processing system, http://www.cray.com/PUBLIC/product-info/T3E/CRAY.T3E.html
[14]. Cray Research Incoporation CRAY T3D hardware reference manual, October 1993. Also, Karamcheti, V. and Chien, A. (1995). A Comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D. Proceedings of the ISCA-95, June 1995, pp. 1-10.
[15]. Dally, W. and Towels, B. (2004). Principles and Practices of Interconnection Networks. Elsevier publications, New York.
[16]. De Mello, A., et al. (2004). Evaluation of Routing Algorithms on Mesh Based NoCs. Technical Report Series, No. 40, May 2004, pp. 1-11.
[17]. Duato, J. (1994). A Theory of Fault-Tolerant Routing in Wormhole Networks. Proceedings of the International Conference on Parallel and Distributed Systems, December 1994, pp. 600-607.
[18]. Fahmy, H. and El-Hefnawy, A. (1999). On the Exact Reliability Evaluation of Mesh-Connected Processors. Proceedings of the International conference on Electronics, Circuits, and Systems (ICECS), 1999, pp. 275-278.
[19]. Farahabady, M., Safaei, F., Khonsari, A., and Fathy, M. (2006). Characterization of Spatial Fault Patterns in Interconnection Networks. Journal of Parallel Computing, 32(11-12), December 2006, pp. 886-901.
[20]. Fillo, M., et. al. (1997). The M-Machine Multicomputer. International Journal of Parallel Programming – Special Issue on Instruction-Level Parallel Processing Part II, 25(3), 1997, pp. 183-212.
[21]. Gaugh, P. and Yalamancili, Y. (1993). Adaptive Routing Protocols for Hypercube Interconnection Networks. IEEEE Computer, Mar. 1993, pp. 12-23.
[22]. Glass, C., and Ni, L. (1992). The turn model for adaptive routing. Proceedings of the 9th annual International Symposium on Computer Architecture, 1992, pp. 278-287.
[23]. Gomez, M., Nordbotten, N., Flich, J., Lopez, P., Robles, A., Duato, J., et al. (2006). A Routing Methodology for Achieving Fault Tolerance in Direct Networks. IEEE Transactions on Computers, Vol. 55, No. 4, April 2006, pp. 400-415.
[24]. Groscup, W. (1993). The Intel XP/S Supercomputer. Proceeedings of the 5th ECMWF Workshop on the Use of Parallel Processing in Meteorology, Singapore World Scientific, 1993, pp. 173-187.
[25]. Ho, C.-T., and Stockmeyer, L. (2004). A New Approach to Fault-Tolerant Wormhole routing for mesh-connected parallel computers. IEEE Transactions on Computers, Vol. 53, No. 4, April 2004, pp. 427-438.
[26]. Holland, G. and Pradhan, D. (1994). Fault-Tolerant Multiprocessor Systems. Report #94-05, Reliable Computer Tech., September 1994, pp. 1-26.
[27]. Intel Corporation (1991). A Touchtone DELTA system description.
[28]. Jiang, Z., and Wu, J. (2003). A fault-tolerant broadcasting in 2D wormhole-routed meshes. Journal of Supercomputing, Vol. 25, No. 3, July 2003, pp. 225-275.
[29]. Kazmierczak, A. (2007). A New Mesh Recognition Algorithm”, Journal of Computer Science, Vol. 1, No. 1, 2007, pp. 1-13.
[30]. Lenoski, D., et al. (1992). The Stanford DASH multicomputer. IEEE Computer, Vol. 25, No. 3, 1992, pp. 63-79.
[31]. Mahapatra, N., and Dutt. S. (2001). Hardware-Efficient and Highly Reconfigurable 4- and 2-Track Fault-Tolerant Designs for Mesh-Connected Arrays. Journal of Parallel and Distributed Computing, Vol. 61, No. 10, October 2001, pp. 1391-1411.
[32]. Mohapatra, P. (1998). Wormhole Routing techniques for Directly Connected Multicomputer Systems. ACM Computing Surveys, Vol. 30, No. 3, September 1998, pp. 374-410.
[33]. Mohapatra, P. and Das, C. (1995). On Dependability Evaluation of Mesh-Connected Processors. IEEE Transactions on Computers, Vol. 44, No. 9, September 1995, pp. 1073-1084.
[34]. MP-1 family data-parallel computers. MasPar Computer Corporation, (1987). 749 North Mary Av., Sunnyvale, CA.
[35]. Najaf-Abadi, H., Sarbazi-azad, H., and Rajabzadeh, P. (2004). Performance modeling of fully adaptive wormhole routing in 2D mesh-connected multiprocessors. Proceedings, 12th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, October 2004, pp. 528-534.
[36]. Ni, L. and McKinley, P. (1993). A survey of wormhole routing techniques in direct networks. IEEE Computer Magazine, Vol. 26, No. 2, February 1993, pp. 62-76.
[37]. Noakes, M. and Dally, W. (1990). System design of the J-machine. Proc. 6th MIT Conference on Advanced research in VLSI, 1990, pp. 179-194.
[38]. Peterson, C., et. al. (1991). iWRAP: a 100-MPOS VLIW microprocessor for multicomputer. IEEE Micro, Vol. 11, No. 13, May 1991, pp. 81-87.
[39]. Pyo, K., and Han, T. (1996). Fault-tolerant wormhole routing in 2D mesh with overlapped solid fault regions. Technical Report, CS/TR-96-101, December 16, 1996, pp. 1-27.
[40]. Rezazad, M. and Sarbazi-azad, H. (2005). The effect of virtual channel organization on the performance of interconnection networks. The 19th IEEE Symposium on Parallel and Distributed Processing, April 2005, 8 pages.
[41]. Sarbazi-Azad, H., Ould-Khaoua, M., and Zomaya, A. (2006). Performance Evaluation of communication networks for parallel and distribute systems. Journal of Parallel Computing, Vol. 32, No. 12, December 2006, 775-776.
[42]. Shih, J-D. (1997). Adaptive Fault-Tolerant Wormhole Routing Algorithms for Hypercube and Mesh Interconnection Networks. Proceedings 11th International Symposium on Parallel Processing, April 1997, pp. 333-340.
[43]. Su, C-C., and Shin, K. (1996). Adaptive fault-tolerant deadlock-free routing in meshes and hypercubes. IEEE Transactions on Computers, Vol. 44, No. 6, June 1996, pp. 666-683.
[44]. Suh, Y., and Yalmanchili, y. (1998). All-to-all communication with minimum start up costs in 2D/3D Tori and meshes. IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 5, 1998, pp. 442-458.
[45]. Sui, P-H. and Wang, S-D. (1997). An improved algorithm for fault-tolerant wormhole routing in meshes. IEEE Transactions on Computers, Vol. 46, No. 9, September 1997, pp. 1040-1042.
[46]. Theiss, I. (2004). Modularity, Routing, and Fault Tolerance in Interconnection Networks. Doctoral Dissertation, University of Oslo, Feb. 2004.
[47]. Tsai, M. (2004). One-Staged Wormhole Routing for Irregular faulty patterns in Meshes. Proc. Of the 10th International Conference on Parallel and Distributed Systems (ICPADS'04), July 2004, pp. 569-576.
[48]. Wang, D. (2007). A heuristic fault-tolerant routing algorithm in mesh using rectilinear-monotone polygonal fault blocks. Journal of Systems Architecture: the EUROMICRO Journal, Vol. 53, No. 9, September 2007, pp. 619-628.
[49]. Wu, J. (2000). Fault-Tolerant Adaptive and Minimal Routing in Mesh-Connected Multi-computers Using Extended Safety Levels. IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 2, February 2000, pp. 149-159.
[50]. Wu, J. (2003). A fault-tolerant and deadlock-free routing protocol in 2D meshes based on odd-even turn model. IEEE Transactions on Computers, Vol. 52, No. 9, September 2003, pp. 1154-1169.
[51]. Wu, J. (2003). A Simple Fault-Tolerant Adaptive and Minimal Routing Approach in 3-D meshes. Journal of Computer Science and Technology, Vol. 18, No. , Jan. 2003, pp. 1-13.
[52]. Wu, J., and Jiang, Z. (2002). Extended Minimal Routing in 2-D Meshes with Faulty Blocks. Proc. 2nd International Conference on Distributed Computing Systems Workshops (ICDCSW-02), 2002, pp. 49-54.
[53]. Wu, J. and Jiang, Z. (2005). On Constructing the Minimum Orthogonal Convex Polygon for the Fault-Tolerant Routing in 2-D Faulty Meshes. IEEE Transactions on Reliability, Vol. 5, No. 3, Sept. 2005, pp. 449-458.
[54]. Wu, J., and Wang, D. (2005). Fault-Tolerant and deadlock-free routing in 2D meshes using rectilinear-monotone polygonal fault blocks. The International Journal of Parallel, Emergent and Distributed Systems, Vol. 20, No. 2, June 2005, pp. 99-111.
[55]. Xie, Y., Li, Z., and Li, F. (2005). Hexagon Mesh Interconnection Networks. Proceedings SPIE, Feb. 2005, pp. 991-998.
[56] Yoo, B., and Das, C. (2002). A fast and efficient processor allocation scheme for mesh-connected multicomputers. IEEE Transactions on Computers, Vol. 51, No. 1, 2002, pp. 46-60.
[57]. Zhou, J., and Lau, F. (2004). Multi-phase minimal fault-tolerant wormhole routing in meshes. Journal of Parallel Computing, Vol. 30, No. 3, March 2004, pp. 423-442.
[58]. Zorpette, G. (1990). The Power of Parallelism. IEEE Spectrum, September 1990, pp. 28-33.
If you have access to this article please login to view the article or kindly login to purchase the article

Purchase Instant Access

Single Article

North Americas,UK,
Middle East,Europe
India Rest of world
USD EUR INR USD-ROW
Pdf 35 35 200 20
Online 35 35 200 15
Pdf & Online 35 35 400 25

Options for accessing this content:
  • If you would like institutional access to this content, please recommend the title to your librarian.
    Library Recommendation Form
  • If you already have i-manager's user account: Login above and proceed to purchase the article.
  • New Users: Please register, then proceed to purchase the article.