The growing use of multiplier circuits in various applications has prompted researchers to develop high-speed and power-optimized multiplier architectures. Power management and high-speed design are therefore of the greatest concern today, especially in digital signal processors and other audio-video and image processing circuits that utilize artificial neural networks. This work focuses on providing efficient and optimized CMOS designs for 2-bit and 4-bit Wallace-Tree and Baugh-Wooley multiplier architectures for applications in advanced filter designs built from simple to complex adder circuits. The proposed architectures are verified experimentally using Mentor Graphics tools targeting 130 nm technology design parameters. The architectures are compared on parameters such as rise time, fall time, path delay, power dissipation, and silicon area. Both architectures display a few trade-offs between power dissipation and delay. The use of the proposed architectures in Multiply-Accumulate (MAC) units and other processing blocks is also analyzed for further implementations.
Multipliers play a vital role in the majority of processor-based digital systems, digital signal processing, digital communications, spectrum analysis, and digital filtering. Most high-performance systems, such as microprocessors and digital signal processors, include complex multiplier circuits that operate at the highest system clock rates. Over 70% of the instructions in various DSP algorithms and microprocessors involve adders, multipliers, and MAC units. The delays associated with these units are central to satisfying the overall design constraints. The demand for high-speed processing is therefore increasing rapidly as complex, high-speed multiplier architectures are used in more and more applications.
The main objective of Electronic Design Automation (EDA) is to automate the entire VLSI design process through a number of emerging EDA tools that convert HDL design representations into Application Specific Integrated Circuits (ASICs). Major EDA vendors such as Cadence Design Systems, Magma Design Automation, Xilinx, Mentor Graphics, and Synopsys continuously deliver high-end automation tools for designing electronic systems and integrated circuits. The proposed work uses Mentor Graphics tools with a 130 nm Complementary Metal Oxide Semiconductor (CMOS) technology to experimentally implement Baugh-Wooley and 4-bit Wallace-Tree multiplier architectures while satisfying various design constraints.
Power management is a predominant concern in various audio-video digital signal and image processing VLSI systems, and it depends on the computational speed and processing abilities of the multipliers employed. Amanollahi and Jaberipur (2017) proposed the design of sequential multipliers using radix-16 Carry-Save Adders (CSA) to generate the accumulated partial product. The evaluated results showed improved latency, power dissipation, and energy consumption at the cost of additional silicon area. Antelo et al. (2016) discussed optimized binary radix-16 (modified) Booth multipliers that allow optimization of the partial-product-array stage in terms of area, delay, and power. The method is extended to radix-8 multipliers, signed/unsigned multipliers, etc. Qiqieh et al. (2018) proposed a Significance-Driven Logic Compression (SDLC) approach based on bit significance to form a reduced set of partial product terms. Various parallel multiplication schemes are used to reorganize the designs. Anuar et al. (2010) proposed a two-phase clocked adiabatic static CMOS logic circuit based on the principles of adiabatic switching and energy recovery. This architecture can be used in low-power digital devices such as Radio-Frequency Identification (RFID) devices, smart cards, and sensors.
Tang et al. (2017) proposed a 4-bit bit-slice integer multiplier supporting both signed and unsigned multiplications. The reported results show the multiplier operating at a throughput of 3.125 × 10^9 multiplications per second. Moss et al. (2018) presented a radix-4 multiplier suitable for digital filters, artificial neural networks, and other machine learning algorithms. The design is evaluated on an Intel Cyclone-V Field-Programmable Gate Array (FPGA). The performance of the multiplier can be improved by the distribution of bit representations. Karthikeyan et al. (2018) proposed an implementation of a 32-bit Baugh-Wooley multiplier, built from half adders and full adders and evaluated in the Xilinx ISE 14.5 tool. The results showed a reduction in silicon area and transistor count. Yengade and Indurkar (2017) discussed the implementation of a 32-bit Baugh-Wooley multiplier. The design is evaluated using the Xilinx ISE 14.5 tool. The power consumption and delay obtained are reported to be superior to those of existing designs. Kumawat and Sujediya (2017) proposed a reconfigurable 8 × 8 Wallace-Tree multiplier using CMOS and GDI technology. The main idea of the implementation is to generate partial products in parallel using AND gates. Significant reductions in power consumption, silicon area, and transistor count are observed. Wallace-Tree based multipliers provide area-efficient strategies for high-speed multiplication.
Tiwari (2013) highlights the implementation and verification of a low-power multiplier and its analysis on reconfigurable devices. Mathew et al. (2013) presented the design and analysis of an array multiplier using an area-efficient full adder cell in CMOS technology. Alaoui (2011) experimented with the design and simulation of a modified architecture for a carry-save-adder based multiplier. Chandel et al. (2013) discussed the ease of performing multiplication with Booth's multiplier. Bewick (1994) and Kuang et al. (2009) discussed fast multiplication algorithms and their implementation.
As a fundamental arithmetic operation, multiplication has various algorithm-level and bit-level computational features that are not considered in low-level power optimizations. The efficiency of today's VLSI systems relies heavily on the performance of their multipliers and on the computational methodologies adopted. Different serial and parallel high-speed multiplier architectures are employed in a large number of computing applications, where the area of a multiplier grows quadratically with the operand precision. Parallel multipliers, on the other hand, have complex architectures with many logic levels that introduce spurious transitions (glitches). The structural complexity that parallel multipliers need to achieve high speed also degrades layout efficiency and hinders circuit-level optimization. It is therefore essential to develop algorithm-level and architecture-level power optimization techniques. To ensure fast multiplication, techniques such as the Shift and Add Multiplier, Combinational Multiplier, Array Multiplier, Carry Save Adder Multiplier, Booth's Multiplier, Modified Booth's Multiplier, Grid Multiplier, Lattice Multiplier, Vedic Multiplier, Wallace-Tree Multiplier, and Baugh-Wooley Multiplier are employed. Adder circuits and Multiply-Accumulate (MAC) units are the core design units in almost all multiplier circuits. Over the past few years, researchers have proposed various modifications of multiplier designs to reduce power consumption and area and to increase speed. A few of these multiplication algorithms are discussed below.
2's complement is the most widely used method of representing signed integers; a negative number is represented by the 2's complement of its magnitude. The Baugh-Wooley algorithm for 2's complement multiplication of signed numbers is well known for its ability to maximize the multiplier's regularity and to allow every partial product bit to carry a positive weight. The algorithm was developed to design direct multipliers for 2's complement numbers.
When 2's complement numbers are multiplied directly, all of the added partial products are signed numbers, and each of them is sign-extended to the width of the final product so that the Carry Save Adder (CSA) tree produces the correct sum. The Baugh-Wooley algorithm provides an efficient method of dealing with negatively weighted bits in the partial products by adding extra entries to the partial product matrix. The schematic diagram of the 2-bit Baugh-Wooley multiplier is depicted in Figures 1 to 3.
The schematic diagram and simulation setup of the 4-bit Baugh-Wooley multiplier architecture are depicted in Figure 4.
Output waveforms for the multiplication of two 4-bit numbers, a=4 and b=2, are shown in Figure 5. The figures illustrate the increase in hardware, including the number of adders used, as the bit width of the operands increases.
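The handling of negatively weighted bits described above can be illustrated with a short behavioral sketch in Python. It is not the transistor-level design evaluated in this work; it assumes the commonly used modified Baugh-Wooley formulation, in which the partial product bits involving exactly one sign bit are complemented and constant 1s are added at columns n and 2n-1. The function name `baugh_wooley_multiply` and its parameters are hypothetical and chosen only for illustration.

```python
def baugh_wooley_multiply(a: int, b: int, n: int = 4) -> int:
    """Behavioral model of an n-bit signed (2's complement) multiplier.

    Assumes the modified Baugh-Wooley scheme: every entry of the partial
    product matrix has positive weight, and the sign handling is folded
    into complemented bits plus two constant 1s.
    """
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError(f"operands must fit in {n}-bit 2's complement")

    mask = (1 << n) - 1
    a_bits = [((a & mask) >> i) & 1 for i in range(n)]
    b_bits = [((b & mask) >> j) & 1 for j in range(n)]

    total = 0
    for i in range(n):
        for j in range(n):
            bit = a_bits[i] & b_bits[j]
            # Terms with exactly one sign bit are negatively weighted in
            # the direct expansion; complementing them makes them positive.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)

    # Correction constants that compensate for the complemented terms.
    total += (1 << n) + (1 << (2 * n - 1))

    # Keep 2n bits and reinterpret the result as a signed number.
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

For the operands used in the simulation, `baugh_wooley_multiply(4, 2)` returns 8. The double loop makes the quadratic growth of the partial product matrix with operand width explicit, which is consistent with the increase in adder count noted above.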
Wallace-Tree multipliers are intended to handle multiplications of large operands by reducing the number of partial product bits quickly and efficiently using a carry save adder tree built from one-bit full adders. Each full adder acts as a pseudo-adder that takes three inputs and produces two outputs whose sum is equal to the sum of the three inputs. Several pseudo-adders operate concurrently in order to minimize carry propagation and to increase the speed of multiplication. The Wallace-Tree approach reduces the partial products with a large number of such compressors working in parallel; each level reduces the number of rows by a factor of about 3/2, so the resulting delay is proportional to log n for n rows. The entire process of multiplication is summarized below.
The schematic diagram of the 4-bit Wallace-Tree multiplier is depicted in Figure 6. The simulation setup for the 4-bit Wallace-Tree multiplier is depicted in Figure 7. Output waveforms for a=4, b=-2 are depicted in Figure 8.
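The carry-save reduction used by the Wallace-Tree multiplier can be sketched behaviorally in Python. The sketch below is a simplified illustration rather than the circuit of Figure 6: it assumes unsigned operands (unlike the signed case a=4, b=-2 simulated above), uses only full adders as 3:2 compressors without half adders, and relies on ordinary integer addition to stand in for the final carry-propagate adder. The names `full_adder` and `wallace_multiply` are hypothetical.

```python
def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compressor: three input bits in, (sum, carry) out."""
    s = x ^ y ^ z
    c = (x & y) | (y & z) | (x & z)
    return s, c


def wallace_multiply(a: int, b: int, n: int = 4) -> tuple[int, int]:
    """Unsigned n-bit multiply; returns (product, reduction stages)."""
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError(f"operands must be unsigned {n}-bit values")

    # Step 1: generate all partial product bits with AND gates and place
    # each bit in the column matching its weight 2**(i + j).
    columns = [[] for _ in range(2 * n + 1)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    # Step 2: compress every column in parallel, stage by stage, until
    # no column holds more than two bits.
    stages = 0
    while any(len(col) > 2 for col in columns):
        reduced = [[] for _ in range(len(columns) + 1)]
        for k, col in enumerate(columns):
            full = len(col) - len(col) % 3
            for t in range(0, full, 3):
                s, c = full_adder(*col[t:t + 3])
                reduced[k].append(s)       # sum stays in this column
                reduced[k + 1].append(c)   # carry moves one column up
            reduced[k].extend(col[full:])  # leftover bits pass through
        columns = reduced
        stages += 1

    # Step 3: the two remaining rows go to a carry-propagate adder,
    # modeled here by integer addition.
    row0 = sum(col[0] << k for k, col in enumerate(columns) if len(col) > 0)
    row1 = sum(col[1] << k for k, col in enumerate(columns) if len(col) > 1)
    return row0 + row1, stages
```

As a usage example, `wallace_multiply(4, 2)[0]` evaluates to 8. Because all columns are compressed within the same stage, no carry ripples across the word until the final addition, which is the property that gives the tree its logarithmic depth.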
A comparative analysis of the 4-bit Baugh-Wooley and Wallace-Tree multiplier architectures is presented in Table 1.
Table 1. Comparative Analysis of 4-bit Baugh-Wooley and Wallace-Tree Multiplier
Considering all the parameters, both designs are found to be competitive with each other. However, the observations reveal the technical superiority of the Baugh-Wooley multiplier over the Wallace-Tree multiplier in terms of lower delay values. The total power dissipated by the Baugh-Wooley circuit is higher than that of the Wallace-Tree multiplier. The rise time and fall time of the Baugh-Wooley multiplier are observed to be lower than those of the Wallace-Tree multiplier.
Based on calculations of the switching activity and the associated transistor overhead, the Wallace-Tree multiplier exhibits lower power dissipation than the Baugh-Wooley multiplier. The experimental analysis therefore concludes that Baugh-Wooley multipliers are preferred for less densely packaged circuit applications, where they produce results with reduced latency. Wallace-Tree multipliers, in contrast, are suitable for high-density packaging circuit applications, where they produce better results owing to their lower power dissipation.
The proposed architectures can be further modified to minimize the propagation delay. Although the proposed designs are implemented in 130 nm CMOS technology, they can be scaled down to smaller technology nodes to achieve higher performance. Although the designs are built from basic arithmetic blocks such as full adders and half adders, they can be modified to use Carry Save Adders, Ripple Carry Adders, etc. The circuits can also be implemented using adiabatic static logic. Performance can be studied by varying the widths of the transistors. Where speed of operation is critical, reducing the number of logic levels may decrease the delay and thereby increase the speed of operation.