# Implementation of Power Gated Design and Leakage Optimization Techniques

### Vikram Kolanu<sup>1</sup>, Dr. Fazal Noor Basha<sup>2</sup>

<sup>1</sup> M-Tech VLSI, Dept. of Electronics & Communication, KL University, AP-INDIA.

Email: vikramkolanu@hotmail.com

<sup>2</sup> Associate Professor, Dept. of Electronics & Communication, KL University, AP-INDIA.

Email: fazalnoorbasha@kluniversity.in

#### **Abstract**

The objective of this work is to implement a complete place and route (PNR) flow for a given design in 40nm technology, incorporating the leakage optimization technique best suited to the design block and the instance count in the block. The optimized flow methodology is arrived at by analyzing the QoR (Quality of Results) of multiple experimental runs with the different methods employed to reduce leakage power, with the aim of reducing power consumption while achieving the desired timing closure. The work comprises two flows, PNR and ECO (Engineering Change Order). In the PNR flow, multiple experimental runs are conducted with different floorplans, and the best of them is picked for the final ECO flow.

**Keywords:** ECO, Power optimization, Leakage optimization, Gated Design.

#### Introduction

As processes shrink to nanometer technology nodes, both dynamic power and static (leakage) power have become significant design issues. Power optimization requires contributions from many fields together: software policy, OS, system architecture, logic design, physical implementation, IP/library support, process technology, and so on, as shown in Fig. 1. In general, setting up the power saving plan at a higher level or an earlier stage offers more opportunity to save power [1].



Figure 1: Power management

Dynamic power is generally proportional to frequency, switching activity, capacitance, and the square of the supply voltage. The most effective way to reduce dynamic power is to reduce the supply voltage, because of the quadratic dependence of power on voltage. In recent years, several techniques have been developed to take advantage of lower voltage for power reduction, including power shut-off (PSO), multiple supply voltage (MSV), dynamic voltage scaling (DVS), and adaptive voltage scaling (AVS). Reducing frequency and switching activity can also save significant dynamic power. A module may lower its frequency for periods when higher performance is not required; this technique is usually employed together with voltage scaling to optimize the tradeoff between frequency and power. For switching activity reduction, RTL clock gating and architectural clock gating are widely used to restrict the distribution of clocks to those portions of the design that are actually active at that time. Total capacitance can be reduced through process shrinking and good physical implementation, such as gate sizing and wire length reduction.
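The proportionality described above is commonly written as the standard CMOS switching power relation. The symbols are introduced here for illustration and do not appear in the original text: $\alpha$ is the switching activity, $C_L$ the switched capacitance, $V_{DD}$ the supply voltage, and $f$ the clock frequency.

$$
P_{\text{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f
\qquad\Rightarrow\qquad
\frac{P_{\text{dyn}}(0.9\,V_{DD})}{P_{\text{dyn}}(V_{DD})} = 0.9^{2} = 0.81
$$

Under this relation, and with activity, capacitance, and frequency held constant, a 10% reduction in supply voltage gives roughly a 19% reduction in dynamic power, which is why voltage scaling is singled out as the most effective lever.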

Leakage power increases dramatically as the device feature size shrinks, and designers have to put considerable effort into leakage power reduction to extend stand-by time. Several techniques, such as multi-Vt optimization, back biasing, and power gating, have been developed for leakage power reduction over the years. Power gating has become very popular in recent years [2]-[5]. It turns off blocks that are not being used, through voltage regulators or power switch cells. As shown in Fig. 2, power gating can be implemented through off-chip control or on-chip control. Off-chip power gating turns off the power sources supplied to specific power domains of a chip by means of a voltage regulator on the board. This approach is suitable for long-term power shut-off, because it may take a long time to restore power to the gated blocks.

On-chip power gating can turn off the power sources through switchable power pads or power switch cells. Turning off a power source with switchable power pads is quite simple, but it needs extra IO space allocated for the switchable pads and is not suitable for pad-limited designs. In addition, it offers little flexibility to control the power-up ramp time and rush current of the turned-off blocks. For power gating through power switch cells, MTCMOS is a good and widely used solution, in which sleep transistors, usually high-Vt transistors, are controlled by a power management unit to switch off the power supplies to the gated blocks [2]. When the gated blocks are turned off in standby mode, only a small amount of leakage power is consumed, because the blocks are gated by high-Vt sleep transistors. The implementation and verification of MTCMOS power gating are more complicated than those of power gating with switchable power pads. Several effects, such as power-up sequence, rush current, ramp-up time, and dynamic IR drop, should be taken into account in the analysis [3].

The verification of a low power design is a major challenge. For example, PSO and MSV may fail if there are structural errors such as a missing isolation cell or level shifter, incorrect propagation of the sleep control, or an incorrect power domain connection. Comprehensive low power verification should include the decision making and design quality checks prior to logic implementation. In addition, it should include electrical implementation checks, power-aware formal verification, functional correctness of the sleep control, timing closure across multiple corners and multiple modes, IR and EM analysis, etc.

The verification of dynamic IR drop becomes increasingly important for designs at 90nm and below, because simultaneous switching currents may induce peak IR drop in the power/ground network and lower the voltage supplied to logic gates, which may reduce logic gate noise margins and result in functional or timing failures. Dynamic IR drop depends mainly on the peak current of signal switching. The best approach to the analysis is to simulate the design with peak power simulation patterns and then identify the hot spots. However, it is impractical to obtain peak power simulation patterns at an early stage, because preparing and verifying them takes a long time. A vector-less heuristic approach has been proposed to solve this issue, but experimental results show that it cannot identify the hot spots correctly and its accuracy still needs to improve. In addition to verification, dynamic IR drop prevention and fixing is another important topic to be resolved.

## **VT Swap Experiment**

#### RVT cells only without leakage optimization flow

In this flow, the initial netlist contains only RVT cells, and the place and route flow is run without using LVT cells. At every optimization stage of the flow only RVT and HVT cells are used; the use of LVT cells is restricted by constraints. This flow is treated as the default flow against which the performance, power, and area of the other optimization flows are compared.



Figure 2: RVT cells without leakage optimization flow

After the place and route flow, cells on non-timing-critical paths are swapped from RVT to HVT. This reduces the percentage of RVT cells in the design, and the higher threshold voltage of the HVT cells reduces leakage power. With this optimization flow, the number of cells added during the optimization stages was observed to stay low, timing issues were minimal, and area utilization was good.
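The post-route swap described above can be sketched as a simple slack-driven filter. This is a minimal illustration, not the tool flow used in the experiments: the `Cell` record, the `swap_penalty_ps` field, and the 50 ps margin are all hypothetical, and each cell's slack is treated independently, whereas a real place and route tool re-times the design after every swap.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    vt: str                  # "RVT" or "HVT"
    slack_ps: float          # worst slack through the cell (hypothetical field)
    swap_penalty_ps: float   # assumed extra delay if the cell becomes HVT


def swap_noncritical_to_hvt(cells, margin_ps=50.0):
    """Swap RVT cells to HVT only where slack stays above the margin."""
    swapped = []
    for cell in cells:
        if cell.vt != "RVT":
            continue  # already HVT, nothing to do
        slack_after = cell.slack_ps - cell.swap_penalty_ps
        if slack_after >= margin_ps:
            cell.vt = "HVT"
            cell.slack_ps = slack_after
            swapped.append(cell.name)
    return swapped


# Made-up example data: u1 has plenty of slack, u2 is near-critical.
cells = [
    Cell("u1", "RVT", 200.0, 30.0),
    Cell("u2", "RVT", 60.0, 30.0),
    Cell("u3", "HVT", 300.0, 0.0),
]
swapped = swap_noncritical_to_hvt(cells)
hvt_pct = 100.0 * sum(c.vt == "HVT" for c in cells) / len(cells)
print(swapped, round(hvt_pct, 1))
```

Keeping a slack margin is one way to avoid turning a non-critical path into a critical one; the margin value itself would have to be chosen per design.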

#### RVT cells with leakage optimization flow

In this flow, leakage optimization is enabled during the optimization stages, so HVT cells are added on the non-critical timing paths. Multi-Vt libraries are used during all the optimization stages, with constraints imposed on LVT cell usage.

This optimization methodology results in a higher percentage of HVT cells and lower leakage power, at the cost of increased area utilization. It was also observed that this flow results in slightly degraded timing paths.



Figure 3: RVT cells with leakage optimization flow

#### HVT cells without leakage optimization flow

The synthesized netlist contains only RVT cells. In this flow methodology, the initial netlist is modified so that all the RVT cells are changed to HVT cells, and it is then passed to the place and route stages. Leakage power optimization is not enabled in this flow.



Figure 4: HVT cells without leakage optimization flow

This flow resulted in a congestion-critical and area-critical design, because only HVT cells were used. These cells degraded timing considerably, so the tool added a large number of inverters and buffers during the optimization stages, resulting in a huge increase in the total instance count of the design. No significant power savings were seen in this case.

#### HVT cells with leakage optimization flow

The netlist used for this flow consists of only HVT cells, and leakage optimization is enabled during all the optimization stages.



Figure 5: HVT cells with leakage optimization flow

This method resulted in better power savings than the HVT flow without leakage optimization. Area utilization also increased, without much timing degradation.

Figure 6 shows the final floorplan of the block. All the small blocks shown in the figure are macros placed in R0 and R180 orientations. Macro pins should generally face the core area, so the macros are placed in these orientations. The macros are also placed near the edges of the given area to leave a more uniform core area. The small macro-like blocks in the core area are tmac cells, which are used as clock root buffers in the top-level design.



Figure 6: Floor Plan of experiment block

Tool Used for Floor Plan: Encounter

Std Cell Utilization: 55.8%

No of Macros: 31

## **Area Optimization Experiment**

This experiment is done to reduce the area occupied by the standard cells. The cells on non-critical paths are downsized (their drive strength is decreased) to reduce cell area. This is also called standard cell utilization area reduction.

As shown in Figure 7, a cell with drive strength 8ur is 1.5 units wide; when it is downsized to 4ur, its width is reduced to 1 unit. Cells are downsized in this way to reduce the standard cell utilization area.

As the drive strength decreases, the transition delay of the cell increases, so timing violations may be reported. The slack lost to these violations can be recovered easily by upsizing the cells that violate timing.



Figure 7: Standard cell before and after downsize
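The downsizing step can be sketched as follows. The two cell widths come from the text (1.5 units at drive strength 8ur, 1 unit at 4ur); everything else, including the dictionary layout, the `downsize_penalty_ps` values, and the zero slack margin, is a hypothetical simplification rather than the actual tool flow.

```python
# Cell widths from the text: drive strength 8 is 1.5 units wide, 4 is 1 unit wide.
WIDTH_BY_DRIVE = {8: 1.5, 4: 1.0}


def downsize_noncritical(cells, margin_ps=0.0):
    """Downsize drive-8 cells to drive-4 when slack stays at or above the margin.

    Each cell is a dict with 'drive', 'slack_ps' and an assumed
    'downsize_penalty_ps' (extra delay caused by the weaker drive).
    """
    for cell in cells:
        slack_after = cell["slack_ps"] - cell["downsize_penalty_ps"]
        if cell["drive"] == 8 and slack_after >= margin_ps:
            cell["drive"] = 4
            cell["slack_ps"] = slack_after
    return cells


def total_width(cells):
    """Sum of cell widths, a stand-in for standard cell area in one row."""
    return sum(WIDTH_BY_DRIVE[c["drive"]] for c in cells)


# Made-up example: cell "b" is too close to critical to be downsized.
cells = [
    {"name": "a", "drive": 8, "slack_ps": 120.0, "downsize_penalty_ps": 40.0},
    {"name": "b", "drive": 8, "slack_ps": 10.0, "downsize_penalty_ps": 40.0},
    {"name": "c", "drive": 4, "slack_ps": 80.0, "downsize_penalty_ps": 40.0},
]
width_before = total_width(cells)
downsize_noncritical(cells)
width_after = total_width(cells)
print(width_before, width_after)
```

Checking the slack before committing each downsize avoids most of the violations that would otherwise have to be repaired by upsizing afterwards.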

#### Result

The standard cell utilization area before the experiment is 63%. The standard cell utilization area after the experiment is 60%.

## **Insertion Delay Reduction Experiment**

This experiment is performed after clock tree synthesis (CTS). If the insertion delay after CTS is greater than the required value, it must be reduced to avoid timing violations. There are several techniques to reduce clock insertion delay.

### Methods to Reduce Insertion Delay

- Placing far-sitting clock gaters near the root.
- Increasing the skew limit.
- Changing clock gating cells to a better drive strength before CTS.
- Creating bounds for the 1st and 2nd level clock gaters.
- Moving the PDLY cell to the center of the tile.
- Improving input transition by adding a big root buffer.
- Relaxing the sink/buffer transition to 10% of the clock period.

In this project, the first method is used to reduce the delay. If it gives no improvement, the remaining methods can be tried in turn. In this technique, the clock buffers that are directly connected to the main clock buffer are placed close to it, which reduces the transition delay between them.

Figure 8 shows the clock tree structure. ckbf represents the main clock buffer, and ckbf1, ckbf2, ..., ckbf6 represent the clock buffers inserted to feed the clock to all the cells in the design block. The secondary buffers ckbf1, ckbf2, and ckbf3, which are directly connected to the main clock buffer, are moved closer to it to reduce the insertion delay.



Figure 8: Clock Tree Structure
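The effect of moving the secondary buffers can be illustrated with a toy latency model, taking insertion delay as the largest root-to-sink clock arrival time. The tree topology below (ckbf4 to ckbf6 hanging under ckbf1 to ckbf3) and all delay values are assumptions made for the sketch; they are not taken from the experimental block.

```python
def sink_latencies(tree, delays, node, arrival=0.0):
    """Return the clock arrival times at all leaf buffers below `node`."""
    arrival += delays[node]
    children = tree.get(node, [])
    if not children:
        return [arrival]
    latencies = []
    for child in children:
        latencies.extend(sink_latencies(tree, delays, child, arrival))
    return latencies


# Assumed topology: main buffer, three secondary buffers, three leaf buffers.
tree = {
    "ckbf": ["ckbf1", "ckbf2", "ckbf3"],
    "ckbf1": ["ckbf4"],
    "ckbf2": ["ckbf5"],
    "ckbf3": ["ckbf6"],
}
# Delay in ps of each buffer stage, including the wire driving it (made-up).
delays = {"ckbf": 40.0, "ckbf1": 90.0, "ckbf2": 120.0, "ckbf3": 100.0,
          "ckbf4": 60.0, "ckbf5": 60.0, "ckbf6": 60.0}

before = sink_latencies(tree, delays, "ckbf")
insertion_before = max(before)

# Moving the secondary buffers closer to the root: assume 30 ps less wire delay each.
moved = dict(delays, ckbf1=60.0, ckbf2=90.0, ckbf3=70.0)
after = sink_latencies(tree, moved, "ckbf")
insertion_after = max(after)
skew_after = max(after) - min(after)
print(insertion_before, insertion_after, skew_after)
```

In this model the skew is unchanged by the move because every branch is shortened by the same amount; in a real design the branches shorten unevenly, so skew has to be rechecked.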

#### Result

Insertion delay before the experiment was 372.327 ps. Insertion delay after the experiment is 265.51 ps. The allowable insertion delay is 300 ps.

# **Experimental Results**

## **After Final Place and Route Stage**

| Clock Period                  | 1002.5ps |
|-------------------------------|----------|
| No of Violating paths         | 891      |
| Critical Path Levels of Logic | 40       |
| Worst Negative Slack          | -0.239ns |
| Total Negative Slack          | -33.88ns |

## **After Engineering Change Order Stage**

| Clock Period                  | 1002.5ps |
|-------------------------------|----------|
| No of Violating paths         | 0        |
| Critical Path Levels of Logic | 12       |
| Worst Negative Slack          | 0        |
| Total Negative Slack          | 0        |
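The slack metrics in the two tables above can be derived from per-path slacks as sketched below. The path slacks are made-up values for illustration; only the worst one (-0.239 ns) is borrowed from the post-route table. The sketch assumes the convention used in the tables, where worst negative slack is reported as 0 once no path violates.

```python
def timing_summary(slacks_ns):
    """Summarise setup timing from a list of path slacks in ns."""
    violating = [s for s in slacks_ns if s < 0.0]
    wns = min(violating) if violating else 0.0   # worst negative slack
    tns = float(sum(violating))                  # total negative slack
    return {"violating_paths": len(violating), "wns_ns": wns, "tns_ns": tns}


# Hypothetical path slacks before and after ECO fixes.
pre_eco = timing_summary([-0.239, -0.050, 0.120, 0.300])
post_eco = timing_summary([0.010, 0.040, 0.120, 0.300])
print(pre_eco)
print(post_eco)
```

Total negative slack is the more useful figure for tracking ECO progress, since it shrinks with every fixed path, while worst negative slack changes only when the single worst path improves.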

## **Results Summary After All Experiments**

**Table 1:** Results of all experiments

|                     | Before Experiment | After Experiment | Solution                                                      |
|---------------------|-------------------|------------------|---------------------------------------------------------------|
| Setup violation     | 891               | 0                | Upsizing standard cells in data path / skewing in clock path  |
| Hold violation      | 103               | 0                | Downsizing standard cells in data path / skewing in data path |
| Insertion delay     | 372.32ps          | 265.51ps         | Pulling clock gaters towards root clock buffer                |
| Skew                | 0.234ns           | 0                | Clearing setup violations                                     |
| Data tran violation | 64                | 0                | Buffer insertion / Increasing driver strength                 |
| Glitches            | 14                | 0                | Upsizing driver strength                                      |
| Data cap violations | 89                | 0                | Buffer insertion / net length reduction                       |
# **Overall Power Optimization**

The tables below show the overall power optimization per square mm at each place and route stage.

**Table 2:** Power Report without Optimization

| P & R Stage          | Leakage | Dynamic | Total Power |
|----------------------|---------|---------|-------------|
| Floorplan            | 0.078   | 1.01    | 1.09        |
| Placement            | 0.0919  | 1.145   | 1.23        |
| Clock Tree Synthesis | 0.09    | 1.16    | 1.25        |
| Route                | 0.09    | 1.18    | 1.27        |
| PostRoute            | 0.08    | 1.16    | 1.25        |

**Table 3:** Power Report with Optimization

| P & R Stage          | Leakage | Dynamic | Total Power |
|----------------------|---------|---------|-------------|
| Floorplan            | 0.064   | 0.9     | 1.07        |
| Placement            | 0.09    | 1.131   | 1.22        |
| Clock Tree Synthesis | 0.08    | 1.15    | 1.243       |
| Route                | 0.08    | 1.16    | 1.25        |
| PostRoute            | 0.08    | 1.15    | 1.23        |
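The leakage columns of Tables 2 and 3 can be compared stage by stage with a few lines of Python. The values are copied from the two tables; the helper name `percent_reduction` is introduced here for illustration only.

```python
# Leakage values copied from Table 2 (without optimization) and Table 3 (with optimization).
stages = ["Floorplan", "Placement", "Clock Tree Synthesis", "Route", "PostRoute"]
leak_without = [0.078, 0.0919, 0.09, 0.09, 0.08]
leak_with = [0.064, 0.09, 0.08, 0.08, 0.08]


def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


savings = {
    stage: round(percent_reduction(b, a), 1)
    for stage, b, a in zip(stages, leak_without, leak_with)
}
for stage, pct in savings.items():
    print(f"{stage}: {pct}% leakage reduction")
```

With the tabulated values, the largest relative leakage reduction appears at the floorplan stage (about 18%), while the post-route leakage figures are identical at the precision reported.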

# **VT Swap Experiment Results**

**Table 4:** Vt Swap Experiments report

| Run                                                 | Type of Power | Register Power | Clock Power | Memory Power | Combinational Power |
|-----------------------------------------------------|---------------|----------------|-------------|--------------|---------------------|
| Run1: RVT cells only without leakage optimization   | Dynamic       | 8.97e-03       | 1.37e-02    | 7.92e-02     | 5.82e-02            |
|                                                     | Static        | 6.14e-03       | 1.69e-02    | 7.55e-04     | 1.15e-02            |
| Run2: RVT cells with leakage optimization           | Dynamic       | 8.64e-03       | 1.37e-02    | 7.52e-02     | 5.66e-02            |
|                                                     | Static        | 4.15e-03       | 1.69e-02    | 7.25e-02     | 5.66e-02            |
| Run3: HVT cells only without leakage optimization   | Dynamic       | 8.86e-03       | 1.36e-02    | 7.81e-02     | 5.76e-02            |
|                                                     | Static        | 7.11e-03       | 1.69e-04    | 7.55e-04     | 1.06e-02            |
| Run4: HVT cells with leakage optimization           | Dynamic       | 8.67e-03       | 1.38e-02    | 7.47e-02     | 5.68e-02            |
|                                                     | Static        | 3.65e-03       | 1.69e-02    | 7.30e-04     | 5.81e-03            |

#### Conclusion

This could nevertheless be overcome to a certain extent by placing power switches in ONO blocks at larger spacing rather than at very close intervals. With this method, the power consumption of the design also came down to a certain extent. The flows can be further improved by identifying all the non-critical timing paths and optimizing only those paths by swapping their cells to HVT cells. This helps achieve design closure in fewer iterations and with lower power numbers than generic default flow methodologies.

### References

- [1] M. Pedram, "Design technologies for Low Power VLSI", in Encyclopedia of Computer Science and Technology, Vol. 36, Marcel Dekker, Inc., 1997, pp. 73-96.
- [2] Sourav Banerjee, Sreeram Chandrasekar and Yogesh Agarwal, "Power and Performance Optimization for an Ultra High Performance Mobile Processor using Multiple VT Libraries", SNUG 2010.
- [3] Suresh Raman and Pramod Sripathi, "Achieving power savings through Final Stage Leakage Recovery", SNUG 2011.
- [4] Estelle Fazilleau and Christophe Robichon, "Multi-supply multi-voltage UPF RTL-to-backend flow (Synthesis, Place and Route & Static Verification)", SNUG 2011.
- [5] M. Geetha Priya, K. Baskaran and D. Krishnaveni, "A Novel Leakage Power Reduction Technique for CMOS VLSI Circuits", European Journal of Scientific Research, Vol. 74, No. 1 (2012), pp. 96-105.
- [6] Ariel Wolf, "Robust Power Gating Implementation using ICC", SNUG Israel 2009.
- [7] Gurudev Bhat Sirsi, "Leakage Power Optimization Flow", International Cadence Usersgroup Conference, Santa Clara, CA, September 13-15, 2004.
- [8] Khosrow Golshan, "Physical Design Essentials".
- [9] Jan M. Rabaey and Massoud Pedram, "Low Power Design Methodologies", Boston, Kluwer Academic, 1996.
- [10] J. Bhasker and Rakesh Chadha, "Static Timing Analysis for Nanometer Designs".
- [11] "PrimeTime® Fundamentals User Guide", Version F-2011.06, September 2011.