Master Thesis

## Low-Power SRAM Design using Low-Voltage and Low-Swing Techniques

(低電圧、低振幅を用いた低消費電力 SRAM の設計)

February 1, 2002

### Thesis Supervisor Professor Takayasu Sakurai

Department of Electronic Engineering Graduate School of Engineering The University of Tokyo

> Sadaaki Hattori 服部 貞昭

### Contents

- Chapter 1 Introduction
  - 1.1. Historical perspective and future trends
  - 1.2. Low-power CMOS and SRAM design
  - 1.3. Objective of research
  - 1.4. Chapter organization
- Chapter 2 Write power saving scheme SRAM
  - 2.1. Introduction
  - 2.2. Sense-amplifying cell (SAC) scheme
    - 2.2.1. Write operation using low-swing bit-line
    - 2.2.2. Write voltage generator
    - 2.2.3. Write power saving
  - 2.3. Design Considerations
    - 2.3.1. Read access time
    - 2.3.2. Noise margin
    - 2.3.3. Area overhead
  - 2.4. Experiment results and discussions
  - 2.5. Summary
- Chapter 3 Power saving scheme for peripheral circuits and decoders
  - 3.1. Introduction
  - 3.2. Bypass level converter (BLC) and revised pass transistor type level converter
  - 3.3. Replica-biased level converter
  - 3.4. Fabrication
  - 3.5. Summary
- Chapter 4 Write power saving scheme for register files
  - 4.1. Introduction
  - 4.2. Sense-amplifying cell (SAC) scheme for register files
  - 4.3. Dual-VDD architecture for register files
  - 4.4. Summary
- Chapter 5 Conclusion
- Appendix A Power and read access time estimation
- Appendix B Details of test chip design
- References
- Acknowledgements
- List of Publications

#### 1.1. Historical perspective and future trends

During the past 30 years, MOS large-scale-integration circuits (LSI's) have made great progress. They appeared in the early 1970's with the introduction of the first microprocessor by Intel (the 4004) [1]. This was a 0.75-MHz processor implemented in 10-µm technology and composed of only 2,300 transistors. Now, 2-GHz microprocessors are produced using 0.18-0.13-µm technologies. With the huge number of transistors on a chip and their extremely high operating speed, LSI's can execute relatively intelligent tasks at moderate cost. Numerous portable devices such as mobile phones and notebook PC's could not have appeared without the remarkable progress of LSI's. In the early decades of the twenty-first century, the continued progress of LSI's is expected to cause wide-ranging social and cultural changes affecting the economy, industry, transportation, communication, education, medical care, entertainment, our lifestyles, and so on.

The downsizing of components has driven the progress of LSI's. As MOSFET's are downsized, the number of transistors on a chip increases and the functionality of LSI's improves, which in turn improves operating speed. Downsizing has followed a well-established rule: the number of transistors on a chip quadruples every three years, in accordance with Moore's law, which has held true for more than 25 years. Fig. 1.1 shows trends in the device count per memory chip and microprocessor chip during the past 30 years [2]. As can be observed, integration complexity quadruples approximately every three years.



Fig. 1.1 Trends in the device count per chip.

The downsizing of LSI's is expected to continue for at least the next 10 years. The latest projection for semiconductor chips, the 2001 Edition of the International Technology Roadmap for Semiconductors (ITRS) [3] reported by the Semiconductor Industry Association (SIA), is shown in Table 1.1.

| Year                 | 2001 | 2003 | 2005 | 2007 | 2010 | 2013 | 2016 |
|----------------------|------|------|------|------|------|------|------|
| Technology node (nm) | 130  | 100  | 80   | 65   | 45   | 32   | 22   |
| Gate Length (nm)     | 90   | 65   | 45   | 35   | 25   | 18   | 13   |
| V <sub>dd</sub> (V)  | 1.1  | 1.0  | 0.9  | 0.7  | 0.6  | 0.5  | 0.4  |
| Frequency (GHz)      | 1.7  | 3.1  | 5.2  | 6.7  | 11.5 | 19.3 | 28.8 |
| Power *1 (W)         | 130  | 150  | 170  | 190  | 218  | 251  | 288  |
| Power *2 (W)         | 2.4  | 2.8  | 3.2  | 3.5  | 3.0  | 3.0  | 3.0  |

Technology node: DRAM Half-Pitch

Gate Length: Printed Gate Length of Microprocessor,

V<sub>dd</sub>: Power Supply Voltage (high performance),

Frequency: Chip Frequency (On-chip local clock),

Power \*1: Allowable Maximum Power for high-performance desktop applications,

Power \*2: Allowable Maximum Power for portable battery operations

#### Table 1.1 ITRS Road Map.

This latest edition of the ITRS expects smaller dimensions than previously thought. The previous roadmap, released in 1999, called for future generations of DRAM to feature critical dimensions of 100 nm in 2005. Now the industry plans to deliver 80 nm in 2005. As scaling proceeds rapidly, chip frequency is also expected to increase at a high rate.

On the other hand, however, the following serious concerns will arise with 100-nm technology and below [4].

- 1. High production cost due to the increase in process steps and in equipment prices.
- 2. Saturation of the operating speed of LSI due to signal and clock propagation delay in the long and dense interconnections.
- 3. Degradation of the yield and reliability of LSI's due to the huge number of transistors in a chip; also, it will become difficult to keep the uniformity of the electrical characteristics of the huge number of transistors in a chip.
- 4. Increase of the power consumption and heat generation of a chip due to the huge number of transistors.

The fourth one is expected to be a particularly fatal problem. According to Table 1.1, the maximum power for high-performance desktop applications, 130 W in 2001, is estimated to rise to 288 W in 2016 despite the use of a lower supply voltage. The maximum local heat density in a chip is expected to be as high as that in a nuclear reactor, which sets a strict limit on the reliability of LSI's. The concern over increasing power consumption is even greater for battery-operated portable devices because the allowable maximum power is strictly limited by the small device size. Next-generation LSI's face a power crisis, and it is obvious that further progress of LSI's cannot be achieved without power-saving approaches.

#### 1.2. Low-power CMOS and SRAM design

LSI's are mainly composed of both logic circuits and memory circuits. First, in modern digital logic circuits, power consumption can be attributed to three main components: dynamic switching power ( $P_{dynamic}$ ), short circuit power ( $P_{sc}$ ), and leakage power ( $P_{leak}$ ), as shown in (1.1):

$$P = P_{dynamic} + P_{sc} + P_{leak} = p_t f_{CLK} C_L V_S V_{DD} + p_t f_{CLK} I_{sc} \Delta t_{sc} V_{DD} + I_0 10^{-(V_{TH} / S)} V_{DD} \tag{1.1}$$

where  $p_t$  is the switching probability,  $f_{CLK}$  is the clock frequency,  $C_L$  is the total effective switched capacitance,  $V_S$  is the signal voltage swing,  $V_{DD}$  is the supply voltage,  $I_{sc}$  is the average short-circuit current,  $\Delta t_{sc}$  is the time during which the short-circuit current flows,  $I_0$  is a constant proportional to the total transistors in a chip,  $V_{TH}$  is the threshold voltage, and S is the subthreshold swing. The first term represents the dynamic power consumed in charging and discharging the load capacitance, the second term is due to the direct-path short-circuit current, and the third term is the leakage power due to subthreshold conduction. Dynamic switching power is the dominant component of power consumption in modern digital logic circuits, and in most cases  $V_S$  is equal to  $V_{DD}$ , except in some logic circuits; hence the power consumption of digital logic circuits is expressed as

$$P = p_t f_{CLK} C_L V_{DD}^2 \tag{1.2}$$

Then lowering  $V_{DD}$  is the most effective way to achieve low-power performance. Certainly, scaling down  $C_L$  or  $f_{CLK}$  in (1.2) also contributes to low-power operation. Decreasing  $C_L$ , however, would be difficult without scaling down the device and wiring, and low  $f_{CLK}$  usually degrades throughput performance. Although there have been attempts to lower  $f_{CLK}$  by introducing parallel processing, this approach generally increases hardware overhead and requires extensive reworking at an architecture or algorithm design level [5].

Lowering  $V_{DD}$  is effective in reducing power dissipation; however, it is generally difficult because speed performance degrades dramatically at lower supply voltages. The gate delay time ( $t_{pd}$ ) is approximately given by

$$t_{pd} = \frac{kC_L V_{DD}}{\left(V_{DD} - V_{TH}\right)^{\alpha}}$$
(1.3)

where k is a constant,  $C_L$  is the load capacitance,  $V_{DD}$  is the supply voltage,  $V_{TH}$  is the threshold voltage, and  $\alpha$  is approximately 1.3 according to the alpha-power law MOSFET model [6]. In the above expression, lowering  $V_{DD}$  decreases  $V_{DD} - V_{TH}$ , which results in a drastic increase in gate delay. Fig. 1.2 (a) shows the dependence of gate delay and dynamic power on  $V_{DD}$ , assuming 0.35-µm technology. Dynamic power consumption decreases as  $V_{DD}$  goes down; however, gate delay increases dramatically.
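The trade-off expressed by (1.2) and (1.3) can be sketched numerically. The following Python fragment is only an illustration: the threshold voltage of 0.5 V and the 3.3-V reference supply are assumed values that do not come from this thesis, and the constants $p_t f_{CLK} C_L$ and $k C_L$ are dropped so that only relative trends are shown.

```python
ALPHA = 1.3   # alpha-power-law exponent quoted in the text
V_TH = 0.5    # assumed threshold voltage in volts (hypothetical)
V_REF = 3.3   # assumed reference supply in volts (hypothetical)


def dynamic_power(v_dd):
    """Relative dynamic power from (1.2): proportional to V_DD squared."""
    return v_dd ** 2


def gate_delay(v_dd, v_th=V_TH, alpha=ALPHA):
    """Relative gate delay from (1.3): V_DD / (V_DD - V_TH)**alpha."""
    if v_dd <= v_th:
        raise ValueError("(1.3) applies only for V_DD > V_TH")
    return v_dd / (v_dd - v_th) ** alpha


def normalized(v_dd):
    """Return (power, delay) relative to their values at V_REF."""
    return (dynamic_power(v_dd) / dynamic_power(V_REF),
            gate_delay(v_dd) / gate_delay(V_REF))


if __name__ == "__main__":
    for v_dd in (3.3, 2.5, 1.5, 1.0, 0.7):
        power, delay = normalized(v_dd)
        print(f"V_DD = {v_dd:.1f} V  power x{power:.3f}  delay x{delay:.2f}")
```

Under these assumptions the power falls quadratically while the delay grows sharply as $V_{DD}$ approaches $V_{TH}$, which is the qualitative behavior described for Fig. 1.2 (a).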

One way to overcome the speed degradation problem is to reduce V<sub>TH</sub> in (1.3); however, another significant problem arises: a rapid increase in the leakage power in (1.1). Fig. 1.2 (b) shows the dependence of gate delay and subthreshold leakage power on V<sub>TH</sub>, assuming  $I_0 = 5 \times 10^{-6}$  and S = 80 mV/decade in (1.1). Gate delay decreases as V<sub>TH</sub> goes down; however, subthreshold leakage power increases rapidly.
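The leakage term of (1.1) can be evaluated directly to see how quickly it grows. This is a minimal sketch using the values of $I_0$ and S stated above; the supply voltage of 1.0 V and the list of threshold voltages are assumptions chosen only for illustration, and the source does not state the units of $I_0$.

```python
I_0 = 5e-6   # constant from the text (units not stated in the source)
S = 0.080    # subthreshold swing in volts per decade (80 mV/decade)


def leakage_power(v_th, v_dd, i_0=I_0, s=S):
    """Leakage term of (1.1): I_0 * 10**(-V_TH / S) * V_DD."""
    return i_0 * 10 ** (-v_th / s) * v_dd


if __name__ == "__main__":
    v_dd = 1.0  # assumed supply voltage (hypothetical)
    for v_th in (0.5, 0.4, 0.3, 0.2, 0.1):
        print(f"V_TH = {v_th:.1f} V  P_leak = {leakage_power(v_th, v_dd):.3e}")
```

With S = 80 mV/decade, every 80-mV reduction in V<sub>TH</sub> multiplies this term by ten.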



Fig. 1.2 (a) Gate delay dependence and dynamic power dependence on V<sub>DD</sub>. (b) Gate delay dependence and subthreshold leakage power dependence on V<sub>TH</sub>.

One more significant problem emerges at low V<sub>TH</sub>, resulting from process fluctuation. The threshold voltage typically fluctuates by  $\pm 0.1$ V, which causes large gate delay variation at low V<sub>TH</sub>.

The trade-off between power and gate delay is to be resolved by optimizing both V<sub>DD</sub> and V<sub>TH</sub> effectively. Multiple-threshold-voltage schemes [7-12] and variable-threshold-voltage schemes [13-14] manage V<sub>TH</sub>. Multiple-supply-voltage schemes [15] and variable-supply-voltage schemes [16] manage V<sub>DD</sub>. Dynamic voltage scaling has recently become a well-known technique in actual processors [17].

SRAM, which features high speed and ease of use despite its high cost, is a main component of LSI's. It is found in the main memory of supercomputers, in the cache memory of mainframe computers, workstations, and microprocessors, and in the memory of handheld equipment. A low-power approach for SRAM is an even more pressing concern because of its large capacity; however, unlike in logic circuits, it is difficult to cut power consumption by reducing both V<sub>DD</sub> and V<sub>TH</sub>. If the V<sub>TH</sub> of the MOSFET's in memory cells is reduced, leakage power from the huge number of cells becomes considerably higher. On the other hand, low-voltage operation with high-V<sub>TH</sub> MOSFET's degrades performance and data retention reliability.

Therefore, different approaches to cutting power dissipation in SRAM must be taken, and many attempts have been reported. The sources of power dissipation in SRAM's are the memory cell arrays, data-line loads, column/sense circuits, and the other peripheral circuits. The divided word-line (DWL) structure [18-19], which reduces both column current and decoder delay, is typically adopted in modern SRAM's. Cell-driving schemes such as the boosted word-line scheme [20-21], the raised-V<sub>DD</sub> scheme [22-23], and the applied source-line scheme [24-25] increase the cell margin at low V<sub>DD</sub>. Single bit-line SRAM [26] has been proposed to decrease layout area and power dissipation in column/sense circuits. Latch-type sense amplifiers such as PMOS cross-coupled amplifiers [27] greatly reduce the dc current after amplification and latching. Current sense amplifiers [28] permit a small voltage swing on the bit-line, reducing the time for changing the bit-line voltage as well as the bit-line power dissipation. The divided bit-line [29] reduces bit-line capacitance, affording low power and high speed. Using low-V<sub>TH</sub> devices for the peripheral circuits [30] and for the memory cell [31-32] is also an effective way to achieve ultralow-voltage operation. An ultralow-voltage SRAM that operates at a V<sub>DD</sub> of around 0.5 V has been reported using low-V<sub>TH</sub> MOS [33].

#### 1.3. Objective of research

Power reduction in SRAM circuits is a pressing concern for future-generation LSI's, since SRAM is reported to account for approximately 30% of the total power of a chip. SRAM is mainly composed of memory cell arrays and peripheral circuits. A bit-line in a memory cell array has a large capacitance due to the parasitic capacitance of the transistors and of the interconnections. Thus, because of the full-swing nature of the bit-line, power dissipation in write cycles is particularly dominant in continuous write and read operation. Power dissipation in the bit-lines accounts for 83% of the total power in write cycles, assuming a 4-Mbit SRAM with a bit width of 256; however, no existing solution achieves drastic write power saving without a large penalty. Therefore, the objective of this research is to achieve low-power operation in write cycles without any complicated circuits. Two key techniques for write power saving are proposed in this work. The sense-amplifying cell (SAC) scheme using a low-swing bit-line reduces write power by approximately 80% compared to the conventional scheme. The dual-V<sub>DD</sub> SRAM architecture with a high-speed level converter halves the power in peripheral circuits.

Power reduction in write cycles is estimated by HSPICE simulation, assuming a 4-Mbit on-chip SRAM. Test chips are fabricated in order to verify operation and to obtain measurement results.

#### 1.4. Chapter organization

Chapter 2 presents a write power saving method using the sense-amplifying cell (SAC) scheme. A new write operation scheme that reduces power consumption in write cycles is proposed. The reduced write power is estimated by HSPICE simulation, assuming a 4-Mbit on-chip SRAM, and the design is examined with regard to read delay, static noise margin, and area overhead. Two 64-Kbit test chips have been fabricated and measurement results are obtained.

Chapter 3 is concerned with a power saving scheme for peripheral circuits and decoders. A dual-V<sub>DD</sub> SRAM architecture, which enables power reduction in peripheral circuits and decoders, is proposed by introducing a high-speed replica-biased level converter. The power reduction of an SRAM in which the SAC scheme and the dual-V<sub>DD</sub> scheme are combined is estimated by HSPICE simulation, assuming a 4-Mbit SRAM. Two test chips have been fabricated: one with four types of level converters and one with a 2-Kbit dual-V<sub>DD</sub> SRAM with the replica-biased level converter.

Chapter 4 is devoted to the application of both the SAC scheme and the dual-V<sub>DD</sub> scheme to register files. A test chip with two types of 1-Kbit register files using the SAC scheme has been fabricated. A 16-word  $\times$  16-bit register file with the replica-biased level converter has also been fabricated.

Finally, chapter 5 presents the conclusions.

#### 2.1. Introduction

SRAM continues to be an important building block of System-on-a-Chip's. Low-power operation of on-chip SRAM's is becoming more important for mobile applications. Lowering both the supply voltage ( $V_{DD}$ ) and the threshold voltage ( $V_{TH}$ ) is the most effective way to cut power dissipation in logic circuits; however, it is difficult to apply this solution to SRAM's, as described in chapter 1.

On-chip SRAM's tend to have large bit widths, such as 16 to 256 or even greater. In this type of SRAM, power is dissipated mainly by the charging and discharging of the bit-lines due to their full-swing nature. Fig. 2.1 shows a conventional SRAM circuit and its write cycle waveforms. A bit-line normally has a large capacitance due to the huge number of pass transistors connected to it and due to its long wire. In the conventional SRAM, a pair of bit-lines is charged from ground to  $V_{DD}$  and one of the bit-lines is discharged before a write operation. The power consumption in charging is  $f \times C_{BL} \times V_{DD}^2$ , where f is the clock frequency,  $C_{BL}$  is the load capacitance, and  $V_{DD}$  is the supply voltage. The power consumption in write cycles is much larger than that in read cycles because the bit-line normally swings by only 100 mV in read cycles.



Fig. 2.1 (a) Circuit Schematic of conventional SRAM. (b) Write cycle waveforms.

Therefore, reducing the bit-line swing is the most effective way to reduce the power dissipation in write cycles, and a half-swing technique has been reported [34]. In the half-swing scheme, a 75% power reduction was achieved by restricting the bit-line swing to half of V<sub>DD</sub> and by using a charge recycling technique between the positive and negative half-swings. Fig. 2.2 (a) and (b) show the half-swing SRAM and its write cycle waveforms. Bit-lines are charged to half V<sub>DD</sub> instead of being fully charged as in the conventional SRAM. Fig. 2.2 (c) shows a logic gate using the charge recycling method. Each signal swings by half V<sub>DD</sub>, and charge is recycled between the positive pulse and the negative pulse. The charge in the bit-line is also recycled with the write data bus. The power consumption in charging is then  $(1/2)\times f\times C_{BL}\times V_{DD}\times (V_{DD}/2)$ . In this scheme, however, it is difficult to reduce the power consumption further by reducing the bit-line swing because of the write-error problem. What is more, if write and read cycles alternate, additional power is consumed due to the mismatch between the precharge level of the bit-line in read cycles and that in write cycles. A read operation at a bit-line potential of V<sub>DD</sub>/2 causes read errors, so the precharge level in read cycles must be above V<sub>DD</sub>/2. One more issue with the half-swing technique is that complex circuits are required for charge recycling, which results in a large area overhead and an increase in cost.



Fig. 2.2 (a) Circuit Schematic of half swing SRAM. (b) Write cycle waveforms. (c) Half swing pulse mode AND gate.

The driving source-line (DSL) scheme [25] achieved bit-line operation with a swing lower than half V<sub>DD</sub>. A source-line, connected to the source terminals of the driver MOSFET's, is controlled so that it is negative in read cycles and floating in write cycles, as shown in Fig. 2.3. When the source-line voltage is driven negative, the p-n source-substrate junction is forward biased and the threshold voltage is reduced. A negative source-line also decreases the cell node potentials, which results in a lowered source-terminal voltage in the transfer MOSFET's. Thus, the gate-to-source voltage is boosted and a shorter access time can be obtained. In a write cycle, the source-line is floating while the word-line is high, which makes it easy to invert the node potentials of the memory cells. Therefore, a small bit-line swing is adequate to control the node potential.

Power consumption for charging the bit-line at 1.0 V and 100 MHz is reduced from 0.36 W to 0.03 W. However, difficulties in realizing the DSL scheme can be expected. First, the source-line must be divided in order to avoid electromigration caused by the huge number of cells connected to it and to prevent the loss of node potential in unselected cells in write cycles. Separated source-lines need the same number of complex source drivers to produce the negative voltage, but it is difficult to place multiple complex source drivers in a row of memory cell arrays. Second, the boosted gate-to-source voltage degrades device reliability, particularly in future, further scaled-down LSI's. Third, a write operation at a precharge level of half V<sub>DD</sub> increases the probability of write errors, and the precharge level in read cycles should be above V<sub>DD</sub> to reduce the probability of read errors. It is also questionable that no device implementation has been reported.



Fig. 2.3 (a) Circuit schematic of DSL scheme. (b) Read cycle waveforms of DSL and conventional cells. (c) Write cycle waveforms of DSL and conventional cells.

In this work, low-swing bit-line operation is realized by introducing an additional NMOS in a memory cell, not by applying a negative voltage to the source-line. There is no electromigration problem, and the proposed SRAM operates without any complex circuits.

#### 2.2. Sense-amplifying cell (SAC) scheme

#### 2.2.1. Write operation using low-swing bit-line

Fig. 2.4 (a) shows the circuit diagram of the proposed sense-amplifying cell (SAC) scheme. The salient feature of the scheme is an additional NMOS connected to the source terminals of the driver transistors in a memory cell, which enables a write operation with a low-swing bit-line. If the bit-line swing is denoted as  $\Delta V_{BL}$ , the power consumed in charging the bit-line is  $f \times C_{BL} \times V_{DD} \times \Delta V_{BL}$ . A pair of bit-lines is precharged to  $V_{DD}$ - $V_{TH}$  by an NMOS load, and one of the bit-lines is pulled down to  $V_{DD}$ - $V_{TH}$ - $\Delta V_{BL}$  in a write '0' operation. The write voltage  $V_{WR}$ , equal to  $V_{DD}$ - $V_{TH}$ - $\Delta V_{BL}$ , is prepared by a DC-DC converter with the help of a write voltage generator. Thus the power consumption in charging the bit-line is  $f \times C_{BL} \times \Delta V_{BL}^2$ . Assuming  $\Delta V_{BL} = 1/6$   $V_{DD}$ , the write power for charging the bit-line can theoretically be reduced to 1/36 of that in the conventional full-swing scheme. Moreover, the long I/O data-lines are also precharged to  $V_{DD}$ - $V_{TH}$  and pulled down to  $V_{WR} = V_{DD}$ - $V_{TH}$ - $\Delta V_{BL}$ , so the power consumed in charging the I/O data-lines is reduced as well.
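The theoretical bit-line charging power of the three write schemes can be compared with exact fractions. The sketch below uses only the expressions quoted in this chapter, with the common factor $f \times C_{BL}$ dropped; the function names are illustrative and are not taken from the thesis.

```python
from fractions import Fraction


def conventional(v_dd):
    """Full-swing write: f * C_BL * V_DD**2, per unit f * C_BL."""
    return v_dd * v_dd


def half_swing(v_dd):
    """Half-swing write with charge recycling: (1/2) * V_DD * (V_DD / 2)."""
    return Fraction(1, 2) * v_dd * (v_dd / 2)


def sac(delta_v_bl):
    """SAC write supplied at V_WR by the DC-DC converter: dV_BL**2."""
    return delta_v_bl * delta_v_bl


V_DD = Fraction(3, 2)   # 1.5 V
DELTA_V_BL = V_DD / 6   # 0.25 V

if __name__ == "__main__":
    base = conventional(V_DD)
    print(half_swing(V_DD) / base)   # 1/4
    print(sac(DELTA_V_BL) / base)    # 1/36
```

The ratio of 1/4 corresponds to the 75% reduction reported for the half-swing scheme, and 1/36 is the theoretical figure for the SAC scheme at $\Delta V_{BL} = V_{DD}/6$. These are ideal charging ratios only and ignore short-circuit current and the power of other circuits.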



Fig. 2.4 (a) Circuit schematic of Sense-amplifying cell (SAC) scheme SRAM. (b) Write cycle waveforms

Fig. 2.4 (b) shows the timing chart of the SAC scheme in a write cycle. The NMOS switch connected to the source terminals of the driver transistors is turned off by a signal (SL) before a word-line is accessed in a write cycle. Even if the voltage difference between a pair of bit-lines is small, the cell node potentials can be inverted because, thanks to the NMOS switch, the driver transistors do not draw current while the word-line is activated. Unlike in the half-swing technique, there is no additional power consumption even if write and read cycles alternate, because there is no mismatch between the bit-line level in read cycles and that in write cycles. There is no read error because the precharge level in a read cycle is above  $V_{DD}/2$ .



Fig. 2.5 Simulated waveforms of SAC scheme SRAM in a write cycle.



Fig. 2.6 Simulated waveforms of SAC scheme SRAM in a read cycle.

Fig. 2.5 shows HSPICE-simulated waveforms of the SAC scheme SRAM in a write cycle. Cell node potentials (A, B) are inverted properly at  $\Delta V_{BL} = 0.25$ V ( $V_{BL} = 0.80$ V,  $V_{BL,bar} = 0.55$ V). Fig. 2.6 shows simulated waveforms of the SAC SRAM in a read cycle. A latch-type sense amplifier is assumed, as shown in Fig. 2.7. There is no difference between the read operation in the SAC SRAM and that in the conventional SRAM. Output data is obtained when the voltage difference of a pair of bit-lines reaches 100 mV.



Fig. 2.7 Circuit schematic of a latch-type sense amplifier.

#### 2.2.2. Write voltage generator

The precharge level of the bit-line must not be V<sub>DD</sub> because the access transistors of the cell cannot turn on in a write operation in this scheme. Therefore the bit-line is precharged to V<sub>DD</sub>-V<sub>TH</sub> by an NMOS load and is pulled down to V<sub>DD</sub>-V<sub>TH</sub>- $\Delta$ V<sub>BL</sub> in a write '0' operation.  $\Delta$ V<sub>BL</sub> must be independent of V<sub>TH</sub> fluctuation in order to assure stable write operation. Fig. 2.8 shows various types of write voltage generator that make the voltage V<sub>WR</sub> equal to V<sub>DD</sub>-V<sub>TH</sub>- $\Delta$ V<sub>BL</sub> even in the presence of V<sub>TH</sub> fluctuation.

Fig. 2.8 (a) is an ideal write voltage generator. If the gate widths of the NMOS's are denoted  $W_1$ ,  $W_2$  and  $W_3$ , and the current of the current source is denoted I, the following equations hold:

$$\beta W_1 (V_{DD} - V_{TH}' - V_{WR})^{\alpha} = \beta W_2 (V_1 - V_{TH})^{\alpha} \tag{2.1}$$

$$I = \beta W_2 (V_1 - V_{TH})^{\alpha} \tag{2.2}$$

where  $\beta$  and  $\alpha$  are constants, V<sub>DD</sub> is the supply voltage, V<sub>TH</sub> is the threshold voltage, and V'<sub>TH</sub> is the effective threshold voltage of the NMOS whose gate is pulled up to V<sub>DD</sub>. The gate lengths of all the NMOS's are the same. Taking the  $\alpha$ th root of (2.1) and (2.2),

$$\beta' \omega_1 (V_{DD} - V_{TH}' - V_{WR}) = \beta' \omega_2 (V_1 - V_{TH}), \qquad \beta' = \beta^{\frac{1}{\alpha}}, \quad \omega_1 = W_1^{\frac{1}{\alpha}}, \quad \omega_2 = W_2^{\frac{1}{\alpha}} \tag{2.3}$$

$$I' = \beta' \omega_2 (V_1 - V_{TH}), \qquad I' = I^{\frac{1}{\alpha}} \tag{2.4}$$

and if (2.3) is rearranged,

$$V_{WR} = V_{DD} - V_{TH}' - \frac{\omega_2}{\omega_1} (V_1 - V_{TH}). \tag{2.5}$$

From (2.4) and (2.5), we have

$$V_{WR} = V_{DD} - V_{TH}' - \frac{I'}{\omega_1 \beta'} \tag{2.6}$$

$$\Delta V_{BL} = \frac{I'}{\omega_1 \beta'} \quad \text{(const.).} \tag{2.7}$$

Fig. 2.8 Circuit schematic of various types of reference voltage generator (V<sub>WR</sub> generator).
(a) Ideal V<sub>WR</sub> generator.
(b) V<sub>WR</sub> generator using resistor.
(c) V<sub>WR</sub> generator using PMOS whose gate is pulled down to ground.
(d) V<sub>WR</sub> generator using NMOS diode.
(e) V<sub>WR</sub> generator using PMOS diode.

In (2.7),  $\Delta V_{BL}$  is independent of V<sub>TH</sub>, which means that  $\Delta V_{BL}$  does not change even if V<sub>TH</sub> fluctuates. Fig. 2.8 (b), (c), (d) and (e) show write voltage generators in which the current source is replaced with various kinds of elements. These voltage generators are not independent of V<sub>TH</sub> fluctuation. Fig. 2.9 shows simulated results of the dependence of  $\Delta V_{BL}$  on the V<sub>TH</sub> fluctuation. The nominal  $\Delta V_{BL}$  without V<sub>TH</sub> fluctuation is assumed to be 0.25 V at a V<sub>DD</sub> of 1.5 V. The ideal voltage generator (type (a)) is almost independent of V<sub>TH</sub> fluctuation, though a slight  $\Delta V_{BL}$  variation is observed because of the linear characteristic of the MOSFET I-V curve in the saturation region. Type (d) and type (e) are strongly affected by V<sub>TH</sub> fluctuation. Fig. 2.9 shows that, among the generators without an ideal current source, the one using a resistor (type (b)) performs best. When V<sub>TH</sub> fluctuates by  $\pm 0.15$  V, the  $\Delta V_{BL}$  fluctuation can be kept as low as  $\pm 30$  mV. However, the resistance itself also fluctuates in the generator using a resistor. Fig. 2.10 shows the dependence of  $\Delta V_{BL}$  of the type (b) write voltage generator on the V<sub>TH</sub> fluctuation and on the fluctuation of the resistance, denoted R. When R fluctuates by 15%, the  $\Delta V_{BL}$  fluctuation can be kept as low as  $\pm 20$  mV. Therefore, the write voltage generator using a resistor is the most suitable one for producing the write voltage V<sub>WR</sub>. V<sub>WR</sub> is used as a reference voltage in the DC-DC converter, and the converter supplies V<sub>WR</sub> to each bit-line through a write circuit.
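The V<sub>TH</sub> independence stated by (2.6) and (2.7) for the ideal generator can be checked numerically. The following Python sketch is only an illustration of those two equations: the values of  $\beta$ ,  $W_1$ , and the swept effective threshold voltages are hypothetical, and the bias current is chosen so that  $\Delta V_{BL}$  equals 0.25 V. It does not model the non-ideal generators of Fig. 2.8 (b) to (e).

```python
ALPHA = 1.3    # alpha-power-law exponent (value quoted in chapter 1)
BETA = 1.0e-4  # hypothetical transconductance factor
W_1 = 10.0     # hypothetical gate width of the NMOS tied to V_DD
V_DD = 1.5     # supply voltage in volts


def delta_v_bl(current, beta=BETA, w_1=W_1, alpha=ALPHA):
    """(2.7): dV_BL = I' / (omega_1 * beta')."""
    i_prime = current ** (1.0 / alpha)
    beta_prime = beta ** (1.0 / alpha)
    omega_1 = w_1 ** (1.0 / alpha)
    return i_prime / (omega_1 * beta_prime)


def write_voltage(v_th_eff, current):
    """(2.6): V_WR = V_DD - V'_TH - dV_BL."""
    return V_DD - v_th_eff - delta_v_bl(current)


# Bias current chosen (as an assumption) so that dV_BL is 0.25 V.
I_BIAS = BETA * W_1 * 0.25 ** ALPHA

if __name__ == "__main__":
    for v_th_eff in (0.55, 0.70, 0.85):  # hypothetical 0.70 V +/- 0.15 V
        v_wr = write_voltage(v_th_eff, I_BIAS)
        swing = (V_DD - v_th_eff) - v_wr
        print(f"V'_TH = {v_th_eff:.2f} V  V_WR = {v_wr:.3f} V  "
              f"dV_BL = {swing:.3f} V")
```

In this idealized model V<sub>WR</sub> tracks the threshold voltage while the swing below the precharge level stays fixed, which is the property the write voltage generator is meant to provide.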



Fig. 2.9  $\Delta V_{BL}$  dependence on the V<sub>TH</sub> fluctuation.



Fig. 2.10  $\Delta V_{BL}$  dependence on the V<sub>TH</sub> fluctuation and on the R fluctuation.



Fig. 2.11 Power dissipation in memory cell arrays in write cycles as a function of  $\Delta V_{BL}$ .



Fig. 2.12 Required time to invert cell node potential after the word-line is activated, as a function of $\Delta V_{BL}$.

Fig. 2.11 shows simulation results of power dissipation in memory cell arrays operated at 100 MHz in 0.35-µm technology, assuming a 4-Mbit SRAM with a bit width of 256. Power dissipation is estimated as the sum of the power for charging bit-lines and the power due to short-circuit current in a write operation. The half-swing scheme SRAM saves 75% of the power dissipation by using half-swing bit-lines and charge recycling. The proposed sense-amplifying cell (SAC) scheme reduces the power dissipation to approximately 1/30 at $\Delta V_{BL}$ of 0.25 V. The minimum value of $\Delta V_{BL}$ is 0.10 V, and write power in memory cells can be reduced to nearly 1/115 at $\Delta V_{BL}$ of 0.10 V. However, the delay required to invert the cell node potentials increases. Fig. 2.12 shows the delay required to invert the cell node potential after the word-line is activated. Considering power and delay, $\Delta V_{BL} = 1/6 V_{DD} (\Delta V_{BL} = 0.25 \text{ V} \text{ at } V_{DD} = 1.5 \text{ V})$ is chosen as the best condition for a write operation in this scheme.

Fig. 2.13 shows simulated results of the total power dissipation of the assumed 4-Mbit on-chip SRAM in a write cycle and in a read cycle versus bit width. The larger the bit width, the more total write power is saved, because the power consumed by bit-line charge and discharge becomes more dominant over the power of the other circuits as the bit width increases. When the bit width is 256, total write power is reduced by 90% and 67% compared with the conventional full-swing scheme and the half-swing scheme, respectively. Power consumption in a read cycle is larger than that of the SAC scheme in a write cycle because the bit-line charging power in the SAC scheme is theoretically 1/36 of that for a full-swing bit-line, while the bit-line charging power in a read cycle is 1/15 of that for full swing at a bit-line swing of 0.1 V when $V_{DD} = 1.5$ V.
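The two ratios quoted above (1/36 and 1/15) can be reproduced with a short calculation. The sketch below is not taken from the thesis: it assumes that in a SAC write the bit-line swing is drawn from a supply at the reduced write level, so the charging energy scales as $C \Delta V_{BL}^2$, whereas in a read the small swing is restored from the full V<sub>DD</sub> supply, so the energy scales as $C V_{DD} \Delta V_{BL}$. The function names are illustrative only.

```python
def write_ratio(v_dd: float, dv_bl: float) -> float:
    """Bit-line charging energy of a low-swing write relative to full swing.

    Assumption: the swing dv_bl is drawn from a supply at the reduced level,
    so energy ~ C * dv_bl**2, versus C * v_dd**2 for a full-swing bit-line.
    """
    return (dv_bl / v_dd) ** 2


def read_ratio(v_dd: float, dv_bl: float) -> float:
    """Bit-line charging energy of a small-swing read relative to full swing.

    Assumption: the bit-line is restored from the full supply v_dd,
    so energy ~ C * v_dd * dv_bl, versus C * v_dd**2 for full swing.
    """
    return dv_bl / v_dd


V_DD = 1.5
w = write_ratio(V_DD, V_DD / 6)  # write swing of V_DD/6 = 0.25 V
r = read_ratio(V_DD, 0.1)        # read swing of 0.1 V
print(f"write: 1/{1 / w:.0f} of full-swing charging energy")
print(f"read:  1/{1 / r:.0f} of full-swing charging energy")
```

Under these assumptions the quadratic dependence on the swing is what lets a 0.25 V write cost less bit-line energy than a 0.1 V read.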



Fig. 2.13 Total power dissipation of assumed 4-Mbit SRAM in a write cycle and in a read cycle versus bit width.

The simulated total write power dissipation in Fig. 2.13 is, however, not an exact estimate because the large number of data bus lines is not considered. Since it is difficult to simulate the power consumption of all the data bus lines and peripheral circuits, the total write power is calculated instead; the details of the power estimation are described in appendix A. The ratio of the power dissipation of the SAC scheme to that of the conventional full-swing scheme depends on the ratio of the bit-line capacitance to the data bus line capacitance, as shown in Fig. 2.14. When the ratio is assumed to be 4, the total write power dissipation in the SAC scheme is reduced by approximately 81% compared with that of the conventional full-swing scheme.



Fig. 2.14 Power dissipation ratio dependence on capacitance ratio.



Fig. 2.15 Calculated total power dissipation of assumed 4-Mbit SRAM in a write cycle and in a read cycle versus bit width.

#### 2.3. Design Considerations

The SAC scheme SRAM enables approximately 81% write power saving. However, performance degradation must be discussed because an additional NMOS switch is connected to the source terminal of the driver transistor. The NMOS switch can be shared by N (N = 2, 4, 8) cells in order to reduce overhead, as shown in Fig. 2.16. N should not be too large, in order to avoid the problem of electromigration: the current in a source-line shared by N memory cells is N times larger than that of a normal separated cell.



Fig. 2.16 NMOS switch is shared by N cells.

If the channel width of the NMOS switch is denoted as Wsw, the effective channel width of the NMOS switch per cell is Wsw/N. The channel width of a driver transistor is set to three times that of an access transistor, and the parameter $\beta$ denotes the ratio of the channel width of the NMOS switch per cell to the channel width of the access transistor. A $\beta$ of infinity corresponds to the conventional cell without an NMOS switch. Read access time, noise margin and area overhead are discussed as functions of $\beta$ and N.
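The definition of $\beta$ can be written out as a small helper. This is only an illustrative sketch: the access-transistor width of 1.0 µm is a hypothetical value chosen for the example, and the function names are not from the thesis.

```python
import math


def beta(w_switch: float, n_shared: int, w_access: float) -> float:
    """Ratio of the switch width per cell (Wsw / N) to the access-transistor width."""
    return (w_switch / n_shared) / w_access


def switch_width(beta_target: float, n_shared: int, w_access: float) -> float:
    """Total switch width Wsw needed to reach beta_target when N cells share it."""
    return beta_target * n_shared * w_access


W_ACCESS = 1.0             # hypothetical access-transistor width in um
W_DRIVER = 3.0 * W_ACCESS  # driver width is three times the access width

for n in (2, 4, 8):
    w_sw = switch_width(3.0, n, W_ACCESS)
    print(f"N = {n}: Wsw = {w_sw:.1f} um gives beta = {beta(w_sw, n, W_ACCESS):.1f}")

# A conventional cell without a switch corresponds to an infinitely wide switch.
print(beta(math.inf, 4, W_ACCESS))
```

The helper makes explicit that keeping $\beta$ fixed while sharing the switch among more cells requires Wsw to grow in proportion to N.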

#### 2.3.1. Read access time

Fig. 2.17 shows simulated results of read delay as a function of $\beta$ and N. The read delay of the conventional scheme is plotted in the same figure. Bit-line delay is the time from word-line activation until the voltage difference of a bit-line pair reaches 100 mV. Total access time is measured from the address buffer input to the output of the output buffer placed after the sense amplifier. Sensing time is simulated, while addressing time and output-data buffering time are calculated; the details of the calculation are described in appendix A.1. The bit-line delay at $\beta$ of 1, the worst case, is 2.08 ns, while that of the conventional scheme is 1.24 ns. This is not a significant increase in read delay, however, since bit-line delay is only 15% to 25% of the total access time and the rest of the access time is almost independent of the characteristics of the memory cells. Read access time increases by only 10% at $\beta$ of 1 compared with the conventional scheme. No increase in read delay is observed when the NMOS switch is shared among N memory cells.



Fig. 2.17 Read delay as a function of parameter  $\beta$  and N.

#### 2.3.2. Noise margin

Fig. 2.18 shows the method of static noise margin analysis. Static noise margin is defined as the side length of the largest square that fits in the area bounded by the transfer curve of the memory cell and its 45-degree mirror [35-36]. Fig. 2.19 shows the normalized simulated static noise margin. Static noise margin is degraded by the additional NMOS switch in a memory cell. Static noise margin at $\beta$ of 1 is 0.13 V<sub>DD</sub> while that of the conventional scheme is 0.23 V<sub>DD</sub>. Considering noise margin degradation, it is preferable that $\beta$ is more than 3. There is no noise margin decrease even when the NMOS switch is shared among N memory cells.
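The largest-square definition can be illustrated numerically. The sketch below is an assumption-laden example, not the simulation used in the thesis: it replaces the real cell transfer curve with a hypothetical tanh-shaped inverter curve, and it finds the square by scanning 45-degree lines, whose intersections with the curve and with its mirror are opposite corners of a candidate square. All names and the gain values are illustrative.

```python
import math


def inverter_vtc(v_in, v_dd=1.0, gain=8.0):
    """Hypothetical tanh-shaped inverter transfer curve (not the thesis cell)."""
    return 0.5 * v_dd * (1.0 - math.tanh(gain * (v_in - 0.5 * v_dd) / v_dd))


def bisect_decreasing(func, lo, hi, iters=60):
    """Root of a decreasing function with func(lo) >= 0 >= func(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if func(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def static_noise_margin(vtc, v_dd=1.0, steps=400):
    """Side of the largest square inside the upper-left lobe of the butterfly plot.

    Each line y = x + c cuts the curve y = vtc(x) at (x1, x1 + c) and the
    mirrored curve x = vtc(y) at (y2 - c, y2). These two points are opposite
    corners of an axis-aligned square whose side is x1 - (y2 - c).
    """
    c_max = min(vtc(0.0), v_dd - vtc(v_dd))  # keeps both roots inside [0, v_dd]
    best = 0.0
    for i in range(1, steps):
        c = c_max * i / steps
        x1 = bisect_decreasing(lambda x: vtc(x) - (x + c), 0.0, v_dd)
        y2 = bisect_decreasing(lambda y: vtc(y) - (y - c), 0.0, v_dd)
        best = max(best, x1 - (y2 - c))
    return best


snm_steep = static_noise_margin(lambda v: inverter_vtc(v, gain=8.0))
snm_soft = static_noise_margin(lambda v: inverter_vtc(v, gain=4.0))
print(f"SNM with gain 8: {snm_steep:.3f} V_DD")
print(f"SNM with gain 4: {snm_soft:.3f} V_DD")
```

With this toy curve, a lower gain shrinks the lobe and therefore the square, which is the same qualitative effect that a weaker pull-down path has on the cell.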



Fig. 2.18 Method of static noise margin analysis.



Fig. 2.19 Static noise margin as a function of parameter  $\beta$  and N.

#### 2.3.3. Area overhead

Fig. 2.20 shows area overhead as a function of $\beta$ and N when cell area occupancy is assumed to be 60% of the total area of an SRAM macro. Area overhead is normalized, taking the area of the conventional SRAM as 1.0. The area penalty decreases as the number of cells sharing one NMOS switch increases. In the case of N = 2, area overhead does not decrease in the range where $\beta$ is less than 3 because the active area and contacts limit the minimum area. Considering area overhead, it is preferable that $\beta$ is less than 3. Therefore, $\beta$ of 3 is found to be the best condition from the results of the noise margin analysis and the area overhead.



Fig. 2.20 Area overhead as a function of parameter  $\beta$  and N.

Read access time and noise margin are independent of N, so a larger N is preferable for the SAC scheme in order to reduce area overhead. However, considering the problem of electromigration in a source-line shared by N memory cells, a larger N is not necessarily a profitable choice, because the source-line must be made wider and additional area overhead may result. Thus N = 4 is chosen as the best condition.

To sum up the above simulated results, both read delay and noise margin may be degraded by the additional NMOS switch in a memory cell, but sharing the NMOS switch among N memory cells causes no further read delay increase or noise margin decrease. A large $\beta$ is good for delay and noise margin, but an area increase is expected. From these design considerations, $\beta=3$ and N=4 are chosen, which corresponds to a 5% read access time increase, a 0.05 V<sub>DD</sub> noise margin decrease, and an 11% area overhead.

#### Subthreshold leakage

Subthreshold leakage in memory cell arrays is a significant issue because of the huge number of memory cells. Reducing subthreshold leakage in a memory cell is normally accompanied by a large area overhead [37], unlike logic circuits with cut-off transistors in multiple-threshold schemes. The SAC scheme is expected to be useful for cutting subthreshold leakage power by controlling the gate voltage of the added NMOS switch.



Fig. 2.21 Subthreshold leakage current at  $\beta$  of 3.



Fig. 2.22 Static characteristic of a memory cell as a function of  $V_{SL}$ .

Fig. 2.21 shows subthreshold leakage current at $\beta$ of 3 as a function of the gate voltage of the NMOS switch, and Fig. 2.22 illustrates the static characteristic of a memory cell as a function of V<sub>SL</sub>. Subthreshold leakage current does not decrease until V<sub>SL</sub> is lowered to 0.1 V, since the potential of the cell node that stores data '0' increases slightly when V<sub>SL</sub> is reduced. Cell node potential is not retained safely when V<sub>SL</sub> is lowered below 0.2 V. The simulated results show that reducing subthreshold leakage current without degrading noise margin is impossible, but there is a possibility that data can be safely retained at a low noise margin in a standby mode. The real characteristic of subthreshold leakage current should be discussed with measurement results.

#### 2.4. Experiment results and discussions

Fig. 2.23 shows a microphotograph of the first 64-Kbit test chip fabricated in a 0.35-µm triple-metal CMOS process. The gate lengths of the NMOS and PMOS devices are both 0.40 $\mu$m and the threshold voltage is 0.45 V. The memory cell array is organized as 256 words $\times$ 256 bits. The memory cell size is $5.45 \times 8.35 \ \mu\text{m}^2$ and the size of four memory cells with one NMOS switch is $29.55 \times 8.35 \ \mu\text{m}^2$. The test chip contains only the 64-Kbit memory cell arrays, precharge circuits and read buffers. Output data is obtained directly from the bit-line via an inverter without a sense amplifier. Fig. 2.24 shows measured waveforms of the buffer data output at V<sub>DD</sub> of 1.5 V and 1 MHz. When write and read operations are performed alternately, the correct output waveform is observed. Operation at 100 MHz is possible according to simulation; however, it could not be measured because the test chip has no on-chip oscillator. The measurement result for power dissipation in memory cell arrays in write cycles is plotted in Fig. 2.11, and total write power dissipation at a bit width of 256 is plotted in Fig. 2.12, obtained by summing the measured write power in memory cell arrays and the calculated power in peripheral circuits. The SAC scheme enables a write operation with a low-swing bit-line and drastically reduces the power dissipation in charging the bit-line. The total write power is reduced by 81% when the bit width is 256.



Fig. 2.23 Microphotograph and cell layout of the 1st SRAM test chip



Fig. 2.24 Output waveform

Write power is reduced; however, the area overhead of the 7-transistor memory cells reaches 35% relative to conventional cells because no attention was paid to layout. Layout improvement of memory cells in the SAC scheme is therefore an urgent concern, and another test chip is fabricated.

Fig. 2.25 shows the layout of the 64-Kbit SRAM on the 2nd test chip fabricated in a 0.35-µm triple-metal CMOS process. A block selector, write drivers and read buffers are designed together with the memory cells. As on the 1st test chip, output data is obtained directly from the bit-line via an inverter without a sense amplifier. Fig. 2.26 illustrates the revised memory cells with one NMOS switch. The area overhead of the memory cell itself is 19%, and if cell area occupancy is assumed to be 60% of the total area of an SRAM macro, the area overhead is 11% compared with conventional six-transistor memory cell arrays.
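The two area figures quoted for the test chips can be cross-checked with simple arithmetic. The check assumes that the four-cell block of the 1st test chip is compared against four conventional cells of the stated size placed side by side, and that the peripheral area of the macro is unchanged by the extra switch; neither assumption is stated explicitly in the text.

$$\frac{29.55 \times 8.35}{4 \times 5.45 \times 8.35} = \frac{29.55}{21.80} \approx 1.355 \quad \Rightarrow \quad \text{cell overhead} \approx 35\%$$

$$0.19 \times 0.60 = 0.114 \quad \Rightarrow \quad \text{macro overhead} \approx 11\%$$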



Fig. 2.25 Layout of 64-Kbit SRAM on the 2nd test chip



Fig. 2.26 Layout of revised memory cells with one NMOS switch

#### 2.5. Summary

A sense-amplifying cell (SAC) scheme is presented, which enables drastic write power saving by cutting power dissipation in memory cell arrays.

An additional NMOS switch is connected to the source terminal of the driver transistors in a memory cell in order to make a write operation with a low-swing bit-line possible. A write voltage generator that is independent of V<sub>TH</sub> fluctuation is also proposed. Assuming a 4-Mbit on-chip SRAM, the SAC scheme reduces the total power dissipation in write cycles by 81% when the bit width is 256.

Performance degradation resulting from the additional NMOS switch is discussed. An NMOS switch can be shared by N (N = 2, 4, 8) cells in order to reduce area overhead. Read delay and noise margin are estimated as functions of N and the parameter $\beta$, which is the ratio of the channel width of the NMOS switch per cell to the channel width of the access transistor. Both read delay and noise margin may be degraded, but sharing the NMOS switch among N memory cells causes no further read delay increase or noise margin decrease. Considering the simulation results, $\beta$=3 and N=4 are chosen, which corresponds to a 5% read access time increase and a 0.05 V<sub>DD</sub> noise margin decrease.

Two 64-Kbit test chips have been fabricated and correct operation has been verified on the 1st test chip. Total write power is found to be reduced by 81%, based on the measured power in memory cell arrays and the calculated power in peripheral circuits. The layout of memory cells is revised on the 2nd test chip, and the area overhead is improved to 11%.

# Chapter 3 Power saving scheme for peripheral circuits and decoders

#### 3.1. Introduction

In chapter 2, an SRAM with drastic write power saving using the sense-amplifying cell (SAC) scheme is discussed. Write power saving is achieved by cutting the large power dissipation due to charging and discharging of bit-lines, which is dominant in conventional SRAM. As a result, the power dissipation of peripheral circuits becomes dominant. In the case of a 256-bit-width SRAM, the power of peripheral circuits and decoders accounts for approximately 87% of the total power in write cycles in the SAC scheme. The power of peripheral circuits and decoders is also dominant in read cycles, accounting for approximately 80% of the total power.

Peripheral circuits are composed of buffers, decoders, control circuits, sense amplifiers, and drivers that activate long wires. These circuits, except for sense amplifiers, are logic circuits; hence lowering the supply voltage (V<sub>DD</sub>) is useful for reducing the power consumption, as described in chapter 1. Fig. 3.1 shows the proposed low-power dual-V<sub>DD</sub> SRAM architecture. Buffers, decoders and control circuits are operated at the low supply voltage, and memory cells, write drivers and sense amplifiers are operated at the high supply voltage. Each signal is converted from the low swing level to the high swing level by level converters. These level converters must be fast in order to limit the delay increase they introduce.



Fig. 3.1 Low-power dual-V<sub>DD</sub> SRAM architecture.

Level conversion has been a well-known technique, used for different purposes, for more than a decade. About ten years ago, BiCMOS technology was often used for SRAM because of its high speed, and ECL-to-CMOS level converters [38] translated the ECL input signal level to the CMOS signal level. Another well-known use of level converters is I/O interfacing. Although the supply voltage inside chips has been lowered year after year, the swing level on the board still remains high, so level conversion is needed at I/O interfaces. Low-swing bus architectures for low-power applications [39-40] also need level converters in receiver circuits.

Various types of level converters for interconnect interface use are reported [41].



Fig. 3.2 Various types of level converters. (a) Conventional level converter. (b) Symmetric source-follower driver with level converter (SSDLC) (driver is not shown here). (c) Static driver with VST (SDVST) (driver is not shown here). (d) Level converter with low-V<sub>TH</sub> device (LCLVD). (e) Capacitive-coupled level converter (CCLC). (f) Level-converting register (LCR).

Type (a) is a conventional cross-coupled level converter. It behaves like a differential sense amplifier by generating a complementary input signal internally. This simple cross-coupled level converter has a small delay but is unable to convert a low-level signal of around 0.5 V. Type (b) is intended only for low-swing interconnects whose swing level is from $V_{TH,n}$ to $V_{DD}$-$V_{TH,n}$. Type (c) features high speed. Type (d) is the same as the conventional one except that it uses low-$V_{TH}$ MOS transistors, and is therefore able to convert a low-level signal of around 0.5 V. Types (e) and (f) feature both high speed and low power, but require extra timing circuits.

From Fig. 3.2, the conventional level converter with low-$V_{TH}$ MOS transistors (type (d)) is the best choice as a high-speed level converter at low $V_{DD}$. SDVST (type (c)) is also useful if it is composed of low-$V_{TH}$ MOS transistors. Neither level converter, however, is fast enough to convert low-swing decoded signals, because the level converter is inserted into the critical path and its delay should be as small as possible.

#### 3.2. Bypass level converter (BLC) and revised pass transistor type level converter

Fig. 3.3 shows the conventional level converter with low-V<sub>TH</sub> MOS transistors, the proposed bypass level converter (BLC) and the revised pass transistor type level converter. Each level converter uses multiple-threshold-voltage MOS transistors in order to convert a low-swing signal of around 0.5 V. In the proposed level converter, an additional NMOS N3 is inserted between the input node and the output node, and a reference voltage is applied to the gate of N3. The reference voltage V<sub>REF</sub> is defined as $V_{DD} - V_{TH,LOW}$. The pass transistor type level converter is obtained by removing N2 and an inverter from the BLC. The pass transistor type level converter [42] is often used in I/O circuits, where V<sub>DDH</sub> is usually applied to the gate of the pass transistor N3. In the revised version, the reference voltage is applied to the gate of N3 instead.

The reason why the BLC is faster than the conventional level converter is explained with Fig. 3.4 and Fig. 3.5, which show waveforms simulated with HSPICE in the 0.13-$\mu$m generation. In the conventional level converter, the potential of the output node OUT changes only after the inverter output signal INbar appears. In the BLC, when the input signal IN changes from low to high, transistor N3 is on for a while and the potential of OUT rises, pulled up by the low voltage V<sub>DDL</sub>, as shown within the left ellipse in Fig. 3.5. As the potential of OUT rises above V<sub>REF</sub> – V<sub>TH</sub>, N3 turns off and the potential of OUT is pulled up by the high voltage V<sub>DDH</sub>. When the input signal IN changes from high to low, transistor N3 turns on and the potential of OUT is immediately pulled down through N3, as shown within the right ellipse in Fig. 3.5. The pass transistor type level converter operates in the same way as the BLC. The only difference is that it has no transistor N2 to pull down the potential of OUT.



Fig. 3.3 Circuit schematic of level converters. (a) Conventional level converter with low-V<sub>TH</sub> MOS. (b) Proposed bypass level converter (BLC) (c) Revised pass transistor type level converter.



Fig. 3.4 Simulated waveforms of conventional level converter.



Fig. 3.5 Simulated waveforms of bypass level converter.

Delay and energy dissipation of the three types of level converters are compared by HSPICE simulation. Delay is estimated with the rise delay and fall delay of the output node made equal. The widths of all the transistors are optimized for each parameter value in every level converter. Fig. 3.6 and Fig. 3.7 show the dependence on the high supply voltage (V<sub>DDH</sub>) and on the output load capacitance (C<sub>L</sub>), respectively. The dependence on the threshold voltage of the low-V<sub>TH</sub> MOS transistors (V<sub>TH, LOW</sub>) is shown in Fig. 3.8 to Fig. 3.10. In every case, the BLC is faster than the conventional level converter, and the revised pass transistor type level converter is the fastest. The simulation results indicate that transistor N2 and the inverter in the BLC are unnecessary for conversion and that transistor N3 plays an important role. The pass transistor type level converter also gives the best results with respect to energy dissipation, except in Fig. 3.7. When V<sub>TH, LOW</sub> is low, energy dissipation increases because of leakage current, and when V<sub>TH, LOW</sub> is high, energy also increases because of short-circuit current.



Fig. 3.6 Dependence of level converters on V<sub>DDH</sub>. (a) Delay. (b) Energy.



Fig. 3.7 Dependence of level converters on C<sub>L</sub>. (a) Delay. (b) Energy.



Fig. 3.8 Dependence on V<sub>TH, LOW</sub> at V<sub>DDH</sub> of 1.0 V. (a) Delay. (b) Energy.



Fig. 3.9 Dependence on V<sub>TH, LOW</sub> at V<sub>DDH</sub> of 1.5 V. (a) Delay. (b) Energy.



Fig. 3.10 Dependence on V<sub>TH, LOW</sub> at V<sub>DDH</sub> of 2.0 V. (a) Delay. (b) Energy.
#### 3.3. Replica-biased level converter

Section 3.2 showed that the revised pass transistor type level converter is the fastest of the three. However, a much faster level converter is required for the critical path in the dual-V<sub>DD</sub> SRAM architecture.

In the dual-V<sub>DD</sub> SRAM architecture, level converters are inserted between the row decoders and the global word lines, as shown in Fig. 3.11. Only one path is active at a time and the other paths are not selected, so the power dissipation of the level converter is not a significant concern. A level-converting decoder that functions as a level converter and as an AND gate at the same time is therefore presented.



Fig. 3.11 Dual-V<sub>DD</sub> SRAM architecture with level-converting decoder.

Fig. 3.12 shows various types of level converters with an AND-gate function. Types (a) and (b) are the same level converters as shown in section 3.2, and types (c) and (d) are newly proposed level converters. The pseudo NMOS type level converter is the simplest and fastest one; however, it is extremely sensitive to threshold voltage fluctuation and gate width fluctuation. Type (d) is a revised pseudo NMOS level converter with replica biasing, which reduces the influence of device fluctuations.



Fig. 3.12 Various types of level converters with function of AND gate. (a) Conventional level converter. (b) Revised pass transistor type level converter. (c) Pseudo NMOS type level converter. (d) Pseudo NMOS type level converter with replica biasing

Delay and energy dissipation of these four kinds of level converters with an AND-gate function are compared by HSPICE simulation in the 0.13-µm generation. Delay is estimated with the rise delay and fall delay of the output node made equal, and energy is estimated assuming a cycle time of 1 ns for both rise and fall. The widths of all the transistors are optimized for each parameter value in every level converter. Fig. 3.13 and Fig. 3.14 show the dependence on the high supply voltage (V<sub>DDH</sub>) and on the output load capacitance (C<sub>L</sub>), respectively. The dependence on the threshold voltage of the low-V<sub>TH</sub> MOS transistors (V<sub>TH, LOW</sub>) is shown in Fig. 3.15.





Fig. 3.15 Dependence on V<sub>TH, LOW</sub>. (a) Delay. (b) Energy.

The pseudo NMOS level converter is the fastest and the replica-biased level converter is the next fastest in every case. With respect to energy dissipation, the pseudo NMOS and replica-biased level converters are worse than the conventional and pass transistor type level converters because of static current while the output data is '1'. The static current is not a significant issue, however, since only the one level converter selected by the row address is activated, as shown in Fig. 3.11. The pseudo NMOS level converter is the fastest one when the transistor widths are optimized for each parameter value; however, it is very sensitive to V<sub>TH, LOW</sub> fluctuation. Fig. 3.16 shows the dependence on V<sub>TH, LOW</sub> fluctuation of the NMOS transistors in each level converter, assuming that the widths of all the transistors are optimized and fixed at $V_{DDH}$ of 1.5 V and $V_{TH, LOW}$ of 0.19 V. Rise delay and fall delay are equal in every level converter when there is no V<sub>TH, LOW</sub> fluctuation; however, the rise delay of the conventional, pass transistor type and pseudo NMOS level converters increases drastically when V<sub>TH, LOW</sub> shifts upward. The pseudo NMOS level converter does not work when V<sub>TH, LOW</sub> shifts by +0.10 V. Only the replica-biased level converter keeps its rise delay and fall delay equal, and its delay increase is much smaller than that of the other level converters. Therefore the replica-biased level converter is suitable as the level-converting decoder in the low-power dual-V<sub>DD</sub> SRAM architecture.



Fig. 3.16 Dependence on V<sub>TH</sub> fluctuation.

Fig. 3.17 shows the calculated total power dissipation of the assumed 4-Mbit SRAM in a write cycle and in a read cycle versus bit width. The power dissipation is calculated for the dual-V<sub>DD</sub> SRAM architecture, for the SAC scheme, and for the case in which both schemes are adopted. The SAC scheme reduces the write power by 81% and the dual-V<sub>DD</sub> SRAM architecture cuts the power of peripheral circuits by 54%; in total, the power in a write cycle is reduced by 90% when both schemes are adopted, as shown in Fig. 3.18. Read power is also reduced by 48% by the dual-V<sub>DD</sub> SRAM architecture.
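The combined 90% figure is consistent with the individual numbers quoted in this chapter, as the following sketch illustrates. It is a back-of-the-envelope check rather than the calculation of appendix A: it assumes that the 87% peripheral share stated in section 3.1 applies to the power remaining after the SAC scheme, and that the dual-V<sub>DD</sub> architecture leaves the memory cell array power unchanged. The function and parameter names are illustrative.

```python
def combined_write_saving(sac_saving: float,
                          peripheral_share_in_sac: float,
                          peripheral_saving: float) -> float:
    """Total write-power saving relative to the conventional SRAM.

    sac_saving              fraction of conventional write power removed by SAC
    peripheral_share_in_sac fraction of the remaining power spent in
                            peripheral circuits and decoders
    peripheral_saving       fraction of that peripheral power removed by dual-VDD
    """
    remaining = 1.0 - sac_saving
    peripheral = remaining * peripheral_share_in_sac
    array = remaining - peripheral  # assumed unaffected by dual-VDD
    return 1.0 - (array + peripheral * (1.0 - peripheral_saving))


saving = combined_write_saving(sac_saving=0.81,
                               peripheral_share_in_sac=0.87,
                               peripheral_saving=0.54)
print(f"combined write-power saving: {saving:.1%}")
```

Under these assumptions the result rounds to the 90% stated above, which suggests that the quoted percentages are mutually consistent.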



Fig. 3.17 Calculated total power dissipation of assumed 4-Mbit SRAM in a write cycle and in a read cycle versus bit width.



Fig. 3.18 Power consumption of assumed 4-Mbit SRAM at bit width of 256.

#### 3.4. Fabrication

Two test chips containing level converters are fabricated. Fig. 3.19 shows the layout of four kinds of level converters (conventional, BLC, pseudo NMOS, and replica-biased) fabricated in a 0.35-µm triple-metal CMOS process. A dummy word line connected to 256 cells is attached to each level converter in order to evaluate the approximate word-line delay in a 256-bit-width SRAM.



Fig. 3.19 Layout of level converters in 0.35-µm technology.



Fig. 3.20 Layout of 2-Kbit SRAM with level converters in 0.25-µm SOI technology.



Fig. 3.21 Layout and circuit schematic of level-converting column decoder.

Fig. 3.20 shows the layout of the 2-Kbit SRAM with replica-biased level converters in 0.25-µm SOI technology. This 2-Kbit SRAM is embedded as a memory for a test processor chip operated at a supply voltage of 0.5 V. All the input signals, such as addresses, data, control signals, and clock, are input at a swing level of 0.5 V and converted to the 1.0 V level by replica-biased level converters with an AND-gate function. The layout of a replica-biased level converter inserted in the column decoder is shown in Fig. 3.21. The biasing voltage V<sub>BIAS</sub> must be stable for fast level conversion to operate smoothly; therefore the biasing voltage line is shielded on both sides by two ground lines that come directly from the outer ground ring line. The details of this test chip are described in appendix B.2.1.

#### 3.5. Summary

A new low-power dual-V<sub>DD</sub> SRAM architecture is proposed. Buffers, decoders and control circuits are operated at the low supply voltage, and memory cells, write drivers and sense amplifiers are operated at the high voltage. Each signal is converted from the low swing level to the high swing level by level converters. These level converters must be fast in order to limit the delay increase they introduce; therefore high-speed level converters are discussed.

First, the bypass level converter (BLC) and the revised pass transistor type level converter are proposed. Both delay and energy dissipation are improved; however, these level converters are not fast enough for the critical path in decoders.

Then, a pseudo NMOS type level converter with replica biasing is proposed, which has both a level-converting function and an AND-gate function. It achieves high-speed level conversion with little influence from threshold voltage fluctuation.

Assuming a 4-Mbit SRAM, the dual-V<sub>DD</sub> SRAM architecture cuts the power of peripheral circuits by 54%, and the total power in a write cycle is reduced by 90% when the dual-V<sub>DD</sub> SRAM architecture is adopted together with the SAC scheme.

A test chip with four types of level converters is fabricated in 0.35-µm technology, and a 2-Kbit SRAM with level converters is fabricated in 0.25-µm SOI technology.

#### 4.1. Introduction

Register files are an important storage element in a microprocessor. They provide read and write data access for the microprocessor within a clock cycle, and their hit rate is extremely high compared with the other memories. Therefore the register file is one of the most power-consuming blocks and a low-power approach is required.

A register file cell is normally the same as an SRAM cell, which is composed of six transistors, but a register file has multi-port cells in order to enable both write and read operations on the same column simultaneously. Fig. 4.1 shows the cell structure of a basic register file: a three-port memory cell with two read ports and one write port. The memory cell contains extra gates to prevent a read operation from disturbing the cell data. There are peripheral circuits and decoders for each port. There is no structural difference between a register file and an SRAM except the number of ports. Thus, the sense-amplifying cell (SAC) scheme shown in chapter 2 and the dual-V<sub>DD</sub> architecture shown in chapter 3 are also useful for register files.



Fig. 4.1 Circuit schematic of three-port register file.

Section 4.2 describes the application of the sense-amplifying cell (SAC) scheme to register files and discusses the reduction of write power in charging bit-lines. Section 4.3 describes the application of the dual-V<sub>DD</sub> architecture with replica-biased level converters to register files and discusses power reduction in peripheral circuits and decoders.

#### 4.2. Sense-amplifying cell (SAC) scheme for register files

Fig. 4.2 (a) and (b) show the circuit schematics of register file cells with an additional NMOS switch. The cell itself is the same as that of the SAC scheme SRAM except that there are one write port and two read ports. The cell functions like a sense amplifier in a write operation with a low-swing bit-line, so that power dissipation in charging the bit-line is dramatically reduced. The difference between the cells in Fig. 4.2 (a) and (b) lies in the connection of the write port and read ports. The former (shared data node type) is simpler, but a read operation disturbs the data node; the latter (separated data node type) therefore contains extra gates to protect the cell data.



Fig. 4.2 Circuit schematics of register file cells. (a) Write port and read port are connected to the same node through pass transistor. (b) Extra gates separate data node and read port to avoid the influence of read operation in the cell data.

Static noise margin of register file cells is simulated using the circuit shown in Fig. 4.3, and Fig. 4.4 shows the simulated static noise margin of the register files. In the shared node type register file, static noise margin decreases during a read operation, and the proposed register file cell gives a worse result than the conventional register file cell because of the additional NMOS switch, as shown in Fig. 4.4 (a1) and (a2). In the separated data node type register file, however, static noise margin does not decrease during a read operation, so there is no degradation in the proposed register file, as shown in Fig. 4.4 (b1) and (b2). The characteristics of read delay and static noise margin depend only on the extra gates. Therefore, unlike SAC scheme SRAM cells, the proposed cell with an additional NMOS switch causes no performance degradation.



Fig. 4.3 Method of static noise margin analysis.



Fig. 4.4 Simulation results on static noise margin. (a) Shared node type. (b) Separated node type. (1) Conventional. (2) Proposed.

Fig. 4.5 shows the layout of the two types of register files, the shared data node type and the separated data node type, in Rohm 0.35-µm technology. The two types are placed in different columns of the same array. The memory cell array is organized as 64 words × 16 bits. The shared data node type cells occupy the right 8 columns and the separated data node type cells occupy the left 8 columns. The test chip contains the memory cell array, precharge circuits, write drivers, and read buffers. Fig. 4.6 shows the layout pattern of one register file cell, and one NMOS switch is shared by four memory cells as shown in Fig. 4.7. The memory cell size is 11.0 × 14.0 µm<sup>2</sup>. The area overhead is only 6%, compared with 11% for the 4-Mbit SRAM in Chapter 2. Therefore, the SAC scheme is more suitable for a register file.



Fig. 4.5 Layout of register files in 0.35-µm technology.



Fig. 4.6 Layout of separated data node type register file cell.



Fig. 4.7 Layout of four-register file cells with one NMOS switch.

## 4.3. Dual-V<sub>DD</sub> architecture for register files

Since the memory cell array of a register file is much smaller than that of an SRAM, the power dissipated in peripheral circuits and decoders is relatively more dominant than in an SRAM. Thus, the dual-V<sub>DD</sub> architecture is even more useful for register files. Peripheral circuits and decoders operate at a lower supply voltage, and the memory cells operate at a higher supply voltage. The replica-biased level converter, which is fast and also functions as an AND gate, converts the low-swing signal to a high-swing signal with an acceptable conversion delay.

Fig. 4.8 shows the calculated total power dissipation of an assumed 64-word register file with one write port and two read ports, estimated as a function of bit width. The SAC scheme reduces the bit-line power at the write port, and reduces the total power by 35% at a bit width of 256 as shown in Fig. 4.9. The dual-V<sub>DD</sub> architecture cuts the power of the peripheral circuits at the read port by 47%. The power of the peripheral circuits at the write port is scarcely reduced, since only a part of the decoders operates at the low supply voltage at the write port. When both schemes are adopted, the total power in a write cycle is reduced by 60%. The details of the power calculation are described in Appendix A.3.



Fig. 4.8 Calculated total power dissipation of assumed 64-word register file with one write port and two read ports versus bit width.



Fig. 4.9 Power consumption of register file at bit width of 256.

Fig. 4.10 shows the layout of the register file with replica-biased level converters in 0.25-µm SOI technology. A 16-word × 16-bit register file is fabricated. Fig. 4.11 shows the layout of one register file cell. The memory cell size is 7.64 × 10.12 µm<sup>2</sup>. The structure of this test chip is the same as that of the 2-Kbit SRAM fabricated in the same technology. The details of the test chip are described in Appendix B.2.2.



Fig. 4.10 Layout of register file with level converters in 0.25-µm SOI technology.



Fig. 4.11 Layout of register file cell.

## 4.4. Summary

Low-power approaches for register files are discussed using the same methods as those for SRAM described in Chapter 2 and Chapter 3.

First, the sense-amplifying cell (SAC) scheme is applied to register files, power reduction in charging the bit-lines is achieved, and a test chip is fabricated. The total area increases by 6%.

Next, the dual-V<sub>DD</sub> architecture, with a lower supply voltage for peripheral circuits and a higher supply voltage for memory cells, enables power reduction in peripheral circuits and decoders. The replica-biased level converter, which also functions as an AND gate, converts the low-swing signal to a high-swing signal with an acceptable conversion delay. A 16-word × 16-bit register file is fabricated.

When a 64-word × 256-bit register file with one write port and two read ports is assumed, the SAC scheme reduces the total power by 35%, and the dual-V<sub>DD</sub> architecture cuts the power of the peripheral circuits at the read port by 47%. The total power is reduced by 60% when both schemes are adopted.

Therefore, the SAC scheme and the dual-V<sub>DD</sub> architecture are shown to be useful for register files as well as for SRAM's.

Next-generation LSI's are expected to suffer from a power crisis caused by the rapid increase in their performance, and power saving in LSI's is the most urgent concern. SRAM is often used as high-speed memory in a processor, where it accounts for approximately 30% of the total power. Write power dissipation in SRAM is particularly dominant in continuous write and read operation because of the full-swing nature of the bit-line. Peripheral circuits and decoders also consume much power at large bit widths because of the large number of long buses with heavy capacitance.

Therefore, this research presents write power saving schemes for SRAM without a large penalty. In this work, two key techniques for write power saving have been proposed.

• Sense-amplifying cell (SAC) scheme for power saving in bit-line.

Power dissipation in charging the high-capacitance bit-line is reduced by using a low-swing bit-line. An additional NMOS switch enables a write operation with a low-swing bit-line. The performance degradation due to the NMOS switch is a 5% increase in read access time, a 0.05V<sub>DD</sub> decrease in noise margin, and an 11% area overhead. Correct operation has been verified on the test chip. From measurement results and calculation, the total write power is reduced by 81% in a 4-Mbit SRAM at a bit width of 256.

• Dual-V<sub>DD</sub> SRAM architecture for peripheral circuits.

Power dissipation in peripheral circuits is reduced by lowering the $V_{DD}$ of decoders, data buses, buffers, and control circuits. The low-swing signal is converted to a high-swing signal by the proposed high-speed level converter, the replica-biased level converter, which features high immunity to $V_{TH}$ fluctuation and also functions as an AND gate. Thus there is no significant delay penalty in decoding. Assuming a 4-Mbit SRAM, the dual-$V_{DD}$ SRAM architecture cuts the power of peripheral circuits by 54%, and the total power in a write cycle is reduced by 90% when the dual-$V_{DD}$ SRAM architecture is adopted together with the SAC scheme. A test chip with four types of level converters and a 2-Kbit SRAM with level converters have been fabricated.

These two schemes are also applied to register files. Compared with SRAM, a register file has an advantage in using the sense-amplifying cell (SAC) scheme in that the area increase can be kept low, since its cell is larger than an SRAM cell. The total area increases by 6% when the SAC scheme is applied to a 64-word × 256-bit register file. The SAC scheme reduces the total power by 35%, and the dual-V<sub>DD</sub> architecture cuts the power of the peripheral circuits at the read port by 47%. The total power is reduced by 60% when both schemes are adopted for the register file.

## A.1. Sense-amplifying cell (SAC) scheme



Fig. A.1 Circuit schematic of the whole of assumed 4-Mbit SRAM.

[Floor plan: 1-Mb cell arrays of 1024 columns (6.643 mm) × 1024 rows (9.216 mm), each divided into four blocks (Block 0 to Block 15 in total), with row decoders and a block selector beside the arrays, and the predecoder, write drivers, I/O buffers, data lines, column decoders, and sense amplifiers placed between the upper and lower arrays.]

Fig. A.2 Floor plan of assumed 4-Mbit SRAM.

| Parameter | Value | Description |
|-----------|-------|-------------|
| $C_{G}$ | $1.3 \times 10^{-15}$ F/µm | gate equivalent capacitance of NMOS |
| $C_{D}$ | $1.7 \times 10^{-15}$ F/µm | drain equivalent capacitance of NMOS |
| $C_{M2}$ | $0.062 \times 10^{-15}$ F/µm | metal #2 parasitic capacitance |
| $C_{M3}$ | $0.057 \times 10^{-15}$ F/µm | metal #3 parasitic capacitance |
| $C_{M2, BL}$ | $0.157 \times 10^{-15}$ F/µm | bit line parasitic capacitance |
| $C_{M2, DL}$ | $0.105 \times 10^{-15}$ F/µm | data line parasitic capacitance |
| $C_{M2, DECL}$ | $0.019 \times 10^{-15}$ F/µm | decoded line parasitic capacitance |
| $L_{X}$ | 6.6 mm | horizontal length of 1 Mbit SRAM cells |
| $L_{Y}$ | 9.2 mm | vertical length of 1 Mbit SRAM cells |
| $W_{3nand}$ | 3.5 µm | the sum of NMOS and PMOS gate width of 3nand gate |
| $W_{2nand}$ | 3.0 µm | the sum of NMOS and PMOS gate width of 2nand gate |
| $W_{pre4nand}$ | 8.0 µm | the sum of NMOS and PMOS gate width of 4nand gate |
| $W_{pre3nand}$ | 7.0 µm | the sum of NMOS and PMOS gate width of 3nand gate |
| $W_{pre2nand}$ | 6.0 µm | the sum of NMOS and PMOS gate width of 2nand gate |
| $B$ | 16, 32, 64, 128, 256 | bit width |
| Cell pitch | 6.5 µm | one NMOS switch is shared by 4 cells |
| $V_{DD}$ | 1.5 V | supply voltage |
| $f_{CLK}$ | 100 MHz | clock frequency |
| $V_{DDH}$ | 1.5 V | high $V_{DD}$ |
| $V_{DDL}$ | 0.75 V | low $V_{DD}$ |
| $\lambda - 1$ | | the ratio of driver capacitance to load |

Fig. A.3 Parameters of 4-Mbit SRAM in Rohm 0.35-µm technology.





$$\begin{split}
C_{\text{EQ}} &= (12\mu\text{m} \times 2 \times B \times C_{G} + L_{Y} \times C_{M2} + L_{X} \times C_{M2} + L_{X} \times C_{M3}) \times \lambda \\
C_{\text{SL}} &= (1.8\mu\text{m} \times B \times C_{G} + L_{Y} \times C_{M2} + L_{X} \times C_{M2} + L_{X} \times C_{M3}) \times \lambda \\
C_{\text{WE}} &= (7\mu\text{m} \times 2 \times B \times C_{G} + L_{X} \times C_{M3}) \times \lambda \\
C_{\text{SAE}} &= (10\mu\text{m} \times B \times C_{G} + L_{X} \times C_{M2} + L_{X} \times C_{M3}) \times \lambda \\
C_{\text{CLK}} &= (L_{X} \times C_{M3}) \times \lambda \\
C_{\text{Din}} &= (7\mu\text{m} \times 2 \times B \times C_{G} + L_{X} \times C_{M3} \times B) \times \lambda \\
C_{\text{Dout}} &= (L_{X} \times C_{M2} \times B/2 + L_{X} \times C_{M3} \times B) \times \lambda \\
C_{\text{ADDR}} &= (L_{X} \times C_{M3} \times (22 - \log_2 B)) \times \lambda \\
C_{\text{DL}} &= L_{X} \times C_{M2, DL} \times B \\
C_{\text{BL}} &= (L_{Y} \times C_{M2, BL} + 0.6\mu\text{m} \times 1024 \times C_{D}) \times B \\
C_{\text{decoder}} &= C_{\text{LWL}} + C_{\text{GWL}} + C_{\text{DECL}} + C_{\text{preDEC}} + C_{\text{BSL}} + C_{\text{COLDEC}} \\
&= \{(L_{\text{LWL}} + L_{X}) \times C_{M3} + L_{Y} \times C_{M2, DECL} \times 3 + L_{Y} \times C_{M2} \\
&\quad + (61.2\mu\text{m} \times B + 320 \times W_{3nand} + W_{pre3nand} \times 48 + W_{pre4nand} \times 64) \times C_{G} + C_{\text{COLDEC}}\} \times \lambda \\
C_{\text{COLDEC}} &= \begin{cases}
(W_{2nand} \times 16 + W_{pre3nand} \times 48) \times C_{G} & (B = 16) \\
(W_{2nand} \times 12 + W_{pre3nand} \times 24 + W_{pre2nand} \times 8) \times C_{G} & (B = 32) \\
(W_{2nand} \times 8 + W_{pre2nand} \times 16) \times C_{G} & (B = 64) \\
W_{3nand} \times 8 \times C_{G} & (B = 128) \\
W_{2nand} \times 4 \times C_{G} & (B = 256)
\end{cases}
\end{split}$$

$\lambda$: load/driver ratio.



#### Write cycle

$$\begin{split}
C_{\text{peri, total}} &= C_{\text{EQ}} + C_{\text{SL}} + C_{\text{WE}} + C_{\text{CLK}} + C_{\text{Din}} + C_{\text{ADDR}} + C_{\text{decoder}} \\
&= (54\mu\text{m} \times B \times C_{G} + (L_{X} + L_{Y}) \times C_{M2} \times 2 + L_{X} \times C_{M3} \times (B + 4 + (22 - \log_2 B))) \times \lambda + C_{\text{decoder}} \\
C_{\text{BL}} + C_{\text{DL}} &= (L_{Y} \times C_{M2, BL} + 614\mu\text{m} \times C_{D} + L_{X} \times C_{M2, DL}) \times B \\
\text{Power (write)} &= P_{\text{peri}} + P_{\text{BL, DL}} + P_{\text{cell}} \\
&= f_{\text{CLK}} \times \{(C_{\text{peri, total}} \times V_{DD}^2 + (C_{\text{BL}} + C_{\text{DL}}) \times (1/6\ V_{DD})^2) + 10\text{fQ} \times V_{DD} \times B\}
\end{split}$$
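As a rough numerical illustration of the bit-line term in the write-cycle expression above, the following Python sketch evaluates $C_{BL} + C_{DL}$ and the corresponding dynamic power from the parameters of Fig. A.3. The function names are illustrative, and the full-swing case (swing equal to V<sub>DD</sub>) is an assumed reference for comparison, not a formula taken from this appendix.

```python
# Sketch: bit-line and data-line write power of the assumed 4-Mbit SRAM.
# Parameter values come from Fig. A.3; all names here are illustrative.

C_D = 1.7e-15        # F/um, drain equivalent capacitance of NMOS
C_M2_BL = 0.157e-15  # F/um, bit-line parasitic capacitance
C_M2_DL = 0.105e-15  # F/um, data-line parasitic capacitance
L_X = 6600.0         # um, horizontal length of 1-Mbit SRAM cells
L_Y = 9200.0         # um, vertical length of 1-Mbit SRAM cells
V_DD = 1.5           # V, supply voltage
F_CLK = 100e6        # Hz, clock frequency


def bitline_dataline_capacitance(bit_width):
    """Return C_BL + C_DL in farads for bit width B."""
    per_bit = L_Y * C_M2_BL + 614.0 * C_D + L_X * C_M2_DL
    return per_bit * bit_width


def bitline_write_power(bit_width, swing):
    """Return f_CLK * (C_BL + C_DL) * swing^2 in watts."""
    return F_CLK * bitline_dataline_capacitance(bit_width) * swing ** 2


if __name__ == "__main__":
    for b in (16, 32, 64, 128, 256):
        low = bitline_write_power(b, V_DD / 6.0)
        # Assumed reference: bit-line swinging over the full V_DD.
        full = bitline_write_power(b, V_DD)
        print(f"B={b:3d}: low-swing {low * 1e3:.3f} mW, "
              f"full-swing {full * 1e3:.1f} mW")
```

Because the power scales with the square of the swing, a swing of V<sub>DD</sub>/6 lowers this term by a factor of 36 relative to the assumed full-swing reference, independent of bit width.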

#### Read cycle

$$\begin{split}
C_{\text{peri, total}} &= C_{\text{EQ}} + C_{\text{SAE}} + C_{\text{CLK}} + C_{\text{Dout}} + C_{\text{ADDR}} + C_{\text{decoder}} \\
&= (34\mu\text{m} \times B \times C_{G} + ((B/2 + 2) \times L_{X} + L_{Y}) \times C_{M2} + L_{X} \times C_{M3} \times (B + 3 + (22 - \log_2 B))) \times \lambda + C_{\text{decoder}} \\
C_{\text{BL}} + C_{\text{DL}} &= (L_{Y} \times C_{M2, BL} + 614\mu\text{m} \times C_{D} + L_{X} \times C_{M2, DL}) \times B \\
\text{Power (read)} &= P_{\text{peri}} + P_{\text{BL, DL}} + P_{\text{sense amp}} \\
&= f_{\text{CLK}} \times \{(C_{\text{peri, total}} \times V_{DD}^2 + (C_{\text{BL}} + C_{\text{DL}}) \times V_{DD} \times 0.1\text{V}) + 50\text{fQ} \times V_{DD} \times B\}
\end{split}$$
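The closed-form peripheral capacitances of the write and read cycles can be cross-checked against the per-signal expressions listed before the write-cycle totals. The Python sketch below sums the individual signal-line capacitances (excluding $C_{\text{decoder}}$) and compares them with the closed forms. The value of $\lambda$ is not given in Fig. A.3, so `lam=1.0` is only a placeholder assumption, and all function names are illustrative.

```python
import math

# Parameters from Fig. A.3 (lengths in um, capacitances in F/um).
C_G = 1.3e-15
C_M2 = 0.062e-15
C_M3 = 0.057e-15
L_X = 6600.0
L_Y = 9200.0


def line_capacitances(b, lam=1.0):
    """Per-signal capacitances in farads; lam is an assumed placeholder."""
    addr_lines = 22 - math.log2(b)
    return {
        "EQ": (12 * 2 * b * C_G + L_Y * C_M2 + L_X * C_M2 + L_X * C_M3) * lam,
        "SL": (1.8 * b * C_G + L_Y * C_M2 + L_X * C_M2 + L_X * C_M3) * lam,
        "WE": (7 * 2 * b * C_G + L_X * C_M3) * lam,
        "SAE": (10 * b * C_G + L_X * C_M2 + L_X * C_M3) * lam,
        "CLK": (L_X * C_M3) * lam,
        "Din": (7 * 2 * b * C_G + L_X * C_M3 * b) * lam,
        "Dout": (L_X * C_M2 * b / 2 + L_X * C_M3 * b) * lam,
        "ADDR": (L_X * C_M3 * addr_lines) * lam,
    }


def write_peripheral(b, lam=1.0):
    """Sum of the signal lines active in a write cycle (no decoder)."""
    c = line_capacitances(b, lam)
    return sum(c[k] for k in ("EQ", "SL", "WE", "CLK", "Din", "ADDR"))


def read_peripheral(b, lam=1.0):
    """Sum of the signal lines active in a read cycle (no decoder)."""
    c = line_capacitances(b, lam)
    return sum(c[k] for k in ("EQ", "SAE", "CLK", "Dout", "ADDR"))


def write_closed_form(b, lam=1.0):
    return (54 * b * C_G + (L_X + L_Y) * C_M2 * 2
            + L_X * C_M3 * (b + 4 + (22 - math.log2(b)))) * lam


def read_closed_form(b, lam=1.0):
    return (34 * b * C_G + ((b / 2 + 2) * L_X + L_Y) * C_M2
            + L_X * C_M3 * (b + 3 + (22 - math.log2(b)))) * lam


if __name__ == "__main__":
    for b in (16, 32, 64, 128, 256):
        print(b, write_peripheral(b), write_closed_form(b),
              read_peripheral(b), read_closed_form(b))
```

The read-cycle sum agrees with its closed form term by term. The write-cycle gate widths add up to 53.8 µm, which the closed form rounds to 54 µm, so the two write-cycle values differ slightly.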





Fig. A.8 Critical path from memory cell to data output.

$$\begin{split}
\tau_1 &= ((L_X \times C_{M3} + C_{G1}) \times V_{DD}) / I_1 = ((6.6 \times 0.57 + 0.02) \times 1.5) / 2.0 \times 10^{-3} = 0.30 \text{ ns} \\
\tau_2 &= ((L_Y \times C_{M2, DECL} + C_{G2}) \times V_{DD}) / I_2 = ((9.2 \times 0.19 + 0.58) \times 1.5) / 1.5 \times 10^{-3} = 0.75 \text{ ns} \\
\tau_3 &= (((3/4) L_X \times C_{M3} + C_{G3}) \times V_{DD}) / I_3 = ((4.9 \times 0.57 + 0.02) \times 1.5) / 0.5 \times 10^{-3} = 0.90 \text{ ns} \\
\tau_4 &= (((1/4) L_X \times C_{M3} + C_{G4}) \times V_{DD}) / I_4 = ((1.7 \times 0.57 + 0.40) \times 1.5) / 1.5 \times 10^{-3} = 0.50 \text{ ns} \\
\tau_5 &: \text{depends on parameter } \beta \\
\tau_6 &= 0.97 \text{ ns (simulated)} \\
\tau_7 &= ((L_X \times C_{M2} + C_{G7}) \times V_{DD}) / I_7 = ((6.6 \times 0.62 + 0.03) \times 1.5) / 1.5 \times 10^{-3} = 0.60 \text{ ns} \\
\tau_8 &= ((L_X \times C_{M3} + C_{G8}) \times V_{DD}) / I_8 = ((6.6 \times 0.57 + C_{G8}) \times 1.5) / 2.0 \times 10^{-3} = 0.29 \text{ ns} \\
\tau_1' &= 0.18 \text{ ns} \\
\tau_2' &= 0.18 + 0.22 + 0.18 + 0.30 = 0.88 \text{ ns} \\
\tau_3' &= 0.27 \text{ ns} \\
\tau_4' &= 0.22 \text{ ns} \\
\tau_7' &= 0.22 + 0.22 + 0.18 = 0.62 \text{ ns} \\
\tau_8' &= 0.22 + 0.22 + 0.18 = 0.62 \text{ ns} \\
\tau_{ADDR} &= \tau_1' + \tau_1 + \tau_2' + \tau_2 + \tau_3' + \tau_3 + \tau_4' + \tau_4 = 4.00 \text{ ns} \\
\tau_{\text{sense}} &= \tau_5 + \tau_6 \\
\tau_{Dout} &= \tau_7' + \tau_7 + \tau_8' + \tau_8 = 2.13 \text{ ns}
\end{split}$$

Total read access time = $\tau_{ADDR} + \tau_{\text{sense}} + \tau_{Dout} = 7.10 \text{ ns} + \tau_5$

Fig. A.9 Read access time estimation of 4-Mbit SRAM.
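The delay budget of Fig. A.9 can be tallied with a few lines of Python. The sketch below uses only the per-stage values listed in the figure and leaves $\tau_5$ as an input, since it depends on the parameter $\beta$. The dictionary and function names are illustrative.

```python
# Sketch: summing the stage delays of Fig. A.9 (all values in ns).
TAU = {1: 0.30, 2: 0.75, 3: 0.90, 4: 0.50, 7: 0.60, 8: 0.29}
TAU_PRIME = {1: 0.18, 2: 0.88, 3: 0.27, 4: 0.22, 7: 0.62, 8: 0.62}
TAU_6 = 0.97  # simulated value quoted in Fig. A.9


def address_delay():
    """tau_ADDR: stages 1 to 4, primed and unprimed terms."""
    return sum(TAU_PRIME[i] + TAU[i] for i in (1, 2, 3, 4))


def data_out_delay():
    """tau_Dout: stages 7 and 8, primed and unprimed terms."""
    return sum(TAU_PRIME[i] + TAU[i] for i in (7, 8))


def read_access_time(tau_5):
    """Total read access time for a given tau_5 (ns)."""
    return address_delay() + (tau_5 + TAU_6) + data_out_delay()


if __name__ == "__main__":
    print(f"tau_ADDR = {address_delay():.2f} ns")
    print(f"tau_Dout = {data_out_delay():.2f} ns")
    print(f"total    = {read_access_time(0.0):.2f} ns + tau_5")
```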

## A.2. Dual-V<sub>DD</sub> SRAM



Fig. A.10 Capacitance estimation of 4-Mbit dual-V<sub>DD</sub> SRAM circuits.

#### Write cycle

$$\begin{split} & C_{\text{peri, total, VDDH}} = (54\mu\text{m} \times \text{B} \times \text{C}_{\text{G}} + \text{L}_{\text{Y}} \times \text{C}_{\text{M2}} \times 2) \times \lambda + \text{C}_{\text{decoder, VDDH}} \\ & C_{\text{peri, total, VDDL}} = (\text{L}_{\text{X}} \times \text{C}_{\text{M2}} \times 2 + \text{L}_{\text{X}} \times \text{C}_{\text{M3}} \times (\text{B} + 4 + (22 - \log_2 \text{B}))) \times \lambda + \text{C}_{\text{decoder, VDDL}} \\ & C_{\text{BL}} + \text{C}_{\text{DL}} = (\text{L}_{\text{Y}} \times \text{C}_{\text{M2, BL}} + 614\mu\text{m} \times \text{C}_{\text{D}} + \text{L}_{\text{X}} \times \text{C}_{\text{M2, DL}}) \times \text{B} \\ & \text{Power (write)} = \text{P}_{\text{peri, VDDH}} + \text{P}_{\text{peri, VDDL}} + \text{P}_{\text{BL, DL}} + \text{P}_{\text{cell}} \\ & = f_{\text{CLK}} \times \{(\text{C}_{\text{peri, total, VDDH}} \times \text{V}_{\text{DDH}}^2 + \text{C}_{\text{peri, total, VDDL}} \times \text{V}_{\text{DDL}}^2 \\ & + (\text{C}_{\text{BL}} + \text{C}_{\text{DL}}) \times (1/6 \text{ V}_{\text{DDH}})^2) + 10f\text{Q} \times \text{V}_{\text{DDH}} \times \text{B} \} \end{split}$$

#### Read cycle

Fig. A.11 Power estimation of 4-Mbit dual-VDD SRAM circuits.
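To illustrate how the write-cycle expression of the dual-V<sub>DD</sub> SRAM splits the peripheral capacitance between the two supplies, the following Python sketch evaluates it with the parameters of Fig. A.3. Several inputs are assumptions: $\lambda$ and the decoder capacitances $C_{\text{decoder, VDDH}}$ and $C_{\text{decoder, VDDL}}$ are not tabulated here, so they appear as hypothetical arguments with placeholder defaults, and "10 fQ" is read as a charge of 10 fC per cell.

```python
import math

# Parameters from Fig. A.3 (lengths in um, capacitances in F/um).
C_G, C_D = 1.3e-15, 1.7e-15
C_M2, C_M3 = 0.062e-15, 0.057e-15
C_M2_BL, C_M2_DL = 0.157e-15, 0.105e-15
L_X, L_Y = 6600.0, 9200.0
F_CLK = 100e6
V_DDH, V_DDL = 1.5, 0.75
CELL_CHARGE = 10e-15  # assumption: "10 fQ" interpreted as 10 fC


def low_vdd_capacitance(b, lam=1.0, c_dec_l=0.0):
    """C_peri,total,VDDL: horizontal buses driven at the low supply."""
    wires = L_X * C_M2 * 2 + L_X * C_M3 * (b + 4 + (22 - math.log2(b)))
    return wires * lam + c_dec_l


def high_vdd_capacitance(b, lam=1.0, c_dec_h=0.0):
    """C_peri,total,VDDH: gate loads and vertical lines at the high supply."""
    return (54 * b * C_G + L_Y * C_M2 * 2) * lam + c_dec_h


def dual_vdd_write_power(b, lam=1.0, c_dec_h=0.0, c_dec_l=0.0, v_ddl=V_DDL):
    """Write-cycle power in watts; lam and c_dec_* are placeholder inputs."""
    c_h = high_vdd_capacitance(b, lam, c_dec_h)
    c_l = low_vdd_capacitance(b, lam, c_dec_l)
    c_bl_dl = (L_Y * C_M2_BL + 614 * C_D + L_X * C_M2_DL) * b
    energy = (c_h * V_DDH ** 2 + c_l * v_ddl ** 2
              + c_bl_dl * (V_DDH / 6) ** 2
              + CELL_CHARGE * V_DDH * b)
    return F_CLK * energy


if __name__ == "__main__":
    for b in (16, 32, 64, 128, 256):
        dual = dual_vdd_write_power(b)
        single = dual_vdd_write_power(b, v_ddl=V_DDH)  # both supplies at 1.5 V
        print(f"B={b:3d}: dual {dual * 1e3:.2f} mW, single {single * 1e3:.2f} mW")
```

Setting `v_ddl` equal to `V_DDH` recovers the single-supply write-cycle expression of A.1 (without decoders), so the difference between the two calls isolates the saving on the low-voltage buses.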

## A.3. Register file

```
C_G        = 1.3 × 10^-15 F/µm    : gate equivalent capacitance of NMOS
C_D        = 1.7 × 10^-15 F/µm    : drain equivalent capacitance of NMOS
C_M2       = 0.062 × 10^-15 F/µm  : metal #2 parasitic capacitance
C_M3       = 0.057 × 10^-15 F/µm  : metal #3 parasitic capacitance
C_M2,BL    = 0.157 × 10^-15 F/µm  : bit line parasitic capacitance
C_M2,DECL  = 0.019 × 10^-15 F/µm  : decoded line parasitic capacitance
L_X        = Cell pitch × B       : horizontal length of register file cells
L_Y        = 896 µm               : vertical length of register file cells
W_2nand    = 3.0 µm               : the sum of NMOS and PMOS gate width of 2nand gate
W_pre3nand = 3.5 µm               : the sum of NMOS and PMOS gate width of 3nand gate
B          = 16, 32, 64, 128, 256 : bit width
Cell pitch = 11.0 µm              : one NMOS switch is shared by 4 cells
f_CLK      = 100 MHz              : clock frequency
V_DDH      = 1.5 V                : high V_DD
V_DDL      = 0.75 V               : low V_DD
λ − 1                             : the ratio of driver capacitance to load
```

Fig. A.12 Parameters of 64-word × 256-bit register file in Rohm 0.35-µm technology.



Fig. A.13 Capacitance estimation of 64-word  $\times$  256-bit dual-V<sub>DD</sub> register file.

#### Write port

$$\begin{split} & \textbf{C}_{\text{peri, total, VDDH}} = (16 \mu m \times B \times C_{\text{G}} + \textbf{L}_{X} \times \textbf{C}_{\text{M2}} \times 2) \times \lambda + \textbf{C}_{\text{decoder, VDDH}} \\ & \textbf{C}_{\text{peri, total, VDDL}} = \textbf{C}_{\text{decoder, VDDL}} \\ & \textbf{C}_{\text{BL}} = (\textbf{L}_{\text{Y}} \times \textbf{C}_{\text{M2, BL}} + 38.4 \mu m \times \textbf{C}_{\text{D}}) \times B \\ & \textbf{Power} (\text{write}) = \textbf{P}_{\text{peri, VDDH}} + \textbf{P}_{\text{peri, VDDL}} + \textbf{P}_{\text{BL}} + \textbf{P}_{\text{cell}} \\ & = f_{\text{CLK}} \times \{(\textbf{C}_{\text{peri, total, VDDH}} \times \textbf{V}_{\text{DDH}}^2 + \textbf{C}_{\text{peri, total, VDDL}} \times \textbf{V}_{\text{DDL}}^2 \\ & + \textbf{C}_{\text{BL}} \times (1/6 \ \textbf{V}_{\text{DDH}})^2) + 10fQ \times \textbf{V}_{\text{DDH}} \times B \} \end{split}$$

#### Read port

$$\begin{split} & \textbf{C}_{\text{peri, total, VDDH}} = (16 \mu m \times B \times C_{G} + L_{X} \times C_{M2} \times 2) \times \lambda + \textbf{C}_{\text{decoder, VDDH}} \\ & \textbf{C}_{\text{peri, total, VDDL}} = (L_{X} \times C_{M2} \times B/2) \times \lambda + \textbf{C}_{\text{decoder, VDDL}} \\ & \textbf{C}_{\text{BL}} = (L_{Y} \times C_{M2, \text{BL}} + 38.4 \mu m \times C_{D}) \times B \\ & \textbf{Power (read)} = \textbf{P}_{\text{peri, VDDH}} + \textbf{P}_{\text{peri, VDDL}} + \textbf{P}_{\text{BL}} + \textbf{P}_{\text{sense amp}} \\ & = f_{\text{CLK}} \times \{(\textbf{C}_{\text{peri, total, VDDH}} \times \textbf{V}_{\text{DDH}}^{2} + \textbf{C}_{\text{peri, total, VDDL}} \times \textbf{V}_{\text{DDL}}^{2} \\ & + \textbf{C}_{\text{BL}} \times \textbf{V}_{\text{DDH}} \times 0.1 \textbf{V}) + 50 \text{fQ} \times \textbf{V}_{\text{DDH}} \times \textbf{B} \} \end{split}$$

Total power = Power (write) + Power (read)  $\times 2$  (1 write port and 2 read ports)

Fig. A.14 Power estimation of register file with 1 write and 2 read ports.
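The register-file estimate of Fig. A.14 combines one write port and two read ports. The Python sketch below evaluates the expressions above with the parameters of Fig. A.12. As before, $\lambda$ and the decoder capacitances are not tabulated, so they are hypothetical arguments with placeholder defaults, "10 fQ" and "50 fQ" are read as charges of 10 fC and 50 fC, and all names are illustrative.

```python
# Sketch: register file power with 1 write port and 2 read ports.
# Parameters from Fig. A.12 (lengths in um, capacitances in F/um).
C_G, C_D = 1.3e-15, 1.7e-15
C_M2 = 0.062e-15
C_M2_BL = 0.157e-15
L_Y = 896.0
CELL_PITCH = 11.0
F_CLK = 100e6
V_DDH, V_DDL = 1.5, 0.75
CELL_CHARGE = 10e-15   # assumption: "10 fQ" interpreted as 10 fC
SENSE_CHARGE = 50e-15  # assumption: "50 fQ" interpreted as 50 fC


def bitline_capacitance(b):
    """C_BL in farads for bit width B."""
    return (L_Y * C_M2_BL + 38.4 * C_D) * b


def high_vdd_capacitance(b, lam=1.0, c_dec_h=0.0):
    """C_peri,total,VDDH, identical for write and read ports."""
    l_x = CELL_PITCH * b
    return (16 * b * C_G + l_x * C_M2 * 2) * lam + c_dec_h


def write_port_power(b, lam=1.0, c_dec_h=0.0, c_dec_l=0.0):
    c_h = high_vdd_capacitance(b, lam, c_dec_h)
    c_l = c_dec_l  # only part of the decoder runs at V_DDL
    energy = (c_h * V_DDH ** 2 + c_l * V_DDL ** 2
              + bitline_capacitance(b) * (V_DDH / 6) ** 2
              + CELL_CHARGE * V_DDH * b)
    return F_CLK * energy


def read_port_power(b, lam=1.0, c_dec_h=0.0, c_dec_l=0.0):
    l_x = CELL_PITCH * b
    c_h = high_vdd_capacitance(b, lam, c_dec_h)
    c_l = (l_x * C_M2 * b / 2) * lam + c_dec_l
    energy = (c_h * V_DDH ** 2 + c_l * V_DDL ** 2
              + bitline_capacitance(b) * V_DDH * 0.1
              + SENSE_CHARGE * V_DDH * b)
    return F_CLK * energy


def total_power(b, **kwargs):
    """Power (write) + 2 x Power (read), in watts."""
    return write_port_power(b, **kwargs) + 2 * read_port_power(b, **kwargs)


if __name__ == "__main__":
    for b in (16, 32, 64, 128, 256):
        print(f"B={b:3d}: total {total_power(b) * 1e3:.3f} mW "
              "(decoders omitted, lam=1 assumed)")
```

With the decoder terms left at zero, the printed values understate the true total. The sketch is meant to show the structure of the estimate rather than to reproduce Fig. 4.8.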

## B.1. Rohm 0.35-µm process

### B.1.1. The 1st SRAM test chip



Fig. B.1 SRAM memory cell design on the 1st test chip.

| Process          | 0.35-µm CMOS 3-metal (Rohm)               |
|------------------|-------------------------------------------|
| Organization     | 256 rows $\times$ 256 columns             |
|                  | (7-transistor cell: 224 columns)          |
|                  | (6-transistor cell: 32 columns)           |
| Supply voltage   | 1.5 V to 3.3 V                            |
| Memory cell size | $5.45 \times 8.35\ \mu\text{m}^2$ (1 cell) |
|                  | $29.55 \times 8.35\ \mu\text{m}^2$ (4 cells) |
| Total size       | $1,892.7 \times 2,280.65\ \mu\text{m}^2$  |

Fig. B.2 Specification of 64-Kbit SRAM on the 1st test chip.







Fig. B.3 SRAM memory cell design on the 2nd test chip.



Fig. B.4 Circuit schematic of SRAM cells and peripheral circuits.

| Process          | 0.35-µm CMOS 3-metal (Rohm)                                  |
|------------------|--------------------------------------------------------------|
| Organization     | 256 rows × 256 columns                                       |
|                  | (7-transistor cell: 192 columns)                             |
|                  | (6-transistor cell: 64 columns)                              |
| Supply voltage   | 1.5 V to 3.3 V                                               |
| Memory cell size | $5.45 \times 9.00\ \mu\text{m}^2$ (1 cell)                   |
|                  | $25.95 \times 9.00\ \mu\text{m}^2$ (4 cells)                 |
|                  | $123.0 \times 9.00\ \mu\text{m}^2$ (16 cells + local row decoder) |
| Total size       | $1,970.65 \times 2,470.9\ \mu\text{m}^2$                     |

Fig. B.5 Specification of 64-Kbit SRAM on the 2nd test chip.

### B.1.3. Register file



Fig. B.6 Circuit schematic of register file memory cell with NMOS switch. Peripheral circuits such as the predecoder, write driver, and read buffer are the same as those of the 64-Kbit SRAM.

### B.1.4. Level converter



Fig. B.7 Circuit schematic of level converters. (a) Conventional level converter. (b) Bypass level converter (BLC). (c) Pseudo NMOS level converter. (d) Replica-biased level converter. The output node of each level converter is connected to a dummy word-line containing 256 cells.

## B.2. NEC 0.25-µm SOI process

### B.2.1. 2-Kbit SRAM






Fig. B.9 Layout floor plan of 2-Kbit SRAM.



Fig. B.10 Layout size of 2-Kbit SRAM.



Fig. B.11 Circuit schematic of the whole of 2-Kbit SRAM.

| Process             | 0.25-µm SOI 3-metal                      |
|---------------------|------------------------------------------|
| Organization        | 128 words × 16 bits                      |
| Operating frequency | 500MHz                                   |
| Supply Voltage      | 0.5V (address buffer, predecoder)        |
|                     | 1.0V (cell, sense amp, control circuits) |
| Memory Cell Size    | 3.32×7.86 μm <sup>2</sup>                |
| Total size          | 463.48 × 496.16 μm <sup>2</sup>          |
| Pin                 | CK, WE, Address (7bit),                  |
|                     | Data IN (16bit), Data OUT (16bit)        |

Fig. B.12 2-Kbit SRAM specification.



Fig. B.13 Circuit schematic of control signal generators.



Fig. B.14 Circuit schematic of sense amplifier and data line.







Fig. B.16 SRAM memory cell design.



Fig. B.17 Circuit schematic of address buffer.



Fig. B.18 Circuit schematic of predecoder.



Fig. B.19 Circuit schematic of row decoder.



Fig. B.20 Circuit schematic of level converters.



Fig. B.21 Circuit schematic of 2nand predecoder and row decoder.



Fig. B.22 Circuit schematic of 3nand predecoder and row decoder.







Fig. B.24 Simulated delay time of decoders.



Fig. B.25 Circuit schematic of input data line (1<sup>st</sup> stage).



Fig. B.26 Circuit schematic of input data line (2<sup>nd</sup> stage).



Fig. B.27 Circuit schematic of output data line (1<sup>st</sup> stage).



Fig. B.28 Circuit schematic of output data line (2<sup>nd</sup> stage).

### B.2.2. Register file



Fig. B.29 Layout of 16-word × 16-bit register file.



Fig. B.30 Layout floor plan of 16-word  $\times$  16-bit register file with 1 write and 2 read ports.



Fig. B.31 Circuit schematic of register file cell.



Fig. B.32 Register file cell design.



Fig. B.33 Circuit schematic of predecoder and row decoder of register file.



Fig. B.34 Circuit schematic of predecoder and row decoder of register file.



Fig. B.35 Circuit schematic of 7-transistor SRAM cells and buffers.



Fig. B.36 Circuit schematic of 48-Kbit 7-transistor SRAM cells.

## References

- F. Federico, H. Marcian. E Jr., M. Stanley, and S. Masatoshi, "History of the 4004," IEEE Micro., Vol. 16, pp. 10-20, Dec. 1996
- [2] Kiyoo Itoh, "VLSI Memory Chip Design," Springer-Verlag, 2001
- [3] "International Technology Roadmap for Semiconductors 2001 edition," Semiconductor Industry Association, 2001
- [4] Hiroshi Iwai, "CMOS Technology Year 2010 and Beyond," IEEE J. Solid-State Circuits, vol. 34, pp. 357-366, Mar., 1999
- [5] A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, "Low-Power CMOS Digital Design," IEEE J. Solid-State Circuits, vol. 27, pp. 473-484, Apr. 1992
- [6] T. Sakurai and A.R. Newton, "Alpha-power law MOSFET model and its application to CMOS inverter delay and other formulas," IEEE J. Solid-State Circuits, vol. 25, pp. 584-594, Apr. 1990
- [7] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, "1-V Power Supply High-Speed Digital Circuit Technology with Multithreshold-Voltage CMOS," IEEE J. Solid-State Circuits, vol. 30, pp. 847-854, Aug. 1995
- [8] K. Fujii, T. Douseki, and M. Harada, "A Sub-1V Triple-Threshold CMOS/SIMOX Circuit for Active Power Reduction," in 1998 Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1998, pp. 190-191
- [9] T. Inukai, M. Takamiya, K. Nose, H. Kawaguchi, T. Hiramoto, and T. Sakurai, "Boosted Gate MOS (BGMOS): Device/Circuit Cooperation Scheme to Achieve Leakage-Free Giga-Scale Integration," in IEEE 2000 Custom Integrated Circuits Conf. Dig. Tech. Papers, May 2000, pp. 409-412
- [10] J.T. Kao and A.P. Chandrakasan, "Dual-Threshold Voltage Techniques for Low-Power Digital Circuits," IEEE J. Solid-State Circuits, vol. 35, pp. 1009-1018, July 2000
- [11] H. Kawaguchi, K. Nose, and T. Sakurai, "A Super Cut-Off CMOS (SCCMOS) Scheme for 0.5-V Supply Voltage with Picoampere Stand-By Current," IEEE J. Solid-State Circuits, vol. 35, pp. 1498-1501, Oct. 2000
- [12] K. Nose, M. Hirabayashi, H. Kawaguchi, S. Lee, and T. Sakurai, "Vth-Hopping Scheme for 82% Power Saving in Low-Voltage Processors," in IEEE 2001 Custom Integrated Circuits Conf. Dig. Tech. Papers, May 2001, pp. 93-96
- [13] T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshioka, K. Suzuki, F. Sano, M. Norishima, M. Murota, M. Kato, M. Kinugawa, M. Kakumu, and T. Sakurai, "A 0.9-V, 150-MHz, 10-mW, 4mm<sup>2</sup>, 2-D Discrete Cosine Transform Core Processor with Variable Threshold-Voltage(VT) Scheme," IEEE J. Solid-State Circuits, vol. 31, pp. 1770-1779,

Nov. 1996

- [14] M. Miyazaki, G. Ono, T. Hattori, K. Shiozawa, K. Uchiyama, and K. Ishibashi, "A 1000-MIPS/W Microprocessor using Speed-Adaptive Threshold-Voltage CMOS with Forward Bias," in 2000 Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 420-421
- [15] K. Usami, K. Nogami, M. Igarashi, F. Minami, Y. Kawasaki, T. Ishikawa, M. Kanazawa, T Aoki, M. Takano, C. Mizuno, M. Ichida, S. Sonoda, M. Takahashi, and N. Hatanaka, "Automated Low-power Technique Exploiting Multiple Supply Voltages Applied to a Media Processor," in IEEE Custom Integrated Circuits Conf. Dig. Tech. Papers, May 1997, pp. 131-134
- [16] K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba, Y. Watanabe, K. Matsuda, T. Maeda, and T. Kuroda, "A 300MIPS/W RISC Core Processor with Variable Supply-Voltage Scheme in Variable Threshold-Voltage CMOS," in IEEE Custom Integrated Circuits Conf. Dig. Tech. Papers, May 1997, pp. 587-590
- [17] "The Technology Behind Crusoe Processors", http://www.transmeta.com/
- [18] M. Yoshimoto, K. Anami, H. Shinohara, T. Yoshihara, H. Takagi, S. Nakao, S. Kayano, and T. Nakano, "A Diveided Word-Line Structure in the Static RAM and Its Application to a 64K Full CMOS RAM," IEEE J. Solid-State Circuits, vol. SC-18, pp. 479-485, Oct. 1983
- [19] T. Hirose, H. Kuriyama, S. Murakami, K. Yuzuriha, T. Mukai, K. Tsutsumi, Y. Nishimura, Y. Kohno, and K. Anami, "A 20-ns 4-Mb CMOS SRAM with Hierarchical Word Decoding Architecture", IEEE J. Solid-State Circuits, vol. 25, pp. 1068-1074, Oct. 1990
- [20] K. Ishibashi, K. Takasugi, T. Yamanaka, T. Hashimoto, and K. Sasaki, "A 1V TFT-Load SRAM using a Two-Step Word-Voltage Method," " in 1992 Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1992, pp. 206-207
- [21] H. Norimura and N. Shibata, "A 1-V 1-Mb SRAM for Portable Equipment," in 1996 Symp. on Low Power Electronics and Design Dig. Tech. Papers, Aug. 1996, pp. 61-66
- [22] K. Ishibashi, K. Takasugi, K. Komiyaji, H. Toyoshima, T. Yamanaka, A. Fukami, N. Hashimoto, N. Ohki, A. Shimizu, T. Hashimoto, T. Nagano, and T. Nishida, "A 6-ns 4-Mb CMOS SRAM with Offset-Voltage-Insensitive Current Sense Amplifiers", IEEE J. Solid-State Circuits, vol. 30, pp. 480-486, Apr. 1995
- [23] K. Itoh, A.R. Fridi, A. Bellaouar, and M. I. Elmasry, "A Deep Sub-V, Single Power-Supply SRAM Cell with Multi-V<sub>T</sub>, Boosted Storage Node and Dynamic Load," in 1996 Symp. on VLSI Circuits Dig. Tech. Papers, June 1996, pp. 132-133
- [24] H. Yamauchi, T. Iwata, H. Akamatsu, and A. Matsuzawa, "A 0.8 V/100 MHz/sub-5 mW-Operated Mege-Bit SRAM Cell Architecture with Charge-Recycle Offset-Source

Driving (OSD) Scheme," in 1996 Symp. on VLSI Circuits Dig. Tech. Papers, June 1996, pp. 126-127

- [25] H. Mizuno and T. Nagano, "Driving Source-Line Cell Architecture for Sub-1-V High-Speed Low-Power Applications", IEEE J. Solid-State Circuits, vol.31, No.4, pp. 552-557, Apr., 1996
- [26] M. Ukita, S. Murakami, T. Yamagata, H. Kuriyama, Y. Nishimura and K. Anami, "A Single-Bit-Line Cross-Point Cell Activation (SCPA) Architecture for Ultra-Low-Power SRAM's", IEEE J. Solid-State Circuits, vol. 28, pp. 1114-1118, Nov. 1993
- [27] K. Sasaki, K. Ishibashi, T. Yamanaka, N. Hashimoto, T. Nishida, K. Shimohigashi, S. Hanamura, and S. Honjo, "A 9-ns 1-Mbit CMOS SRAM," IEEE J. Solid-State Circuits, vol. 24, pp. 1219-1225, Oct. 1989
- [28] K. Sasaki, K. Ishibashi, K. Ueda, K. Komiyaji, T. Yamanaka, N. Hashimoto, H. Toyoshima, F. Kojima, and A. Shimizu, "A 7ns 140mW 1Mb CMOS SRAM with Current Sense Amplifier," in 1992 Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1992, pp. 208-209,
- [29] K. Osada, H. Higuchi, K. Ishibashi, N. Hashimoto, and K. Shiozawa, "A 2ns Access, 285MHz, Two-Port Cache Macro using Double Global Bit-Line Pairs," in 1997 Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1997, pp. 402-403
- [30] N. Shibata, H. Morimura, and M. Watanabe, "A 1-V, 10-MHz, 3.5-mW, 1-Mb MTCMOS SRAM with Charge-Recycling Input/Output Buffers," IEEE J. Solid-State Circuits, vol. 34, pp. 866-877, June 1999
- [31] F. Hamzaoglu, Y. Ye, A. Keshavarzi, K. Zhang, S. Narendra, S. Borkar, M. Stan, and V. De, "Dual- V<sub>T</sub> SRAM Cells with Full-Swing Single-Ended Bit Line Sensing for High-Performance On-Chip Cache in 0.13um Technology Generation", International Symposium on Low Power Electronics and Design, pp. 15-19, 2000
- [32] N. Shibata, H. Morimura, and M. Harada, "1-V, 100-MHz Embedded SRAM Techniques for Battery-Operated MTCMOS/SIMOX ASICs," IEEE J. Solid-State Circuits, vol. 35, pp. 1396-1407, Oct. 2000
- [33] T. Douseki, N. Shibata, and J. Yamada, "A 0.5-1V MTCMOS/SIMOX SRAM Macro with Multi-Vth Memory Cells," in IEEE International SOI Conf., Oct. 2000, pp. 24-25
- [34] Kenneth W. Mai, Toshihiko Mori, Bharadwaj S. Amrutur, Ron Ho, Bennett Wilburn, Mark, A. Horowitz, Isao Fukushi, Tetsuo Izawa, and Shin Mitarai, "Low-Power SRAM Design Using Half-Swing Pulse-Mode Techniques," IEEE J. Solid-State Circuits, vol. 33, pp. 1659-1671, Nov. 1998
- [35] Evert Seevinck, Frans J. List, and Jan Lohstroh, "Static-Noise Margin Analysis of MOS SRAM Cells," IEEE J. Solid-State Circuits, vol. SC-22, pp. 748-754, Oct. 1987
- [36] Azeez J. Bhavnagarwala, Xinghai Tang, and James D. Meindl, "The Impact of Intrinsic Device Fluctuations on CMOS SRAM Cell Stability," IEEE J. Solid-State Circuits, vol. 36, pp. 658-665, Apr. 2001

- [37] H. Kawaguchi, Y. Itaka, and T. Sakurai, "Dynamic Leakage Cut-Off Scheme for Low-Voltage SRAM's," in 1998 Symp. on VLSI Circuits Dig. Tech. Papers, June 1998, pp. 140-141
- [38] M. Matsui, H. Momose, Y. Urakawa, T. Maeda, A. Suzuki, N. Urakawa, K. Sato, J. Matsunaga, and K. Ochii, "An 8-ns 1-Mbit ECL BiCMOS SRAM with Double-Latch ECL-to-CMOS-Level Converters," IEEE J. Solid-State Circuits, vol. 24, pp. 1226-1232, Oct. 1989
- [39] Y. Nakagome, K. Itoh, M. Isoda, K. Takeuchi, and M. Aoki, "Sub-1-V Swing Internal Bus Architecture for Future Low-Power ULSI's," IEEE J. Solid-State Circuits, vol. 28, pp. 414-419, Apr. 1993
- [40] M. Hiraki, H. Kojima, H. Misawa, T. Akazawa, and Y. Hatano, "Data-Dependent Logic Swing Internal Bus Architecture for Ultralow-Power LSI's," IEEE J. Solid-State Circuits, vol. 30, pp. 397-402, Apr. 1995
- [41] Hui Zhang and Jan Rabaey, "Low-Swing Interconnect Interface Circuits," in 1998 International Symp. on Low Power Electronics and Design Dig. Tech. Papers, 1998, pp. 161-166
- [42] Sung Mo Kang and Yusuf Leblebici, "CMOS Digital Integrated Circuits Analysis and Design," McGraw-Hill, 1999
- [43] Creighton Asato, "A 14-Port 3.8-ns 116-Word 64-b Read-Renaming Register File," IEEE J. Solid-State Circuits, vol. 30, pp. 1254-1258, Nov. 1995
- [44] Yong Moon and Deog-Kyoon Jeong, "A 32 × 32-b Adiabatic Register File with Supply Clock Generator," IEEE J. Solid-State Circuits, vol. 33, pp. 696-701, May 1998

## Acknowledgements

I would like to express my special thanks to Professor Takayasu Sakurai of the Institute of Industrial Science (IIS), the University of Tokyo, for his valuable direction, useful advice, and general support of this work.

I would like to thank Mr. Kawaguchi and Mr. Inagaki for arranging the research environment and for their valuable comments. I would also like to thank Mr. Nose for his many pieces of useful advice, and Mr. Kanda for collaborating on the test chip design and for useful discussions.

I am grateful to all the members of the Sakurai laboratory for making my research life enjoyable.

The chip fabrication was supported by the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Rohm Corporation, and also by NTT Electronics Corporation.

## List of Publications

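- [1] Sadaaki Hattori, Kouichi Kanda, and Takayasu Sakurai, "High-Speed Level Converter," 62nd Autumn Meeting of the Japan Society of Applied Physics, 2001, p. 703 (in Japanese)
- [2] Sadaaki Hattori and Takayasu Sakurai, "Low-Power High-Speed SRAM Using a Low-Swing Bit-Line Scheme," 2001 IEICE Electronics Society Conference, C-12-38, p. 99 (in Japanese)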
- [3] Sadaaki Hattori and Takayasu Sakurai, "90% Write Power Saving SRAM Using Sense-Amplifying Memory Cell", 2002 Symposium on VLSI Circuits (accepted)