## WP 4.6: 200MHz Video Compression Macrocells Using Low-Swing Differential Logic

Masataka Matsui, Hiroyuki Hara, Katsuhiro Seta, Yoshiharu Uetani, Lee-Sup Kim<sup>2</sup>, Tetsu Nagamatsu, Takayoshi Shimazawa, Shinji Mita, Goichi Otomo, Takeshi Oto, Yoshinori Watanabe<sup>1</sup>, Fumihiko Sano<sup>1</sup>, Akihiko Chiba<sup>1</sup>, Kouji Matsuda<sup>1</sup>, Takayasu Sakurai

Toshiba Corp. / <sup>1</sup>Toshiba Microelectronics Corp., Kanagawa, Japan; <sup>2</sup>Now with Korea Advanced Institute of Science & Technology, Seoul, Korea

Improving the performance of fully dedicated macrocells is key to realizing HDTV-resolution video de/compression LSIs that operate at more than 100MHz with reasonable power consumption and a chip size small enough for consumer applications. Existing circuit techniques are either not sufficiently fast or area consuming. Figure 1a shows the sense-amplifying pipeline flip-flop (SA-F/F) scheme. The SA-F/F amplifies reduced-swing differential inputs (D, $\overline{D}$) and latches data like a pipeline register, synchronous to a single-phase clock (CLK). As shown in Figure 1b, the differential nodes D and $\overline{D}$ are pre-discharged to ground during active $\Phi_{-}$.

Figure 2 shows a 4b carry-skip adder using the SA-F/F scheme. The carry propagation is about 20 times faster than that of the conventional Manchester carry chain because the SA-F/F detects a 100mV input difference ($\Delta V_{in}$). Adders wider than 32b are constructed by serially connecting 4b adders without additional area-consuming speed-up circuits such as carry look-ahead (CLA). Unlike ordinary reduced-voltage-swing circuits, the sense-amp needs no latch-timing optimization, because the SA-F/F uses the system clock itself as the latch signal. No critical timing is needed and the timing margin is always optimized. The amplifying time of the SA-F/F, of the order of 1ns, is not included in the addition time but is counted in the clock-to-data-out delay of the pipeline register, which is not usually in the critical path.
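The block-level behaviour of a carry-skip adder built from serially connected 4b groups can be sketched in software. This is only a logical model under stated assumptions: the function name `carry_skip_add`, its parameters, and the 20b default width are illustrative choices, and nothing here models the differential nMOS circuit or the SA-F/F itself.

```python
def carry_skip_add(a, b, width=20, block=4, carry_in=0):
    """Behavioural model of a carry-skip adder made of `block`-bit groups.

    Hypothetical helper for illustration; returns (sum, carry_out).
    """
    assert width % block == 0, "width must be a multiple of the block size"
    mask = (1 << block) - 1
    total, carry = 0, carry_in
    for pos in range(0, width, block):
        a_blk = (a >> pos) & mask
        b_blk = (b >> pos) & mask
        propagate = a_blk ^ b_blk  # per-bit propagate signals of this group
        if propagate == mask:
            # Every bit propagates: the incoming carry skips the group.
            carry_out = carry
        else:
            # Otherwise the carry-out is generated inside the group.
            carry_out = (a_blk + b_blk + carry) >> block
        total |= ((a_blk + b_blk + carry) & mask) << pos
        carry = carry_out
    return total, carry


if __name__ == "__main__":
    print(carry_skip_add(0xFFFFF, 1))  # carry ripples or skips through all five groups
```

When all bits of a group propagate, the group sum equals the all-ones pattern plus the carry-in, so the skipped carry and the rippled carry agree. That is why the bypass changes only the delay and not the result.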

Since the differential input voltage of the SA-F/F is about 100mV, the threshold voltage drop caused by nMOS pass-transistors and pull-up transistors does not hinder the function of the SA-F/F even in low-voltage operation. The area penalty of nMOS differential trees compared to that of ordinary CMOS gates is small because only nMOS transistors are used. For a 20b adder, the circuit with no additional CLA has about 30% area and 50% speed advantages over a conventional CMOS implementation. Since the current-mode latch sense-amp employed in the SA-F/F does not consume dc power and the voltage swing is reduced in high-speed operation, the SA-F/F scheme is more power-efficient than conventional CMOS.

The SA-F/F scheme is applied to two hand-crafted macrocells that are key for video de/compression LSIs. The first is a discrete cosine transform (DCT) processor executing two-dimensional $8 \times 8$ DCT and inverse DCT (IDCT). The macrocell has a parallel architecture based on distributed arithmetic and a fast DCT algorithm that delivers high-throughput DCT/IDCT processing of one pixel per clock [1]. In the DCT macro, a one-dimensional linear transformation of the form

$$Y = \sum_{k=0}^{3} C_k X_k, \qquad X_k = \sum_{n=0}^{15} x_{kn} \times 2^{-n}$$

$$\left(C_k: \text{DCT coefficient};\ X_k: \text{input data};\ x_{kn} = 0, 1\ (n \neq 0);\ x_{kn} = 0, -1\ (n = 0): \text{sign bit of 2's complement}\right)$$

is computed by iterative multiplication and accumulation (MAC) as

$$Y_i = \left[\sum_{k=0}^{3} C_k x_{k(2i)}\right] + 2^{-1} \left[\sum_{k=0}^{3} C_k x_{k(2i+1)}\right] + 2^{-2} Y_{i+1} \quad (i = 7, 6, \ldots, 0;\ Y = Y_0,\ Y_8 = 0)$$
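The two-bits-per-step recurrence can be checked numerically against the direct sum. The sketch below assumes shift factors of $2^{-1}$ and $2^{-2}$, accumulation starting from zero, 16b 2's-complement inputs in $[-1, 1)$, and a sign bit that contributes with weight $-1$. The helper names (`to_bits`, `partial`, `mac`) are illustrative and do not come from the paper.

```python
from fractions import Fraction


def to_bits(value, n_bits=16):
    """Bits x_0..x_15 of a value in [-1, 1); x_0 is the 2's-complement sign bit.

    Assumes `value` is an exact multiple of 2**-(n_bits - 1).
    """
    scaled = int(value * (1 << (n_bits - 1)))
    raw = scaled & ((1 << n_bits) - 1)
    return [(raw >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def partial(coeffs, bits, n):
    """Sum over k of C_k * x_kn, counting the sign bit (n = 0) as 0 or -1."""
    sign = -1 if n == 0 else 1
    return sum((c * sign * b[n] for c, b in zip(coeffs, bits)), Fraction(0))


def mac(coeffs, inputs):
    """Evaluate Y by the recurrence, consuming two bit positions per iteration."""
    bits = [to_bits(x) for x in inputs]
    y = Fraction(0)  # starting value of the accumulator
    for i in range(7, -1, -1):
        y = partial(coeffs, bits, 2 * i) + partial(coeffs, bits, 2 * i + 1) / 2 + y / 4
    return y


if __name__ == "__main__":
    C = [Fraction(1, 2), Fraction(-3, 8), Fraction(1, 4), Fraction(7, 16)]
    X = [Fraction(-1), Fraction(12345, 32768), Fraction(-20000, 32768), Fraction(1, 32768)]
    print(mac(C, X) == sum(c * x for c, x in zip(C, X)))
```

Exact rational arithmetic is used so that the comparison with the direct sum is not affected by rounding. A hardware MAC would instead work with fixed-point words of limited width.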

Figure 3 is a block diagram of the MAC in the DCT/IDCT macro implementing the above equation. Partial products of the form $\Sigma C_k x_{kn}$ are derived from two 16b × 16-word table-look-up ROMs. A 20b differential carry-skip adder with the SA-F/F is the final adder. Owing to the speed of the SA-F/F scheme, no pipeline latch is required in the MAC stage. The DCT macro requires 16 MAC units occupying 60% of the total macro area. Because the 20b adders with the SA-F/F have a smaller area, the overall macro size is reduced by 15% compared to a conventional CMOS implementation.
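A 16-word table suffices because four input bits, one per coefficient, select among the $2^4$ possible sums $\Sigma C_k x_{kn}$. A minimal sketch of such a table follows. It assumes integer-coded coefficients and an address whose bit $k$ is $x_{kn}$. The actual ROM word format and address ordering are not given in the text, and `build_rom` and `lookup` are hypothetical names.

```python
def build_rom(coeffs):
    """Table of partial sums: entry `addr` holds the sum of C_k for each set bit k."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def lookup(rom, bit_slice):
    """Return sum_k C_k * x_kn for one bit position n.

    `bit_slice` is (x_0n, x_1n, x_2n, x_3n); bit k becomes address bit k (assumed order).
    """
    addr = sum(bit << k for k, bit in enumerate(bit_slice))
    return rom[addr]


if __name__ == "__main__":
    rom = build_rom([3, -5, 7, 11])  # illustrative integer coefficients
    print(len(rom), lookup(rom, (1, 0, 1, 0)))
```

With two such tables, one addressed by the even bit position and one by the odd bit position, each MAC iteration needs only two look-ups and additions instead of four multiplications.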

The second macro is a variable-length decoder (VLD) that decodes a variable-length code in one clock cycle regardless of code length [2]. The VLD macro is composed of a head-shift unit, a code look-up table unit and a rear code decoder unit. Figure 4 is a VLD block diagram. In every clock cycle, the head-shift unit shifts the input bitstream by the previously decoded code length (from 0 to 31b) and outputs the 32b word to be decoded in the current cycle. The 31b maximum code length is sufficient to decode ISO/MPEG1 and ISO/MPEG2 bit streams. The head-shift unit also includes a 3-word FIFO to buffer input data segments. The code table unit matches bit patterns against the table to decode a symbol, and outputs the decoded symbol together with its code length.
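The decode loop described here, which shifts by the previous code length, matches the head of the window, and emits a symbol and its length, can be sketched as follows. The prefix code in `TOY_TABLE` is a made-up example rather than an MPEG table, and the sequential prefix search stands in for what the hardware table does in a single cycle.

```python
TOY_TABLE = {"0": "A", "10": "B", "110": "C", "111": "D"}  # hypothetical prefix code


def decode(bits, table=TOY_TABLE, window=32):
    """Decode a string of '0'/'1' characters, one symbol per loop iteration."""
    symbols, pos = [], 0
    while pos < len(bits):
        word = bits[pos:pos + window]  # head-shift output for this cycle
        for length in range(1, len(word) + 1):
            symbol = table.get(word[:length])  # pattern match on the window head
            if symbol is not None:
                break
        else:
            raise ValueError(f"no code matches at bit {pos}")
        symbols.append(symbol)
        pos += length  # next cycle shifts by the decoded code length
    return symbols


if __name__ == "__main__":
    print(decode("0101101110"))
```

Each iteration corresponds to one clock cycle of the macro. The feedback of `length` into `pos` is the loop that makes the head-shift path speed critical.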

The head-shift unit includes two serially connected 64b-to-32b barrel shifters that are critical in determining speed. The shifting circuits use differential nMOS pass-transistors, with outputs received by the SA-F/Fs as shown in Figure 5. For simplicity, 4-to-2 barrel shifters are depicted in the figure. The two barrel shifters are merged in layout. Owing to the high-speed SA-F/F scheme, no sense-amp is needed between the first barrel shifter (BS0) outputs and the second barrel shifter (BS1) inputs, reducing area.
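Functionally, a 64b-to-32b barrel shifter selects a 32b window from a 64b input at a given bit offset. The sketch below models that selection with the offset counted from the MSB, which is an assumption. How the shift amount is divided between BS0 and BS1 is not stated in the text and is not modelled here, and `barrel_shift` is an illustrative name.

```python
def barrel_shift(word, shift, in_width=64, out_width=32):
    """Return the `out_width`-bit window starting `shift` bits below the MSB."""
    max_shift = in_width - out_width
    if not 0 <= shift <= max_shift:
        raise ValueError(f"shift must be in 0..{max_shift}")
    return (word >> (max_shift - shift)) & ((1 << out_width) - 1)


if __name__ == "__main__":
    # 64b-to-32b case: skip the first 8 bits, keep the next 32.
    print(hex(barrel_shift(0x0123456789ABCDEF, 8)))
    # 4-to-2 case, the size drawn for simplicity in the figure.
    print(bin(barrel_shift(0b1011, 1, in_width=4, out_width=2)))
```

In the pass-transistor implementation each output bit is a multiplexer over the candidate input bits. The sketch captures only the selected result and not the circuit.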

The DCT and VLD test chips are fabricated in 0.8 $\mu$m double-metal CMOS technology. 0.5 $\mu$m nMOSFETs and 0.6 $\mu$m pMOSFETs are used for 3.3V operation. Die micrographs are shown in Figure 6. The 120k-transistor DCT macro occupies 13.3 mm². In the 28k-transistor VLD macro, implemented with a preliminary MPEG2 table, the head-shift unit occupies 2.7 mm² whereas the look-up table and the rear code decoder occupy 2.3 mm² [3]. The area for the look-up table depends on table configuration. Features of the macros are summarized in Table 1. Figure 7 shows simulated MAC waveforms of the DCT macro. 200MHz operation is observed at a 3.3V power supply with power consumption of 0.35W, and 100MHz operation is attainable at 2V with power consumption of 0.15W. These two macros are used in a video decoder LSI that decompresses MPEG2 bit streams for HDTV-resolution signals or in a fast JPEG processor.

### Acknowledgments

Discussions and encouragement by N. Kai, T. Odaka, A. Parameswar, K. Maeguchi, K. Kanzaki, S. Suzuki, S. Sasaki, S. Kohyama, H. Nakatsuka and Y. Unno are appreciated.

### References

[1] Uramoto, S., et al., "A 100-MHz 2-D Discrete Cosine Transform Core Processor," Tech. Dig. Symp. VLSI Circuits, pp. 35-36, May 1991.

[2] Sun, M.-T., et al., "A High-Speed Entropy Decoder for HDTV," Proc. IEEE CICC, pp. 26.3.1, 1992.

[3] ISO/MPEG, MPEG-2 Test Model 3 (TM3), Nov. 1992.

Figure 1: SA-F/F scheme, panels (a) and (b).



Figure 2: Four-bit carry-skip adder using SA-F/F scheme.



Figure 3: Block diagram; MAC unit in DCT/IDCT macro.



Figure 4: VLD macro block diagram.



Figure 5: 4-to-2 barrel shifters in VLD macro.

Figure 6: See page 314.



Figure 7: Simulated waveforms; MAC in DCT macro.

| Macro                    | Feature         | Specification                        |
|--------------------------|-----------------|--------------------------------------|
| DCT/IDCT (2-dimensional) | Block size      | 8×8 fixed                            |
| DCT/IDCT (2-dimensional) | Data format     | 9b signed (pixel), 12b signed (DCT)  |
| DCT/IDCT (2-dimensional) | Latency         | 112 clocks                           |
| DCT/IDCT (2-dimensional) | Throughput      | 64 clocks/block                      |
| DCT/IDCT (2-dimensional) | Accuracy        | CCITT H.261 compatible               |
| VLD                      | Code table      | ISO/MPEG2 Test Model 3 (incl. MPEG1) |
| VLD                      | Input bitstream | 32b (MSB first)                      |
| VLD                      | Max code length | 31b                                  |
| VLD                      | Throughput      | 1 symbol/clock                       |

Table 1: Macro features.










Figure 6: Chip micrographs: (above) DCT, (right) VLD.



314 • IEEE International Solid-State Circuits Conference