on Electronics VOL.E79-C NO.6 JUNE 1996

A PUBLICATION OF THE ELECTRONICS SOCIETY

The Institute of Electronics, Information and Communication Engineers, Kikai-Shinko-Kaikan Bldg., 5-8, Shibakoen 3chome, Minato-ku, TOKYO, 105 JAPAN

PAPER Special Issue on ULSI Memory Technology

# Special and Embedded Memory Macrocells for Low-Cost and Low-Power in MPEG Environment

Hiroyuki HARA<sup>†</sup>, Nonmember, Masataka MATSUI<sup>†</sup>, Goichi OTOMO<sup>†</sup>, Katsuhiro SETA<sup>†</sup>, and Takayasu SAKURAI<sup>†</sup>, Members

Special memory and embedded memories used in a newly designed MPEG2 decoder LSI are described. Orthogonal memory, which provides parallel-to-serial transposition, is employed in an IDCT (Inverse Discrete Cosine Transform) block for small area and low power. The orthogonal memory realizes this special function with 50% of the area and power of a flip-flop array. FIFO's and other dual-port memories are built from a single-port RAM operated twice in one clock cycle to reduce cost. The flip-flop cell is one of the important memory elements in the MPEG environment, and it is also improved for low cost by optimizing its functionality for video processing. The area and power of the fabricated MPEG2 decoder chip are reduced by 20% using these techniques. As for testability, a direct test mode is implemented for small area. An instruction RAM is placed outside the pad area in parallel with the normal instruction ROM and activated by Al-masterslice for extensive debugging and early sampling. Other memory-related techniques and the key features of the decoder LSI are also described.

**key words:** MPEG2 decoder LSI, compression/decompression, orthogonal memory, embedded memory, on-chip memory testability

## 1. Introduction

An LSI that decodes bitstreams based on the MPEG2 (Moving Picture Experts Group) [1] standard is a key component for realizing multimedia applications such as the digital video disc.
This is because MPEG2 sufficiently covers current TV-rate video signals in image data compression. Recently, advances in process technology have made single-chip MPEG2 decoder LSI's possible [2], [3]. However, further effort is required to reduce chip size, since the decoder is a cost-sensitive consumer product. Low power is also significant from the viewpoint of cost, because power consumption determines whether an LSI can be assembled in a plastic package. Testability is important as well, because both the size of the test circuits and the testing time affect the chip cost.

To meet these requirements, an MPEG2 decoder usually adopts a dedicated LSI design rather than a programmable one such as a general-purpose DSP. The dedicated LSI is constructed from functional blocks such as a motion compensation block, an inverse quantization block, a discrete cosine transform (DCT) block and so on, each highly optimized for a specific operation. Figure 1 shows the memory macrocells associated with the MPEG2 decoder [3]. Most of the functional blocks use their own dedicated memory macrocells; consequently the memory macrocells are rather small, are distributed over the chip, and occupy roughly 50% of the total core size of the chip. Therefore, the quality of the memory macrocell design significantly affects the chip performance.

This paper describes the on-chip memory-related techniques used to meet these requirements, which can be considered a review of state-of-the-art memory macrocell techniques applicable to other logic LSI's. The next section describes design techniques for memory macrocells: a specially designed memory, special usage of general-purpose embedded memories, and an area-efficient flip-flop. The testability of the memory macrocells and the debugging methodology are given in Sect. 3. The implementation of the MPEG2 decoder LSI is described in Sect. 4, followed by the conclusions in the final section.

Fig. 1 Block diagram of MPEG2 decoder LSI.
Manuscript received November 28, 1995. Manuscript revised January 24, 1996.

<sup>†</sup> The authors are with Semiconductor Device Engineering Laboratory, TOSHIBA CORPORATION, Kawasaki-shi, 210 Japan.

<sup>††</sup> The author is with Semiconductor Group, TOSHIBA CORPORATION, Kawasaki-shi, 210 Japan.

## 2. On-Chip Memory Design Techniques

### 2.1 Orthogonal Memory

Various standards including MPEG2/1, CCITT H.261 [4], and JPEG (Joint Photographic Experts Group) [5] have adopted DCT (discrete cosine transform)-based coding. The designed DCT macrocell [6] executes a two-dimensional 8×8 DCT and inverse DCT (IDCT), delivering high-throughput DCT/IDCT processing of one pixel per clock. The macrocell has a regularized parallel architecture based on distributed arithmetic, which is well known as an effective algorithm for implementing linear product calculation. Direct implementation of the DCT/IDCT matrix operation uses many multiplier-accumulators (MAC's), which consume area. In the distributed arithmetic algorithm, the MAC is realized using ROM's, shifters and accumulators with no hardware multiplier, which reduces the DCT macrocell size.

To implement the distributed arithmetic algorithm, an input buffer is required for parallel-to-serial transposition. The input buffer of the DCT macrocell is illustrated in Fig. 2. Eight 16 bit input data D0, D1, ..., D7 are stored sequentially into the input buffer in a bit-parallel structure. The stored data must then be read out in a bit-slice structure, least significant bit first, as 4 bit addresses for a ROM. The data can be decomposed into two groups according to Chen's method [7], which drastically reduces the number of word lines in the ROM from 256 (8 bit) to 32 (4 bit $\times$ 2 groups). The DCT processors previously reported [8], [9] realized the input buffers using a flip-flop array with shift-register function and multiplexers (Fig.
3), which consumes much area because a flip-flop is generally one of the largest primitive cells in a standard cell library.

Another implementation of the input buffer uses a special-purpose memory named "orthogonal memory." The orthogonal memory, whose circuit diagram is shown in Fig. 4 (a), realizes the above-mentioned functionality. In the orthogonal memory, word lines and bit lines run both vertically and horizontally, as shown in the circuit diagram. Input data are written to the memory cells selected by the horizontal word line (HWL) through the differential vertical bit lines (VBL). In the read operation, after the horizontal bit line (HBL) pairs are precharged, the stored data are read out by sense-amplifying the differential voltage on the horizontal bit lines. The data are read out and used as the ROM address inputs with a latency of 8 cycles.

Fig. 2 Input buffer structure for DCT macrocell.

Fig. 3 Input buffer built with flip-flops.

Fig. 4 Orthogonal memory. (a) Circuit diagram. (b) Interleaved array structure.

Fig. 5 Photomicrograph of orthogonal memory.

In the DCT macrocell, two banks of the orthogonal memory, each consisting of a memory cell array of 8 word $\times$ 16 bit, are required. For further area reduction, the two banks are laid out in an interleaved manner (Fig. 4 (b)), in which two memory cells from the two banks in the same bit position share a VBL and an HBL. The interleaved array structure means that the two banks automatically share both the readout and write-in circuits, including sense amplifiers, write buffers and precharge circuits. Therefore, the DCT input buffer with interleaved orthogonal memory is much smaller than one with two simple orthogonal memories. Such an array structure is possible because the two banks are never in read operation at the same time, nor in write operation at the same time. As also shown in Fig.
4 (b), the memory cells aligned in the same horizontal row are tied to one of two HBLs, the odd HBL or the even HBL, depending on whether the cell belongs to an odd digit or an even digit. This structure makes it possible to read out 16 bits in 8 cycles, which keeps the throughput of the input buffer constant.

A photomicrograph of the orthogonal memory implementing 2 bank $\times$ 8 word $\times$ 16 bit is shown in Fig. 5. The macrocell size is approximately 420 $\mu$m $\times$ 760 $\mu$m with a memory cell size of 10.8 $\mu$m $\times$ 32.0 $\mu$m. The memory realizes the above-mentioned functionality with less than 50% of the area and power that would be needed if the IDCT input buffer were built with flip-flops.

### 2.2 Embedded Memories for Low-Cost and Low-Power

In the MPEG2 decoder LSI, several kinds of internal clocks are used to input the compressed video bitstream, to store the reference picture data in an external DRAM, and to output the decoded display data. Choosing a clock suited to the processing load of each functional block makes it possible to minimize the chip power and chip area. However, a FIFO buffer is required for rate control between two blocks operated by different clocks. The FIFO buffer is usually constructed with a dual-port memory, which has two independent ports for read and write operation. A dual-port memory is also used for the implemented display filter.

Fig. 6 Realizing dual-port memory with a single-port memory (FIFO case). (a) Single-port implementation. (b) Timing chart. (c) Dual-port implementation.

The FIFO's and other dual-port memories are designed by using a single-port RAM operated twice in one clock cycle to reduce area, as shown in Fig. 6. Since a single-port memory cell is half the size of a dual-port memory cell, the single-port RAM operated twice in one clock cycle achieves the same functionality with half the area.
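As an illustration only, the time-multiplexed use of a single-port RAM can be modeled behaviorally in a few lines of Python. The sketch below is not the circuit of Fig. 6. It assumes that the read slot comes first and the write slot second within each clock cycle, and the names `SinglePortRam`, `TwoPhaseFifo`, `clock`, `re`, `we` and `din` are hypothetical, introduced here for the example.

```python
class SinglePortRam:
    """Single-port RAM model: each call performs one read or one write."""

    def __init__(self, depth, width=16):
        self.mask = (1 << width) - 1
        self.cells = [0] * depth
        self.accesses = 0  # counts how often the single port is used

    def access(self, addr, write=False, data=0):
        self.accesses += 1
        if write:
            self.cells[addr] = data & self.mask
            return None
        return self.cells[addr]


class TwoPhaseFifo:
    """FIFO on one single-port RAM that is accessed up to twice per clock."""

    def __init__(self, depth):
        self.ram = SinglePortRam(depth)
        self.depth = depth
        self.rd_ptr = 0
        self.wr_ptr = 0
        self.count = 0

    def clock(self, re=False, we=False, din=0):
        """Model one clock cycle; returns the word read, or None."""
        dout = None
        # Slot 1 (assumed: first half of the cycle): read address is selected.
        if re and self.count > 0:
            dout = self.ram.access(self.rd_ptr)
            self.rd_ptr = (self.rd_ptr + 1) % self.depth
            self.count -= 1
        # Slot 2 (assumed: second half of the cycle): write address is selected.
        if we and self.count < self.depth:
            self.ram.access(self.wr_ptr, write=True, data=din)
            self.wr_ptr = (self.wr_ptr + 1) % self.depth
            self.count += 1
        return dout


if __name__ == "__main__":
    fifo = TwoPhaseFifo(depth=4)
    fifo.clock(we=True, din=11)
    fifo.clock(we=True, din=22)
    # Read and write in the same clock cycle through the single port.
    print(fifo.clock(re=True, we=True, din=33))
```

The access counter makes the point of the technique visible: the single port is used at most twice per clock cycle, once per slot, so one port can serve both the read side and the write side of the FIFO.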
All memory blocks are synchronous self-timed macrocells and contain address pipeline latches for both read and write operation. Without latches inside the macrocell, the timing design would need more time, since the lengths of the interconnections between the latches and a decoder vary from bit to bit. The appropriate address is applied to the single-port RAM through multiplexers controlled by the clock signal. The read operation is activated during the high clock level and the write operation during the low level. To realize rate control in the FIFO buffer, each of the two operations per clock cycle is validated or not by the read/write enable signals of each memory macrocell. For example, for two read operations and one write operation on the FIFO buffer, the read enable (RE) signal is activated in both cycles, whereas the write enable (WE) signal is activated only in the second cycle.

However, this technique does not in itself lower the power of the memory macrocell. In a digital video LSI, the total power consumption of the memories becomes large because many memory macrocells with wide bit width are implemented, in spite of their small capacities. Memory power management is carried out by using a memory macrocell enable (ME) signal when the memory macrocell is not accessed. The memory macrocell enable signal makes it possible to reduce the total memory power effectively, since these memory macrocells can be disabled comparatively often owing to the multi-processing.

Fig. 7 Optimized flip-flop.

### 2.3 Flip-Flop

The flip-flop (F/F) is one of the memory elements in logic LSI's. Since digital video LSI's with many pipeline stages tend to employ several thousand F/F's on a chip, the design of the F/F is crucial for small area and low power. An optimized F/F with hold capability is designed, whose circuit diagram is shown in Fig. 7. The hold capability is required for almost all F/F's in the digital video LSI.
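The hold capability can be summarized behaviorally as a D-F/F whose input multiplexer recirculates the stored value. The following Python sketch models only this logical behavior, not the transistor-level circuit of Fig. 7. The class name `HoldFlipFlop`, the helper `run_stage` and the signal name `hold` are assumptions introduced for illustration.

```python
class HoldFlipFlop:
    """Behavioral model of a 1 bit D-F/F with a hold multiplexer at its input."""

    def __init__(self, q=0):
        self.q = q & 1

    def clock_edge(self, d, hold=False):
        """Apply one active clock edge and return the new output Q."""
        # Input multiplexer: recirculate Q while hold is asserted, else take D.
        self.q = self.q if hold else (d & 1)
        return self.q


def run_stage(data, hold_pattern):
    """Clock one pipeline stage with a per-cycle hold (stall) pattern."""
    ff = HoldFlipFlop()
    return [ff.clock_edge(d, h) for d, h in zip(data, hold_pattern)]


if __name__ == "__main__":
    # D changes every cycle; Q keeps its value while hold is asserted.
    print(run_stage([1, 0, 1, 0], [False, True, True, False]))
```

Modeling the multiplexer and the F/F as one unit mirrors the idea of merging both into a single cell: the surrounding logic sees only D, the hold control, the clock and Q.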
Combining a D-F/F with the multiplexer used for the hold function in one cell improves the layout of the optimized F/F. The multiplexer is incorporated into the first stage of the D-F/F. Moreover, the transistors shaded gray in Fig. 7 are given smaller sizes for low power while maintaining the speed needed for MPEG2 decoding. In particular, minimizing the transistors related to the clocks (CLK, $\phi$, $\overline{\phi}$) reduces power effectively because of the high activity of the clocks, while the output (Q) inverter adopts a typical transistor size to match the output drive capability of the other ASIC cells. Power and area 40% smaller than those of a normal ASIC F/F are realized. The optimized F/F also helps to minimize the chip area, since reducing the number of terminals on the F/F increases the resources available for routing.

## 3. Testing Issues

Establishing full testability of on-chip memories without much overhead is another important issue. Table 1 compares three on-chip memory test strategies: BIST (Built-In Self Test), scan test, and direct test. The direct test mode, in which all memories can be accessed directly from outside, is implemented because of its inherently small area. In the test mode, the DRAM interface pads are turned into test pins and can access each memory block through the internal data bus and address bus, as shown in Figs. 1 and 8. The multiplexers and tri-state buffers related to the memory test, shaded gray in Fig. 8, are included in the respective memory macrocells, which means that the area increase for these test circuits is negligible.

Table 1 Comparison of various memory test strategies.

| Items | Direct | Scan | BIST |
|-----------------|--------|------|------|
| Area | O | Δ | x |
| Test Time | O | x | O |
| Pattern Control | O | O | x |
| Bus Capacitance | Δ | O | O |
| At-speed Test | O | x | O |

O: Good, Δ: Fair, x: Poor

Fig. 8 Direct test architecture for embedded memories.

The present MPEG2 decoder contains a RISC whose firmware is stored in an on-chip ROM. To make debugging easy and extensive, an instruction RAM is put outside the pads in parallel with the instruction ROM and activated by an Al-masterslice in an initial debugging stage, as illustrated in Fig. 9 (see Fig. 10 for a chip photograph). For a sample chip mounted in a plastic package, the instruction RAM is cut off along a scribe line. This scheme enables extensive debugging and early sampling at the same time for firmware-ROM embedded LSI's.

Fig. 9 Instruction RAM masterslice for code debugging.

Fig. 10 Chip photograph of instruction RAM enabled.

## 4. MPEG2 Decoder

Figure 11 shows a microphotograph of the designed MPEG2 decoder LSI, which can decode MPEG2 bitstreams of the main profile at the main level without any external processor. A 7 Kbit line memory is included on the chip to implement 5 horizontal filters. The key features of the chip are summarized in Table 2. The chip size is 12.5 mm $\times$ 12.5 mm, fabricated using 0.8 $\mu$m CMOS technology. The power consumption of the chip is 1.2 W at 27 MHz under a 3.3 V supply voltage. The chip is smaller and consumes less power than the formerly reported MPEG2 decoder thanks to the techniques for memory macrocells and flip-flops presented here.

Fig. 11 Microphotograph of MPEG2 decoder LSI.

Table 2 Key features of MPEG2 decoder LSI.

| Item | Value |
|----------------|-----------------------------|
| Technology | 0.8 μm CMOS |
| Chip Size | 12.5 mm x 12.5 mm |
| Power | 1.2 W |
| Supply Voltage | 3.3 V |
| Package | 160pin QFP |
| Function | MPEG2 decoder (MP@ML) |
| Clock Freq. | 27 MHz (Internal system clk) |

In the designed MPEG2 decoder LSI, internal clocks of 27/18/13.5 MHz are used to minimize the chip power and chip area. Operating the single-port memories twice per cycle of these clocks is well within the capability of the available process technology. The memory macrocells for the FIFO's and the filter functionality occupy about 10% of the area of the MPEG2 decoder chip, and are realized in half the area with the presented technique. However, the power consumption of the single-port memory operated twice is equal to that of the dual-port memory. By using the memory macrocell enable signal diligently, the power consumption of all the memory macrocells distributed on the chip is reduced by 30%, and a 13% reduction in the power consumption of the MPEG2 decoder chip is obtained.

The MPEG2 decoder LSI implements several thousand flip-flops, which occupy 23% of the chip area. Therefore, about 10% power and area reduction of the MPEG2 chip is realized by using the optimized flip-flop. Two orthogonal memories are used in the DCT macrocell to perform the distributed arithmetic algorithm with small area and power. The orthogonal memory is useful in the MPEG environment if it is applied in the many blocks that need the functionality of parallel-to-serial transposition.

## 5. Conclusions

On-chip memory-related techniques for achieving low cost and low power were described, considering the features of digital video LSI's. The orthogonal memory realizes the functionality of parallel-to-serial transposition with 50% of the area and power that would be needed if it were built with a flip-flop array. The orthogonal memory is employed in the DCT macrocell for small area and low power. The memory macrocells and the flip-flops were also improved to achieve low cost and low power. The FIFO's and dual-port memories, which are indispensable for video processing LSI's, were designed by using the single-port RAM operated twice in one clock cycle to reduce area. Power management for all the memory macrocells is carried out using the memory macrocell enable signal when the memory macrocell is not accessed.
The flip-flop with hold capability employing small transistors was proposed to reduce the chip area and chip power effectively. The techniques relating to the memory macrocell and the flip-flop achieve 20% area and 23% power reduction of the fabricated MPEG2 decoder chip. In terms of testability, the direct test mode is implemented for small area. The instruction RAM is placed outside the pad area in parallel with the normal instruction ROM for easy debugging and early sampling. These techniques are successfully applied to the MPEG2 decoder LSI to realize low cost and low power, and are applicable to other logic LSI's.

## Acknowledgment

The authors would like to thank T. Odaka, T. Oto, K. Kitagaki, K. Maeguchi, S. Suzuki, S. Sasaki, S. Kohyama, K. Kanzaki, H. Nakatsuka and Y. Unno for their useful discussions and constant encouragement, and T. Nagamatsu, H. Muraoka, T. Mori, T. Shimazawa, K. Matsuda, Y. Watanabe, F. Sano, A. Chiba, S. Ishiwata, S. Michinaka and T. Demura for their helpful discussions and the implementation of the chip.

## References

- [1] D. Le Gall, "MPEG: A video compression standard for multimedia applications," Commun. ACM, vol. 34, no. 4, pp. 47-58, April 1991.
- [2] T. Demura, T. Oto, K. Kitagaki, S. Ishiwata, G. Otomo, S. Michinaka, S. Suzuki, N. Goto, M. Matsui, H. Hara, T. Nagamatsu, K. Seta, T. Shimazawa, K. Maeguchi, T. Odaka, Y. Uetani, T. Oku, T. Yamakage, and T. Sakurai, "A single-chip MPEG2 video decoder LSI," in ISSCC Dig. Tech. Papers, pp. 72-73, Feb. 1994.
- [3] G. Otomo, H. Hara, T. Oto, K. Seta, K. Kitagaki, S. Ishiwata, S. Michinaka, T. Shimazawa, M. Matsui, T. Demura, M. Koyama, Y. Watanabe, F. Sano, A. Chiba, K. Matsuda, and T. Sakurai, "Special memory and embedded memory macro in MPEG environment," Proc. CICC, pp. 139-142, May 1995.
- [4] M. Liou, "Overview of the px64 kbit/s video coding standard," Commun. ACM, vol. 34, no. 4, pp. 60-63, April 1991.
- [5] G. K.
Wallace, "The JPEG still-picture compression standard," Commun. ACM, vol. 34, no. 4, pp. 31-44, April 1991.
- [6] M. Matsui, H. Hara, K. Seta, Y. Uetani, L. S. Kim, T. Nagamatsu, T. Shimazawa, S. Mita, G. Otomo, T. Oto, Y. Watanabe, F. Sano, A. Chiba, K. Matsuda, and T. Sakurai, "200 MHz video compression macrocells using low-swing differential logic," in ISSCC Dig. Tech. Papers, pp. 76-77, Feb. 1994.
- [7] W. H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Trans. Commun., vol. COM-25, no. 9, pp. 1004-1009, Sept. 1977.
- [8] M. T. Sun, T. C. Chen, and A. M. Gottlieb, "VLSI implementation of a 16x16 discrete cosine transform," IEEE Trans. Circuits & Syst., vol. 36, no. 4, pp. 610-617, April 1989.
- [9] S. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Terane, and M. Yoshimoto, "A 100-MHz 2-D discrete cosine transform core processor," Tech. Dig. Symp. VLSI Circ., pp. 35-36, May 1991; also IEEE J. Solid-State Circ., vol. 27, no. 4, pp. 492-499, April 1992.

Hiroyuki Hara was born on November 19, 1960 in Tokyo, Japan. He received the B.S. degree in electronic engineering from Shibaura Institute of Technology, Tokyo, Japan in 1983. In 1983, he joined Toshiba Corporation, Kawasaki, Japan, where he was engaged in the development and design of bipolar and BiCMOS LSI. He is now in Toshiba's Semiconductor Device Engineering Laboratory, where he has been engaged in the research and development of BiCMOS macrocells for high performance ASIC's. He has been working on the development of DCT macrocell and video compression/decompression LSI's. His present interests include low-power designs.

Masataka Matsui was born in Tokyo, Japan, on August 30, 1960. He received the B.S. and M.S. degrees in electronic engineering from the University of Tokyo, Tokyo, Japan, in 1983 and 1985, respectively.
In 1985 he joined the Semiconductor Device Engineering Laboratory, Toshiba Corporation, Kawasaki, Japan, where he was engaged in the research and development of static memories including 1 Mbit CMOS SRAM and 1 Mbit BiCMOS SRAM, BiCMOS ASIC's, and video compression/decompression LSI's. Since 1993, he was a visiting scholar at Stanford University, where he is working on low-power LSI design. He has currently designed media processors and video compression/decompression LSI's.

Goichi Otomo was born in Ibaraki, Japan, in 1964. He received the B.S. and M.S. degrees in electronic engineering from Ibaraki University, Japan, in 1987 and 1989, respectively. In 1989 he joined the Toshiba ULSI Research Center, where he was involved in the research and development of digital signal processors. Since 1994 he has been working at the Semiconductor Device Engineering Laboratory, Toshiba Corporation, where he has been engaged in the development of MPEG2 decoder systems.

Katsuhiro Seta was born in Tokyo, Japan, on September 9, 1963. He received the B.S. and M.S. degrees in electric engineering from Science University of Tokyo, Japan, in 1987 and 1989, respectively. He joined the Semiconductor Device Engineering Laboratory, Toshiba Corporation, Kawasaki, Japan, in 1989, where he was engaged in the research and development of BiCMOS cache macro for high performance microprocessors. He has currently designed video compression/decompression LSI in Logic Device Design Section.

Takayasu Sakurai was born in Tokyo, Japan in 1954. He received the B.S., M.S. and Ph.D. degrees in electronic engineering from University of Tokyo, Tokyo, Japan, in 1976, 1978, and 1981, respectively. His Ph.D. work was on electronic structures of a Si-SiO<sub>2</sub> interface.
In 1981 he joined the Semiconductor Device Engineering Laboratory, Toshiba Corporation, Japan, where he was engaged in the research and development of CMOS dynamic RAM and 64 Kbit, 256 Kbit SRAM, 1 Mbit virtual SRAM, cache memories, and BiCMOS ASIC's. During the development, he also worked on the modeling of interconnect capacitance and delay, new memory architectures, hot-carrier resistant circuits, arbiter optimization, gate-level delay modeling, alpha/n-th power MOS model and transistor network synthesis. From 1988 through 1990, he was a visiting scholar at Univ. of Calif., Berkeley, doing research in the field of VLSI CAD. He is currently back in Toshiba and managing multimedia LSI development. His present activities include low-power designs, media processors and video compression/decompression LSI's. Dr. Sakurai is a visiting lecturer at Tokyo University and serving as a program committee member for the Custom Integrated Circuit Conference, the Design Automation Conference, the International Conf. on Computer Aided Design, the International Conf. on VLSI and CAD, International Symp. on Low-Power Electronics and Design and the FPGA Workshop. He is a technical committee chairperson for the '97 VLSI Circuits Symposium. He is a member of the IEEE, and the Japan Society of Applied Physics.