# A Low-Power and Low-Noise 20:1 Serializer with Two Calibration Loops in 55-nm CMOS

Yong-Un Jeong, Dept. of Electrical and Computer Engineering, Seoul National University, Seoul, Korea, yongun.jeong@analog.snu.ac.kr

Joo-Hyung Chae, SK Hynix, Icheon, Korea, joohyung.chae@sk.com

Shin-Hyun Jeong, Dept. of Electrical and Computer Engineering, Seoul National University, Seoul, Korea, shinhyun.jeong@analog.snu.ac.kr

Abstract-The increasing data rate of serial links makes it difficult to meet the timing constraints of serializers in transmitters. Delay compensation clock buffers can alleviate this issue by matching the timing between data and clock. However, these buffers consume significant power and become sources of noise at the transmitter output. The problem is more serious for serializers other than 2<sup>n</sup>:1, and restricting a design to 2<sup>n</sup>:1 serializers can limit system design. In this paper, a 20:1 serializer using two calibration feedback loops is presented to solve this issue and reduce power consumption. The two loops detect the phase difference between data and clock, and automatically align the clock phase to the center of the data phase. The loops eliminate the power-consuming clock buffers on the critical clock path and operate at no more than quarter rate, enabling the transmitter to achieve low power consumption and high performance. A 6.4 Gb/s serializer prototype is fabricated in a 55-nm CMOS process with a 1.2 V supply voltage. It achieves a 97.5 ps eye width, which is 62.4% of a unit interval (UI), using PRBS-7 data, and its energy efficiency is 1.60 pJ/bit.

Keywords— Serializer; SerDes; Multiplexer; MUX; Transmitter; Calibration loop; Feedback; Low power

## I. INTRODUCTION

With the growth of the cloud, mobile, and display industries, high-speed serial links are widely used as a cost-efficient method to transfer large amounts of data with fewer pins. Signal integrity and low power consumption are two major challenges in designing high-speed transceivers. As the data rate of serial links increases, the power consumption of I/O transceivers also increases, mainly because of the narrow timing margin caused by the shrinking unit interval (UI) and the high required bandwidth [1].

Sungphil Choi, Dept. of Electrical and Computer Engineering, Seoul National University, Seoul, Korea, sungphil.choi@analog.snu.ac.kr

Jaekwang Yun, Dept. of Electrical and Computer Engineering, Seoul National University, Seoul, Korea, jaekwang.yun@analog.snu.ac.kr

Suhwan Kim, Dept. of Electrical and Computer Engineering, Seoul National University, Seoul, Korea, suhwan@snu.ac.kr

A major challenge in high-speed transmitters is the delay difference between the clock path and the data path at each stage of the serializer [2-6]. This delay difference varies with process, supply voltage, and temperature (PVT) variations, and it increases the jitter of the data eye and degrades the performance of the interface, as shown in Fig. 1 (a). In addition, a difference of more than 1 UI causes setup and hold violations. Insertion of replica clock buffers in the clock path is the simplest and most common solution to this issue [7, 8]. However, the addition of these buffers introduces the following problems: 1) the buffers become a noise source on the clock, thus degrading and limiting the performance of the serializer and the entire transmitter; 2) buffers operating at high speed consume substantial power [9]; 3) if the serializer is multi-stage, the number of replica buffers increases after each stage. In the case of a 2<sup>n</sup>:1 serializer with several 2:1 serializers in series, a delay difference between stages exists because the loading of each stage is about half of that of the previous stage [5]. In the case of serializers other than 2<sup>n</sup>:1, such as a 20:1 serializer, it is even more difficult to obtain a timing margin. This is because the clock loading difference between the first 5:1 and the second 2:1 serializing stage is approximately fivefold, as shown in Fig. 1 (b) [10, 11]. In order to meet the timing constraints, a large number of replica clock buffers must be inserted not only in the second stage but also in the last stage, which makes the problems mentioned above more serious.

Fig. 1. (a) High speed serial link and (b) conventional 20:1 serializer

Fig. 2. (a) Timing constraint in 2:1 serializers and (b) timing diagram considering delay

Calibration loops have been used to alleviate this constraint in transmitters [4, 5, 12, 13]. They adjust the timing between data and clock by constructing a closed loop, which removes the power-consuming delay matching clock buffers. However, because the replica serializer consumes additional power and the phase detector of the loop operates at half rate, the calibration loop in [4] consumes considerable power. In [5, 12], the calibration loop relies on data transitions, so data errors exist during initial operation and the loop takes significant time to settle until there are enough data transitions. In [13], the calibration loop is connected in series with the high-speed clock of the last stage, which becomes an additional noise source, and the last stage operates at half rate, resulting in high power consumption and difficult circuit design. Another alternative is to place an 8:1 or a 4:1 serializer in lieu of 2:1 serializers at the end to mitigate the timing constraint [3, 6, 7, 14-16]. However, this method requires multi-phase clocks, and the skew between the clocks affects the performance of the serializer, so complex circuits are needed for accurate matching. An 8:1 or a 4:1 serializer, which has a high loading at its output, can also degrade the bandwidth of the serializer. In addition, most of the existing efforts address these issues only for $2^{n}$:1 serializers and the last serializing stage.

In this paper, a low-power, high-performance 20:1 serializer is introduced. Two calibration loops are used to automatically align the phases of clock and data, and to reduce power consumption by eliminating power-hungry replica clock buffers and by operating at no more than quarter rate. Section II analyzes the timing constraints of the 20:1 serializer, and Section III describes the proposed 20:1 serializer and its calibration loops. Measurement results of a 6.4 Gb/s prototype are presented in Section IV, and conclusions are drawn in Section V.

## II. TIMING CONSTRAINTS OF 20:1 SERIALIZER

Fig. 2 (a) shows the block diagram of a 2:1 serializer including its clock tree. In the 4:2 serializing stage, the data is triggered by the rising edges of CLK2, which causes a delay between CLK2 and the output data. This data is then sampled in a 2:1 serializing stage with CLK, which is two times faster than CLK2. As shown in Fig. 2 (b), the loading of CLK is about half of that of CLK2, so the buffer delay of CLK, t1, is shorter than that of CLK2, t2, which produces a delay difference between CLK and CLK2. When CLK samples the data, the two must be aligned correctly despite this delay difference. This is the timing constraint in the 20:1 serializer and is one of the major challenges limiting the operation of the transmitter at high speeds. To alleviate this issue, clock buffers that match the delay should be used in the clock path. The compensation delay, tc1, can be expressed as

$$t_{margin} < tc1 - (t2 - t1) < 1UI - t_{margin}, \qquad (1)$$

where t2 and t1 are the delay of the data path and the clock path, respectively, and  $t_{margin}$  is the required timing margin for proper operation, which is about 12% of 1-UI [5].
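As a rough illustration of Eq. (1), the following Python sketch computes the range of compensation delays that keeps the sampling point inside the required margin. The function names and the path delays of 120 ps and 60 ps are hypothetical values chosen for illustration, not figures from this work; 1 UI is assumed here to be the 156.25 ps bit period at 6.4 Gb/s, and the margin is taken as 12% of 1 UI as stated above.

```python
def compensation_window(ui, t_data, t_clk, margin_frac=0.12):
    """Return the (min, max) compensation delay allowed by Eq. (1).

    ui          -- unit interval in seconds
    t_data      -- delay of the data path (t2)
    t_clk       -- delay of the clock path (t1)
    margin_frac -- required timing margin as a fraction of 1 UI
    """
    t_margin = margin_frac * ui
    skew = t_data - t_clk
    # Eq. (1): t_margin < tc1 - (t2 - t1) < 1 UI - t_margin
    return skew + t_margin, skew + ui - t_margin


def satisfies_eq1(tc1, ui, t_data, t_clk, margin_frac=0.12):
    """Check whether a given compensation delay tc1 meets Eq. (1)."""
    lo, hi = compensation_window(ui, t_data, t_clk, margin_frac)
    return lo < tc1 < hi


if __name__ == "__main__":
    ui = 1 / 6.4e9              # assumed: 1 UI at 6.4 Gb/s
    t2, t1 = 120e-12, 60e-12    # hypothetical path delays
    lo, hi = compensation_window(ui, t2, t1)
    print(f"tc1 must lie between {lo * 1e12:.2f} ps and {hi * 1e12:.2f} ps")
```

Under these assumed numbers the usable window is 0.76 UI wide, which is the full width available to absorb any PVT drift of the compensation buffers.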

Fig. 3. (a) Timing constraint in 5:1 serializer and (b) timing diagram without considering delay

The 5:1 serializer and its clock tree are shown in Fig. 3 (a). The data is sampled at five flip-flops using the edge of CLK5 and serialized by five selection signals, S0-4. The selection signals are generated from CLK5 and CLK, so replica clock buffers should be used to compensate for the timing difference between them. The compensation delay, tc2, can be expressed as

$$t_{margin} < tc2 - (t4 - t3) < 1UI - t_{margin}, \qquad (2)$$

where t3 and t4 are the delays of the CLK path and the CLK5 path, respectively. Fig. 3 (b) shows the timing diagram of the 5:1 serializer without considering the delay. If a timing mismatch exists between CLK and CLK5, the selection signal cannot sample the data signal accurately. However, as shown in Fig. 3 (a), the loading of CLK5 is about five times that of CLK, a much larger ratio than that of CLK2 in 2:1 serializers [10, 11]. This means that the compensation buffer delay, tc2, is much larger than tc1. With such a long delay, it is difficult to meet the tight timing constraints at high speeds because the delay of the buffer changes with PVT variations.
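To make the sensitivity argument concrete, the sketch below estimates how much fractional error in the compensation delay Eq. (2) can tolerate as the nominal delay difference grows. It rests on simplifying assumptions that are not taken from this work: the compensation delay is designed to center the sampling point (tc2 equal to the delay difference plus 0.5 UI), PVT scales tc2 by a factor (1 + eps) while the delay difference stays nominal, and the delay differences of 1 UI and 5 UI are hypothetical.

```python
def max_tolerable_mismatch(skew_ui, margin_frac=0.12):
    """Largest fractional error of tc2 that still satisfies Eq. (2).

    skew_ui     -- nominal delay difference (t4 - t3), in UI
    margin_frac -- required timing margin as a fraction of 1 UI

    Assumption: tc2 = skew + 0.5 UI at the nominal corner, and PVT
    scales tc2 by (1 + eps) while the skew itself does not change.
    The residual tc2 - (t4 - t3) is then 0.5 + eps * tc2 (in UI).
    """
    tc_nominal = skew_ui + 0.5
    slack = 0.5 - margin_frac   # distance from the center to either limit, in UI
    return slack / tc_nominal


if __name__ == "__main__":
    for skew in (1.0, 5.0):     # hypothetical delay differences in UI
        eps = max_tolerable_mismatch(skew)
        print(f"skew = {skew:.1f} UI -> tolerable delay error = {eps * 100:.1f} %")
```

In this simplified model the tolerable error shrinks roughly in proportion to the total delay being matched, which is why a long replica delay is harder to keep within the window than a short one.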

The use of replica clock buffers to mitigate timing constraints is an intuitive and straightforward method, but as mentioned earlier, it has several problems. The jitter performance of the serializer is mainly determined by the last clock closest to the output data [4], and the output of the serializer directly affects the performance of the overall transmitter. Therefore, using additional buffers in the critical clock path makes the transmitter vulnerable to noise such as simultaneous switching noise (SSN) and supply noise. If the serializer is composed of several stages, the number of buffers increases at each stage. Consequently, a large number of buffers is required at the last stage. For serializers other than $2^n$:1, such as a 20:1 serializer, the problem becomes worse due to the large loading difference between stages. Additional circuits such as delay-locked loops (DLLs) are required to keep long delays constant, since they vary with PVT changes. Another problem is that buffers operating at high speed consume large power [9]. In this paper, we propose a low-power serializer that can break this trade-off and operate at high speed.

## III. PROPOSED 20:1 SERIALIZER AND CALIBRATION LOOPS

Fig. 4 shows the proposed 20:1 serializer, which automatically relaxes the timing constraints using two calibration loops. It reduces the loading of high-speed clocks by placing a 5:1 serializer at the front [11]. The first loop automatically aligns the timing of CLK and CLK2 by adjusting the phase interpolator (PI). The flip-flop, which compares the two clock phases, operates at quarter rate using CLK2, and the remaining circuits operate at a much lower speed with CLK10. A digital PI with a control range of 360° is used to reduce power consumption in this loop [17]. Since the 5:1 serializer operates at a low speed, its timing margin is relatively large, as shown in Eqn. 2. Therefore, to avoid excessive power consumption, the second loop is designed to select one of two clocks within the 1-UI interval using a simple MUX instead of a PI. Its 1-bit, 1-UI resolution is coarse compared with that of a PI, but it is sufficient to satisfy the timing constraints. By using two loops, the most critical clock, CLK, does not pass through any replica buffers and is thus minimally exposed to noise. The loops, which operate using only clocks, not data, do not compensate for one $t_{CK-Q}$ delay [4, 5, 12]. However, this delay can be compensated asynchronously because it is shorter than the delay of the clock buffers. The proposed serializer can operate at low power because it does not use a replica 2:1 serializer [4]. Since the two calibration loops do not rely on data transitions, they can always operate rapidly in the background [5, 12].

Fig. 4. Proposed 20:1 serializer and calibration loops

Fig. 5. Timing diagram of (a) first loop and (b) second loop

Fig. 6. (a) Microphoto and (b) measurement environment

Fig. 7. Measured 6.4 Gb/s eye diagram with PRBS-7 data

Fig. 8. Measured bathtub with 6.4 Gb/s PRBS-7 data

The timing diagram of the two loops of the proposed 20:1 serializer is shown in Fig. 5. The first loop samples CLK at the rising edge of CLK2 to determine the phase difference between the two clocks, as shown in Fig. 5 (a). If the sampled result is LOW, CLK2 is faster than CLK, so the control codes of the PI are increased. If the sampled result is HIGH, CLK2 is slower than CLK, so the control codes of the PI are decreased. Therefore, when the loop is locked, the rising edge of CLK2 is aligned with the rising edge of CLK, and hence with the center of the data. To save power, the first loop has no mechanism to stop the toggling of the PI control codes once locked, but this toggling is small enough that the PI's resolution satisfies the timing constraint of Eqn. 1.

Fig. 5 (b) shows the timing diagram of the second loop. As shown in Fig. 3, if the difference between CLK2 and CLK10 exceeds 1 UI of CLK2 in the 5:1 serializer, the next or previous rising edge of CLK2 is compared with CLK10. Thus, what matters is accurate phase matching, not delay matching. For example, if t4 - t3 is 2.3 UI of CLK2, the calibration loop only needs to compensate for 0.3 UI, not 2.3 UI. Therefore, using loops based on a simple MUX instead of power-consuming buffers can significantly reduce power consumption. When CLK10 samples CLK2, the 1-bit selection code of the MUX is maintained if the result is HIGH. If the result is LOW, the output of the MUX is changed. When the loop is locked, the rising edge of CLK10 is located in the HIGH section of CLK2, meaning that the falling edge of CLK2 has a timing margin of 1 UI or more when CLK10 is sampled in the selection signal generator. The loop not only ensures timing margins when generating selection signals, but also matches the timing between the selection signals and the corresponding data. The simple-structured loop with only 1-bit resolution is power efficient while ensuring a sufficient timing margin. In any case, the two calibration loops adjust only CLK2 and CLK10, and CLK is not affected. This allows the serializer to maintain constant performance regardless of PVT variations without degrading overall performance.
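The decision rules of the two loops can be sketched with a simple behavioral model. The Python code below is only an idealized illustration, not the implemented circuit: phases are normalized to one clock period, CLK and CLK2 are assumed to have 50% duty cycles, the PI step of 1/64 period is a hypothetical value, and the two MUX inputs of the second loop are assumed to be half a CLK2 period apart. All function names are made up for this example.

```python
def sample_clk(phase):
    """Sample CLK at a CLK2 rising edge.

    phase -- position of the CLK2 edge relative to a CLK rising edge,
             in CLK periods. CLK is assumed HIGH for the first half period.
    """
    return (phase % 1.0) < 0.5


def run_first_loop(initial_phase, step=1 / 64, iterations=200):
    """Bang-bang update of the PI code, modeled directly as a phase."""
    phase = initial_phase
    history = []
    for _ in range(iterations):
        if sample_clk(phase):
            phase -= step    # HIGH: CLK2 is slower (late), decrease the code
        else:
            phase += step    # LOW: CLK2 is faster (early), increase the code
        history.append(phase)
    return history


def lock_error(phase):
    """Distance from the CLK2 edge to the nearest CLK rising edge."""
    return abs(phase - round(phase))


def run_second_loop(skew, select=0, iterations=4):
    """1-bit MUX loop: keep the selection on HIGH, change it on LOW.

    skew -- delay of CLK10 relative to CLK2, in CLK2 periods. Only the
            fractional part matters, e.g. 2.3 behaves like 0.3.
    """
    for _ in range(iterations):
        phase = (skew + 0.5 * select) % 1.0
        if not phase < 0.5:  # CLK10 edge falls in the LOW section of CLK2
            select ^= 1      # switch to the other MUX input
    return select, (skew + 0.5 * select) % 1.0


if __name__ == "__main__":
    final = run_first_loop(0.3)[-1]
    print(f"first loop residual error: {lock_error(final):.4f} period")
    print("second loop (select, phase):", run_second_loop(2.8))
```

In this model the first loop settles into a dither of at most one PI step around the aligned position, and the second loop needs at most one change of the selection bit to place the CLK10 edge in the HIGH section of CLK2.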

## IV. MEASUREMENT RESULTS

A prototype of the proposed serializer was fabricated in a 55-nm CMOS process. It occupies $0.0207 \text{ mm}^2$ (230 µm x 90 µm), as shown in Fig. 6 (a), and consists of a 20:1 serializer, clock dividers, and a clock tree. To verify the proposed serializer, the prototype also includes a pattern generator, a clock buffer, and a driver. The serializer receives PRBS-7 data generated by the pattern generator, and the clock buffer receives an external clock.



Fig. 9. Measured jitter at 6.4 Gb/s according to changes in the first loop's control codes

TABLE I. COMPARISON WITH PREVIOUS SERIALIZERS

|                            | [2]                | [3]   | [10]  | This Work |
|----------------------------|--------------------|-------|-------|-----------|
| Process (nm)               | 90                 | 180   | 180   | 55        |
| Supply voltage (V)         | N/A                | 1.8   | 1.8   | 1.2       |
| Data rate (Gb/s)           | 12                 | 10    | 3.2   | 6.4       |
| SER ratio                  | 8:1                | 8:1   | 20:1  | 20:1      |
| Power (mW)                 | 308<sup>a</sup>    | 22.50 | 43    | 10.26     |
| Energy efficiency (pJ/bit) | 25.67<sup>a</sup>  | 2.25  | 13.44 | 1.60      |
| Area (mm<sup>2</sup>)      | 0.12<sup>a</sup>   | N/A   | 0.045 | 0.0207    |

a. Including the output buffer

The driver transfers the output of the serializer off chip. Fig. 6 (b) shows the measurement environment. The 3 GHz clock is provided by an external clock generator, and the eye diagram and the jitter of the output data are measured with an oscilloscope. I2C is used to adjust registers inside the chip.

Fig. 7 shows the measured eye diagram with a 6.4 Gb/s PRBS-7 pattern. It has an eye width of 97.5 ps, which is 0.62 UI. The serializer prototype consumes 10.26 mW at 6.4 Gb/s, and its energy efficiency is 1.60 pJ/bit. Fig. 8 shows the measured bathtub curves at 6.4 Gb/s; the prototype has a bit error rate (BER) of $10^{-12}$ with a timing margin of 0.62 UI.
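The reported figures of merit follow directly from the measured values. The short Python sketch below reproduces the arithmetic; the constant and function names are chosen for this example only.

```python
DATA_RATE = 6.4e9       # b/s, measured data rate
EYE_WIDTH = 97.5e-12    # s, measured eye width
POWER = 10.26e-3        # W, measured power consumption


def unit_interval(data_rate):
    """Duration of one bit in seconds."""
    return 1.0 / data_rate


def eye_width_ui(eye_width, data_rate):
    """Eye width expressed as a fraction of 1 UI."""
    return eye_width * data_rate


def energy_per_bit(power, data_rate):
    """Energy efficiency in joules per bit."""
    return power / data_rate


if __name__ == "__main__":
    print(f"UI         = {unit_interval(DATA_RATE) * 1e12:.2f} ps")
    print(f"Eye width  = {eye_width_ui(EYE_WIDTH, DATA_RATE):.3f} UI")
    print(f"Efficiency = {energy_per_bit(POWER, DATA_RATE) * 1e12:.2f} pJ/bit")
```

At 6.4 Gb/s one UI is 156.25 ps, so the 97.5 ps eye corresponds to 0.624 UI, and 10.26 mW divided by 6.4 Gb/s gives about 1.60 pJ/bit.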

Fig. 9 shows the measured jitter according to the control codes of the PI in the first calibration loop. The control codes were swept around the calibration result codes, and the measured result shows that the first loop successfully finds the center of the data.

Table I summarizes the performance of our design and compares it with that of other serializer designs. Among the compared designs, the proposed serializer has the lowest power consumption (10.26 mW) and the best energy efficiency (1.60 pJ/bit).

## V. CONCLUSION

This paper has presented the design of a power-efficient 20:1 serializer using two calibration loops. The phase between the data and the clock of each stage is automatically aligned through the calibration loops, and the performance of the overall transmitter was improved by simplifying the critical clock path of the last serializing stage and keeping the clock at the center of the data. Since the first calibration loop operates at maximum quarter rate and the second loop is optimized for 1-bit resolution, our proposed serializer consumes less power than conventional serializers.

## REFERENCES

- [1] M.-H. Chien, Y.-L. Lee, J.-R. Goh, S.-J. Chang, "A low power duobinary voltage mode transmitter," *IEEE International Symposium on Low Power Electronics and Design (ISLPED)*, Jul. 2017.
- [2] W.-Y. Tsai, C.-T. Chiu, J.-M. Wu, S. S. H. Hsu, Y.-S. Hsu, "A novel low gate-count pipeline topology with multiplexer-flip-flops for serial link," *IEEE Transactions on circuits and systems I (TCAS-I)*, vol. 59, no. 11, pp. 2600-2610, Nov. 2012.
- [3] J.-Y. Song and O.-K. Kwon, "Low-power 10-Gb/s transmitter for high-speed graphic DRAMs using 0.18-μm CMOS technology," *IEEE Transactions on circuits and systems II (TCAS-II)*, vol. 58, no. 12, pp. 921-925, Dec. 2011.
- [4] K. Kanda et al., "A single-40 Gb/s dual-20 Gb/s serializer IC with SFI-5.2 interface in 65 nm CMOS," *IEEE J. of Solid-State Circuits (JSSC)*, vol. 44, no. 12, pp. 3580-3589, Dec. 2009.
- [5] K. Huang, Z. Wang, X. Zheng, C. Zhang, Z. Wang, "A 80 mW 40 Gb/s transmitter with automatic serializing time window search and 2-tap preemphasis in 65 nm CMOS technology," *IEEE Transactions on circuits and systems I (TCAS-I)*, vol. 62, no. 5, pp. 1441-1450, May 2015.
- [6] A. A. Hafez, M.-S. Chen, C.-K. K. Yang, "A 32-48 Gb/s serializing transmitter using multiphase serialization in 65 nm CMOS technology," *IEEE J. of Solid-State Circuits (JSSC)*, vol. 50, no. 3, pp. 763-775, Mar. 2015.

- [7] X. Zheng et al., "A 40-Gb/s quarter-rate serdes transmitter and receiver chipset in 65-nm CMOS," *IEEE J. of Solid-State Circuits (JSSC)*, vol. 52, no. 11, pp. 2963-2978, Nov. 2017.
- [8] H. Tao et al., "40–43-Gb/s OC-768 16:1 MUX/CMU chipset with SFI-5 compliance," *IEEE J. of Solid-State Circuits (JSSC)*, vol. 38, no. 12, pp. 2169–2180, Dec. 2003.
- [9] L. Lu, Y. Wang, H. Wu, "An energy-efficient high-swing PAM-4 voltage-mode transmitter," *IEEE International Symposium on Low Power Electronics and Design (ISLPED)*, Jul. 2018.
- [10] C. H. Hsiao et al., "A 3.2 Gbit/s CML transmitter with 20:1 multiplexer in 0.18 μm CMOS technology," *International Conference Mixed Design of Integrated Circuits and System (MIXDES)*, pp. 179-183, June 2006.
- [11] J. Park, J.-H. Chae, Y.-U. Jeong, J.-W. Lee, S. Kim, "A 2.1-Gb/s 12-channel transmitter with phase emphasis embedded serializer for 55-in UHD intra-panel interface," *IEEE J. of Solid-State Circuits (JSSC)*, vol. 53, no. 10, pp. 2878-2888, Oct. 2018.
- [12] P.-J. Peng, Y.-T. Chen, S.-T. Lai, C.-H. Chen, H.-E. Huang, T. Shih, "6.7 A 112Gb/s PAM-4 voltage-mode transmitter with 4-tap two-step FFE and automatic phase alignment techniques in 40nm CMOS," *IEEE International Solid-State Circuits Conference (ISSCC)*, pp. 124-126, Feb. 2019.
- [13] P.-C. Chiang, H.-W. Hung, H.-Y. Chu, G.-S. Chen, J. Lee, "60Gb/s NRZ and PAM4 transmitters for 400GbE in 65nm CMOS," *IEEE International Solid-State Circuits Conference (ISSCC)*, pp. 42-44, Feb. 2014.
- [14] J. Kim et al., "A 16-to-40Gb/s quarter-rate NRZ/PAM4 dual-mode transmitter in 14 nm CMOS," *IEEE International Solid-State Circuits Conference (ISSCC)*, pp. 60–62, Feb. 2015.
- [15] M.-S. Chen and C.-K. K. Yang, "A 50–64 Gb/s serializing transmitter with a 4-tap, LC-ladder-filter-based FFE in 65 nm CMOS technology," *IEEE J. of Solid-State Circuits (JSSC)*, vol. 50, no. 8, pp. 1903–1916, Aug. 2015.
- [16] P. Chiang, W. J. Dally, M. J. E. Lee, R. Senthinathan, Y. Oh, and M. A. Horowitz, "A 20-Gb/s 0.13-μm CMOS serial link transmitter using an LC-PLL to directly drive the output multiplexer," *IEEE J. of Solid-State Circuits (JSSC)*, vol. 40, no. 4, pp. 1004–1011, Apr. 2005.
- [17] J.-H. Chae et al., "266–2133 MHz phase shifter using all-digital delay-locked loop and triangular-modulated phase interpolator for LPDDR4X interface," *Electronics Letters (EL)*, vol. 53, no. 12, pp. 766-768, June 2017.