# A 370-fJ/b, 0.0056 mm<sup>2</sup>/DQ, 4.8-Gb/s DQ Receiver for HBM3 with a Baud-Rate Self-Tracking Loop

Han-Gon Ko<sup>1</sup>, Soyeong Shin<sup>1</sup>, Chan-Ho Kye<sup>1</sup>, Sang-Yoon Lee<sup>1</sup>, Jaekwang Yun<sup>1</sup>, Hae-Kang Jung<sup>2</sup>, Doobock Lee<sup>2</sup>, Suhwan Kim<sup>1</sup> and Deog-Kyoon Jeong<sup>1</sup>

<sup>1</sup>Seoul National University, Seoul, Korea and <sup>2</sup>SK Hynix, Icheon, Korea

dkjeong@snu.ac.kr

#### **Abstract**

This paper presents a data (DQ) receiver for HBM3 with a self-tracking loop that tracks the phase skew between DQ and the data strobe (DQS) caused by voltage or thermal drift. The self-tracking loop achieves low power and small area by using an analog-assisted baud-rate phase detector. The proposed pulse-to-charge (PC) phase detector (PD) converts the phase skew into a voltage difference and detects the skew polarity from that difference. An offset calibration scheme that compensates for mismatch in the PD is also proposed. The calibration scheme needs no additional sensing circuits because it takes advantage of the write training of HBM. Fabricated in 65 nm CMOS, the DQ receiver shows a power efficiency of 370 fJ/b at 4.8 Gb/s and occupies 0.0056 mm<sup>2</sup>. The experimental results show that the DQ receiver operates without any performance degradation under a ±10% supply variation.

**Keywords:** HBM, memory interface, phase detector, forwarded-clock (FC) receiver and baud-rate CDR.

#### Introduction

High bandwidth memory (HBM) DRAM offers higher bandwidth and better energy efficiency than graphics DDR (GDDR) DRAM. Because HBM uses a 3D-stacked package, voltage and thermal drift are more severe than in 2D-packaged DRAMs [1]. When voltage or thermal drift occurs after the write training, the relative phase between the data (DQ) and the data strobe (DQS) changes and the sampling timing margin is degraded.
This problem can be mitigated by inserting a delay line that mimics the DQS path in front of the DQ samplers. However, since HBM has many more DQ lanes than other DRAM interfaces, the power and area cost of the replica delay is unacceptable. Therefore, a self-tracking loop that can compensate for the timing skew in the background is necessary. Unlike a conventional clock and data recovery (CDR) circuit, the DQ receiver of HBM cannot adopt the 2X oversampling CDR architecture directly, since only a baud-rate clock is available. In addition, since DQS toggles only during preambles and write bursts in burst-mode operation, a phase-interpolator-based CDR architecture is also not acceptable.

This paper presents a low-power, small-area 4.8-Gb/s DQ receiver with a self-tracking loop that uses an analog-assisted baud-rate phase detector (PD). Since the proposed pulse-to-charge (PC) PD operates with only the baud-rate clock, it is suitable for HBM3 and is effective in terms of power and area.

## **DQ Receiver Implementation**

A. Overall architecture: Fig. 1 shows the overall architecture of the proposed DQ receiver. The amplifier amplifies the 400-mV, 2.4-GHz DQS input to full swing, and the IQ divider divides the differential 2.4-GHz DQS into 1.2-GHz 4-phase quadrature clocks. The 4.8-Gb/s, 4-channel DQ inputs are sampled by the 1.2-GHz quadrature clocks. The proposed PC PD detects the phase error between DQ and DQS without an extra clock phase for edge sampling. The operation of the PC PD is described in detail in the next section. The digital loop filter accumulates the output of the PC PD and adjusts the delay control word (DCW) toward the optimum sampling point.

B. Proposed pulse-to-charge phase detector: Fig. 2 shows the circuit implementation and the timing diagram of the proposed PC PD.
If there is a rising edge of DQ between the rising edges of DQS\_I and DQS\_Q, the PC PD charges one capacitor voltage (the first V<sub>c</sub>) from the rising edge of DQS\_I to the rising edge of DQ, and charges the other capacitor voltage (the second V<sub>c</sub>) from the rising edge of DQ to the rising edge of DQS\_Q. If DQS leads DQ, the time between the rising edge of DQS\_I and the rising edge of DQ is greater than the time between the rising edge of DQ and the rising edge of DQS\_Q, so the first V<sub>c</sub> is greater than the second and UP becomes logic 0; the opposite case gives UP equal to logic 1. The output of the PC PD is valid only when there is a rising edge of DQ between the rising edges of DQS\_I and DQS\_Q.

Since the PC PD evaluates the phase error in the voltage domain, a random variation of a capacitance or a threshold voltage can induce an offset. If there is a mismatch between the capacitors or an offset in the sampler, the locking point of the self-tracking loop can deviate from the optimum point. Fortunately, DQ and DQS are aligned to the optimum sampling point after the write training sequence, so the PC PD can be calibrated using the optimum sampling point as a reference. When DQ and DQS are aligned to the optimum sampling point, the output of the PC PD represents the polarity of the offset. The conceptual timing diagram of the proposed offset calibration scheme is shown in Fig. 3. The optimum calibration code is found by changing the calibration code according to the output of the PC PD and then checking the output of the PC PD again. In this paper, the 4-bit calibration code is found with a binary search, so the offset calibration of the PC PD completes in a few DQS cycles.

C. Other blocks: The 6-bit digitally controlled delay line (DCDL) is implemented using inverters, MOS capacitors and NMOS switches. The delay of the DCDL is controlled by adjusting the load capacitance of the inverters. The amplifier consists of two differential amplifier stages.
The first-stage amplifier uses a resistive load for high bandwidth and the second-stage amplifier uses an active load for large gain.

## **Experimental Results**

Fig. 4 shows the chip photomicrograph, core layout and the power breakdown of the prototype chip. A pattern generator, an emulated silicon interposer channel and a transmitter are implemented to emulate the environment in which the DQ receiver operates in HBM3. The total power consumption is 7.04 mW, of which the power overhead of the self-tracking loop is 1.47 mW.

Fig. 5 shows V<sub>ref</sub> shmoo plots of the DQ receiver with and without the proposed self-tracking loop and the offset calibration scheme. The timing step and V<sub>ref</sub> voltage step are 3 ps (about 14 mUI at 4.8 Gb/s) and 25 mV, respectively. Assuming that the write training is performed with a 1.1-V supply, the margin is defined as the maximum tolerable DQS timing variation from the trained point for error-free operation.

Fig. 6 shows the measured timing margin versus the supply voltage. After the write training is performed, the timing margin is maximized to 0.38 UI both with and without the self-tracking loop. However, if voltage drift occurs, the margin of the DQ receiver without the self-tracking loop reduces significantly. The margin of the DQ receiver with the self-tracking loop remains almost unchanged across the 1-V to 1.2-V supply range. When the proposed offset calibration scheme is disabled, the lock point deviates from the optimum sampling point due to the offset and the margin is degraded by 0.07 UI. Even then, since the self-tracking loop tracks the skew due to the voltage drift, the margin does not change with the supply variation.

Table I compares this work against state-of-the-art forwarded-clock receivers. The proposed DQ receiver achieves the smallest area per lane and comparable power efficiency among the recent works.

**Acknowledgements**

This paper is a result of a research project supported by SK Hynix Inc.
and the Inter-university Semiconductor Research Center (ISRC) of Seoul National University.

### References

- [1] K. Sohn, JSSC, 2017, pp. 250-260.
- [2] S. Chen, CICC, 2013, pp. 1-4.
- [3] W. Bae, TCAS-I, 2016, pp. 1393-1403.
- [4] H. Li, VLSI, 2014, pp. 1-2.

Fig. 1. Overall architecture of proposed DQ receiver.

Fig. 2. (a) Circuit implementation and (b) timing diagram of the proposed pulse-to-charge phase detector.

Fig. 3. Timing diagram of the offset calibration of the PC PD.

Fig. 4. Chip photomicrograph, core layout and power breakdown.

Fig. 5. V<sub>ref</sub> shmoo plots of the proposed DQ receiver (a) without the self-tracking loop and (b) with the self-tracking loop and calibration.

Fig. 6. Measured timing margin versus supply voltage.

TABLE I PERFORMANCE COMPARISON WITH FC RECEIVER

| | CICC'13 [2] | TCAS-I'16 [3] | SOVC'14 [4] | This work |
|---|---|---|---|---|
| Data rate | 6.4 Gb/s | 12.5 Gb/s | 14 Gb/s | 4.8 Gb/s |
| Clock rate | 3.2 GHz | 6.25 GHz | 3.5 GHz | 1.2 GHz |
| Technology | 28 nm | 65 nm | 65 nm | 65 nm |
| CDR arch. | DLL | DLL | PI | DLL |
| Area<sup>\*</sup> | 0.02 mm<sup>2</sup> | 0.025 mm<sup>2</sup> | 0.36 mm<sup>2</sup> | 0.0056 mm<sup>2</sup> |
| VDD | 0.85/1.1 V | 0.9 V | 0.8 V | 1 V / 1.1 V / 1.2 V |
| Power efficiency | 1.2 pJ/b | 0.36 pJ/b | 0.56 pJ/b | 0.28 pJ/b / 0.37 pJ/b / 0.47 pJ/b |

<sup>\*</sup> Area per data lane
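
The PC PD decision rule and the binary-search offset calibration described in Section B, together with the reported power figures, can be illustrated with a small behavioral sketch. This is not the authors' circuit or test code. The linear time-to-voltage gain, the millivolt units, the trim centered at mid-scale code 8, the example edge times, and all function names are assumptions made for illustration. The last function only checks that the reported 7.04 mW total power is consistent with 370 fJ/b if the power is shared by the four 4.8-Gb/s DQ lanes, which is also an assumption.

```python
BITS = 4                    # 4-bit calibration code, as stated in the paper
MID = 1 << (BITS - 1)       # assumed: trim is centered at mid-scale (code 8)


def pc_pd_up(t_dqs_i, t_dq, t_dqs_q, offset_mv=0.0, cal_code=MID,
             lsb_mv=1.0, gain_mv_per_ps=1.0):
    """Return UP (0 or 1) for one DQ rising edge, or None if the output is invalid.

    Times are in ps. Electrical parameters are hypothetical placeholders.
    """
    if not (t_dqs_i < t_dq < t_dqs_q):
        return None  # valid only for a DQ edge between the DQS_I and DQS_Q edges
    v_first = gain_mv_per_ps * (t_dq - t_dqs_i)    # charged from DQS_I edge to DQ edge
    v_second = gain_mv_per_ps * (t_dqs_q - t_dq)   # charged from DQ edge to DQS_Q edge
    correction = (cal_code - MID) * lsb_mv         # assumed linear trim
    diff = v_first - v_second + offset_mv - correction
    return 0 if diff > 0 else 1                    # first > second gives UP = 0


def calibrate_offset(offset_mv, lsb_mv=1.0, edges_ps=(0, 104, 208)):
    """Binary-search the code with DQ centered between DQS_I and DQS_Q.

    With aligned edges the PD output depends only on the residual offset,
    so each trial resolves one bit of the calibration code.
    """
    code = 0
    for bit in reversed(range(BITS)):
        trial = code | (1 << bit)
        up = pc_pd_up(*edges_ps, offset_mv=offset_mv,
                      cal_code=trial, lsb_mv=lsb_mv)
        if up == 0:          # residual offset still positive: keep this bit
            code = trial
    return code


def energy_per_bit_fj(power_mw, lanes, rate_gbps):
    """Energy per bit in fJ; mW divided by Gb/s gives pJ/b."""
    return 1000.0 * power_mw / (lanes * rate_gbps)


if __name__ == "__main__":
    print(pc_pd_up(0, 150, 208))          # late DQ edge (DQS leads DQ)
    print(calibrate_offset(2.5))          # offset of +2.5 LSB
    print(calibrate_offset(-3.2))         # offset of -3.2 LSB
    print(round(energy_per_bit_fj(7.04, 4, 4.8), 1))
```

In this model, a +2.5-LSB offset settles at code 10 and a -3.2-LSB offset at code 4, each leaving a residual of less than one LSB after four trials, which mirrors the "few DQS cycles" calibration time mentioned in the text. The energy check gives roughly 367 fJ/b, in line with the 0.37 pJ/b entry for the 1.1-V supply in Table I.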