Dynamic Bitstream Length Scaling Energy Effective Stochastic LDPC Decoding

Thomas Marconi and Sorin Cotofana
Computer Engineering Lab, TU Delft, The Netherlands
{T.Marconi, S.D.Cotofana}@tudelft.nl

ABSTRACT
Stochastic Computing (SC) is an attractive solution for implementing Low Density Parity Codes (LDPC) decoders due to its fault tolerance capability and low hardware requirements. However, in practical implementations, SC efficiency is limited by the Stochastic Bitstream (SB) length and by the computation inaccuracies due to non-unique SB representations. In this paper, rather than statically fixing the SB length at run-time, we propose a Dynamic Bitstream Length Scaling (DBLS) technique, which adjusts on-the-fly the SB length such that Quality of Service requirements for energy efficient LDPC decoding are fulfilled. In this way, depending on the communication channel condition, different SB lengths are adaptively utilized such that the best decoding performance vs energy consumption tradeoff is achieved. To evaluate the DBLS practical implications we selected an (1296, 648) LDPC with \(d_v = 3\) and \(d_c = 6\) and implemented our approach and the best state-of-the-art stochastic LDPC decoder with 64-bit edge memory on a Virtex-7 FPGA. Experimental results indicate that our proposal requires 9% more FFs and 3% more LUTs while diminishing the energy consumption by 31-80% and providing 1.5-5.1x higher throughput.

1. INTRODUCTION
Low Density Parity Codes (LDPC) codes, introduced by Gallager in 1962 [6], have a decoding performance which is very close to the Shannon limit [7]. Due to their high error correction capability LDPC codes have been adopted by communication hand-held systems, e.g., DVB-S2 [1], 10GBase-T [4], with very powerful in terms of error correction it has high complexity and throughput. To evaluate the DBLS practical implications we selected a fully parallel implementation of a 64-bit Edge Memory (EM) [12] stochastic (1296, 648) LDPC decoder with 64-bit edge memory on a Virtex-7 FPGA. Experimental results indicate that our proposal requires 9% more FFs and 3% more LUTs while diminishing the energy consumption by 31-80% and providing 1.5-5.1x higher throughput.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

GLSVLSI’15, May 20–22, 2015, Pittsburgh, PA, USA.
Copyright © 2015 ACM 978-1-4503-3474-7/15/05 ...$15.00.
http://dx.doi.org/10.1145/2742060.2742117.

2. PROPOSED TECHNIQUE
The Stochastic Computing paradigm proposed by Gaines [8], Ribeiro [11], and Poppelbaum et al. [10] in 1967, represents a number \(x\) as a Stochastic Bitstream (SB) \(X = X_1X_2X_3...X_{L_s}\) such that \(P(X_i = 1) = x\) where \(0 \leq x \leq 1\) and \(L_s\) is the SB length. Thus an SB of \(L_s\) bits of \(N_i\) comprising bits being equal to 1, encodes a number \(x = \frac{N_i}{L_s}\). This number representation has two main advantages: (i) arithmetic computations can be done with low complexity hardware and (ii) has intrinsic high fault tolerant capabil-
This representation allows for the evaluation of complex arithmetic operations by using simple logic gates. For instance, we can use an AND gate to implement a multiplication as illustrated in Figure 1. Moreover, by relying on an identically-weight representation, SC has a high fault tolerant capability, e.g., flipping one bit at position $n$ of an SB with length $L_s$ causes only an $\frac{1}{2^n}$ difference irrespective of the $n$ value while the same bit flip in traditional binary representation creates a $2^{n-1}$ difference. The main SC drawbacks are: (i) long computation latency due to the required long SB representations and (ii) inaccurate computation results caused by the non-unique number representation. For example, illustrated in Figure 1(a), when we change $B_3$ from 10 to 01, the 2nd AND gate produces an incorrect result; $\frac{1}{2}$ times $\frac{1}{2}$ should be equal to $\frac{1}{2^3}$ as produced by the 1st AND gate, however, the 2nd AND gate’s result is $\frac{1}{2^6}$. It can be easily observed that the computation error of Fig 1(a) is $\frac{1}{2^3} - \frac{1}{2^6} = \frac{3}{2^6}$ and the computation latency is 20 clock cycles. If we shorten the SB length $L_s$ from 20 to 10, the computation time shortens from 20 to 10 clock cycles as shown in Fig 1(b), the computation error will increase to $\frac{1}{2^3} - \frac{1}{2^5} = \frac{3}{2^5}$. This simple motivational example clearly demonstrates that various computation error vs latency trade-offs are possible in the SC context. Using longer (shorter) SBs produce more (less) accurate results with longer (shorter) latencies. Since the energy consumption proportionally depends on latency, trading accuracy for energy is also possible.

Trading accuracy at the right time is the key idea behind our proposed technique to diminish energy consumption while fulfilling Quality of Service (QoS) requirements. Trading accuracy in the moment accuracy is needed to fulfill QoS requirements is not a good option. To guaranty QoS while diminishing energy consumption, we need to trade accuracy only in the time moments when QoS can be provided without requiring high computation accuracy. To be able to properly adjust at run time the SB length to the channel status we perform at design time a decoder pre-characterization, i.e., we measure decoder’s Frame Error Rate (FER), Bit Error Rate (BER), energy/bit and throughput under SB length scaling on a variety of noisy channels.

The way we apply the DBLS technique during the LDPC decoding process is presented in Figure 2. We utilize the SNR Estimator present in any standard communication systems, to provide SNR information to the Adaptive Controller. This information is essentially needed to assist the Adaptive Controller in adjusting the SB length $L_s(t)$ based on the instantaneous channel condition represented by $\text{SNR}(t)$. Based on $\text{SNR}(t)$ and the pre-characterization results, the Adaptive Controller tunes $L_s(t)$ at $\text{SNR}(t)$ is done by selecting SB length from pre-characterization results that minimize some metrics (i) BER denoted as Best BER (BB), (ii) FER called as Best FER (BF), or (iii) energy/bit named as Lowest Energy (LE) strategy.

In the following we describe the main steps of the proposed adaptive stochastic LDPC decoder. For each received message we first determine the SB length value $L_s(t)$ according to the actual SNR value. This adaptation is done on-the-fly, no decoding stop nor restart is required. Subsequently, in the first $L_s(t)$ clock cycles, soft messages (i.e., probabilities) are converted by the Binary to Stochastic Converter (B2S) blocks to SBs with length $L_s(t)$, bit by bit in each clock cycle. In this first iteration, the multiplexers direct the SBs to the Stochastic Decoder Circuit (SDC), i.e.,
Variables Nodes (VNs) and Check Nodes (CNs), for processing. The $L_s(t)$-bit SB produced at each SDC output is stored in the Stochastic Stream Memory (SSM). This SSM should have adjustable size to accommodate our DBLS technique. The architecture of this adjustable SSM is presented in Figure 8. Different from fixed-length SSM, this tunable SSM makes use of a multiplexor to select $Q_{L_s(t)}$ for output. In this way an SB length of up to $L_{max}$ can be accommodated. In the subsequent $L_s(t)$ clock cycles, the multiplexer inside SSM in collaboration with the multiplexer in Figure 2 feeds back $L_s(t)$-bit stored in SSM as inputs to the SDC for subsequent iterations. At the end of each iteration, the Parity Check Circuit (PCC) verifies the parity constraints for early termination purpose. To do this verification, the PCC makes use of the hard bit values provided by the Stochastic to Binary Converter (S2B). Since DBLS technique needs to support adjustable SB length, the S2B needs to facilitate adjustable SB length hard decisions as presented in Figure 4. By letting the comparator making hard decisions controllable by the Control Unit as indicated in the figure, the variable SB length is supported by S2B. The SB iterative processing stops either when all check nodes are satisfied or when the maximum number of iterations is surpassed.

3. EVALUATION

Although our technique is in principle applicable to any SC machines for which the SB length vs performance relation can be pre-characterized, in this paper, we evaluate its effectiveness by applying it to a fully parallel implementation of the 64-bit EM [12] stochastic LDPC decoder reported as the best SC-based decoder in terms of decoding performance in, e.g., [9]. More precisely, a circuit decoding regular (1296, 648) LDPC code with VN degree of 3 ($d_v = 3$), and PN degree of 6 ($d_c = 6$) is utilized. Input messages in form of probabilities are 4-bit quantized. Each probability is represented by a 16-bit SB. The maximum number of iterations is set at 100. Each iteration takes 16 decoding cycles and for each decoding cycle one clock cycle is needed. The EM decoders with DBLS technique using proposed strategies, denoted as BB, BF, and LE, are evaluated against their original version, named as EM, in terms of: (i) decoding performance, specifically FER and BER, (ii) area - the number of occupied FPGA hardware resources (i.e., FFs and LUTs), (iii) throughput (Mb/s), and (iv) energy/bit (pJ/bit). An Additive White Gaussian Noise (AWGN) channel with Binary Phase Shift Keying (BPSK) modulation (with mapping $0 \rightarrow 1$ and $1 \rightarrow -1$) is generated to mimic a realistic telecommunication channel in our experiments. We apply the Noise-Dependent Scaling (NDS) technique [12] with $\alpha = 3$ and $\gamma = 6$ to the AWGN channel prior to decoding. Similar to [9], all evaluation figures are obtained from direct FPGA measurements.

To facilitate a real hardware evaluation, we develop an experimental hardware platform that consists of a laptop and a Xilinx VC707 board as depicted in Figure 5. The laptop is dedicated to: (i) designing the hardware platform targeting Xilinx Virtex-7 FPGA: XC7VX485TFFG1761-2 inside the Xilinx board and generating the bitstream files; (ii) downloading the bitstream files for FPGA hardware and the software files for the MicroBlaze through the USB JTAG interface; (iii) monitoring/capturing the number of iterations and decoding outcomes, FER, and BER through the USB UART, and (iv) measuring Energy/bit using the Fusion Digital Power Designer from Texas Instrument through Texas Instrument USB Interface adapter by reading PMBus, and accessing Power Supply Monitor and Controller inside the board. To build the evaluation hardware support (e.g., AWGN channel emulator, BPSK modulator, and other functionality for evaluation purpose) and the decoders, the Xilinx board is utilized. The MicroBlaze processor, resident on the FPGA, in collaboration with the LDPC Monitoring and Controller (MC) mainly acts as the evaluation hardware support by accessing software and data stored in BRAM and DDR3. The MC monitors the decoding outcome and conveys 4-bit quantized probabilities from the MicroBlaze to the evaluated decoder. The two possible outcomes of the decoding process are: (i) “success” and (ii) “give up”. If all check nodes are satisfied, the outcome is “success”. In the case that all check nodes cannot be satisfied after the maximum number of iterations, the system reports “give up”. The MicroBlaze processor utilizes these outcomes for computing the statistical results of the experiments (i.e. BER, FER). The laptop displays the experimental results received from FPGA board through USB UART.

The experimental results in terms of FER, BER, energy efficiency, and throughput are presented in Figures 6 and 7.
strategy provides the largest energy reduction (49-80%) and thanks to its SB length adaptability to SNR. (v) The LE energy by 31-77% and increasing the throughput by 1.5-4.3x. The BF strategy is minimizing FER while lowering the energy/bit by 32-76% when compared to EM due to the higher main clock frequency, which is 182.449 MHz, but it mainly designed to minimize BER, it also reduces the energy. When implemented on a Virtex-7 FPGA, despite its area overhead of 9% more FFs and 3% more LUTs, the technique outperforms the best equivalent state of the art counterpart in terms of energy efficiency by 31-80% and throughput by 1.5-5.1x while fulfilling the QoS requirements.

5. REFERENCES

[4] Ieee p802.3an (10gbase-t) task force.

Acknowledgment

This work was supported by the Seventh Framework Programme of the European Union, under the Grant Agreement number 309129 (i-RISC project).