Approximate computing for complexity reduction in timing synchronization

This paper presents the design and performance evaluation of a reduced-complexity algorithm for timing synchronization. The complexity reduction is obtained through approximate computing, which lightens the computational load of the algorithm with a minimal loss in precision. Timing synchronization for wideband code division multiple access (W-CDMA) systems is used as the case study, and experimental results show that the proposed approach delivers performance similar to that of traditional approaches. At the same time, the proposed algorithm cuts the computational complexity of the traditional algorithm by 20%. Furthermore, power estimation on a reference architecture shows that this 20% complexity reduction corresponds to a total power saving of 45%.


Introduction
The continuous growth in demand for bandwidth and mobility has contributed, in the past decades, to the development of a wide set of communication standards. Thus, multistandard, multimode transceivers have become the focal point of radio architects. Flexible radios were then introduced to tackle these new radio requirements [1]. In fact, a single architecture that can modify its behaviour at run-time and connect to different radio systems is appealing both for end users and for the integrated circuit (IC) industry: end users can carry a single device instead of a set of devices, while IC manufacturers can spread the design costs of a single platform over a wider range of applications. However, the introduced flexibility does not come for free.
A higher degree of flexibility is generally obtained through software layers (e.g. software-defined radio (SDR) [2]) or through reconfigurable hardware (e.g. reconfigurable radio systems [3]). These solutions are less power efficient than traditional ASICs. Furthermore, current and future platforms have to rely on the latest available silicon technology in order to reach the computational power required by the latest standards. However, in ultra-deep-submicron (UDSM) technology nodes, power consumption is no longer a secondary constraint. In fact, power consumption plays a major role in system reliability for UDSM technology nodes [4]. Thus, the combination of power inefficiency at the architectural level and power issues at the circuit level requires optimization at every possible level, from the algorithm design down to the physical implementation of the system.
*Correspondence: airoldi.roberto@gmail.com. 1 Department of Electronics and Communications Engineering, Tampere University of Technology, Tampere FI-33720, Finland. Full list of author information is available at the end of the article.
At the algorithm level, many things can be done in order to improve the power efficiency. In fact, an algorithm that is able to take full advantage of the underlying architecture can more efficiently utilize the resources made available by the hardware. In flexible radios, the flexibility of the underlying hardware is generally utilized to enable swap of functionalities, protocol updates and so on. However, the given flexibility can also be utilized at algorithm level to enable more efficient power implementations. New solutions at algorithm level have to be found by analysing the application domain together with the hardware platform.
Certain subsets of some applications contain kernels that do not require a high level of correctness. As an example, a system might need to know only whether a certain variable has risen above a given threshold, and might find no useful information in the actual value of the variable. An accurate computation of the value requires the system to spend a certain amount of energy to guarantee the correctness of the computation. By dropping this redundant control, the system could provide an approximate value of the variable and save valuable energy. Many studies have shown that approximate computing is useful in a variety of application scenarios [5][6][7][8][9].
http://asp.eurasipjournals.com/content/2014/1/155
This research paper presents the design of a reduced-complexity matched filter for the implementation of the timing synchronization block. The aim of the design was to reduce the computational complexity while maintaining the overall performance at an acceptable level by dynamically adapting the matched-filter performance according to the estimated signal-to-noise ratio (SNR). The complexity reduction is based on the approximate computing paradigm. The proposed algorithm is evaluated in different working scenarios (noise conditions) in order to validate its functionality and its performance.

Timing synchronization algorithm
Timing synchronization is one of the most critical kernels at the receiver side. In fact, the overall performance of the system is highly dependent on the synchronization stage. In orthogonal frequency division multiplexing (OFDM) systems, as an example, errors in the timing synchronization lead to inter-symbol interference while frequency offsets cause inter-subcarrier interference [10]. Different communication systems rely on different synchronization algorithms. In any case, the synchronization algorithm is built around two major computational kernels: correlation and autocorrelation functions. These kernels are most often implemented as a matched filter. Depending on the chosen communication system, correlation is preferred over autocorrelation, or vice versa. Moreover, different protocols might require different lengths of the (auto)correlation sequence.
Programmable architectures for the implementation of these kernels support different sequence lengths and can switch between the computation of correlation and autocorrelation functions, in order to support many different standards. This inherent flexibility can then be exploited at the algorithm level to reduce the complexity of the matched filter. In particular, approximate computing can be used to reduce the amount of computation required to calculate the (auto)correlation value.
As a proof of concept, the design of the reduced-complexity matched filter proposed in this work is based on the specification for the wideband code division multiple access (W-CDMA) system [11]. However, the proposed methods could also be ported to other communication systems.

Timing synchronization in W-CDMA systems
Timing synchronization for W-CDMA systems can be divided into two main algorithms: 1) the cell search algorithm [12] and 2) the multipath delay estimation algorithm [13].

Cell search algorithm
The cell search algorithm keeps the mobile terminal connected to the base station that offers the best SNR. The algorithm is divided into three consecutive steps: slot synchronization, frame synchronization and scrambling code identification. Each step computes a correlation between the incoming signal and a known sequence. Regardless of the step considered, the correlation function can be written as

$$C_n = \sum_{i=0}^{L-1} R_{n-i} \, \mathrm{Coeff}^{*}_{i} \qquad (1)$$

where $C_n$ is the nth sample of the correlation value, $L$ is the length of the known sequence (256 in this case), $R_{n-i}$ is the (n − i)th sample of the incoming signal and $\mathrm{Coeff}^{*}_{i}$ is the complex conjugate of the ith sample of the known sequence. Finally, the detection of peaks in the correlation sequence defines the alignment of slots and frame structures.
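The correlation of Equation 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function name `correlate` is hypothetical, and the summation is assumed to run over i = 0, ..., L − 1 with samples stored in plain Python lists.

```python
def correlate(received, coeff, n):
    """Return C_n = sum over i of R[n - i] * conj(Coeff[i]).

    Assumption: i runs from 0 to L - 1, where L = len(coeff), so the
    output is only defined for L - 1 <= n < len(received).
    """
    L = len(coeff)
    if n < L - 1 or n >= len(received):
        raise ValueError("n must satisfy L - 1 <= n < len(received)")
    # One multiply-accumulate (MAC) operation per filter tap.
    return sum(received[n - i] * coeff[i].conjugate() for i in range(L))
```

With the taps `[1, -1, 1, 1]` and a received stream that contains the time-reversed taps surrounded by zeros, the magnitude of `correlate` peaks at the sample where the whole sequence is aligned with the filter, which is the peak that the slot and frame alignment relies on.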

Multipath estimation algorithm
The multipath estimation algorithm estimates and compensates the multipath components. Multipath estimation is performed in two steps: i) the identification of slot boundaries and ii) the evaluation of the multipath components via the computation of a noncoherent average of the correlation function over the following N slots from the slot identification point. All these steps rely on Equation 1.
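The noncoherent averaging step can be illustrated with the short sketch below. It assumes that the correlation values of each slot are already available as a list indexed by delay, and that the noncoherent combination uses the squared magnitude; the source does not state which magnitude measure is used, so this choice and the name `noncoherent_average` are assumptions.

```python
def noncoherent_average(slot_correlations):
    """Average |C|^2 per delay over N slots.

    slot_correlations: list of N lists, one per slot, each holding the
    complex correlation values for the same set of delays.
    Assumption: squared magnitude is used as the noncoherent measure.
    """
    n_slots = len(slot_correlations)
    if n_slots == 0:
        raise ValueError("at least one slot is required")
    n_delays = len(slot_correlations[0])
    if any(len(slot) != n_delays for slot in slot_correlations):
        raise ValueError("all slots must cover the same delays")
    # The phase is discarded before averaging, so slot-to-slot phase
    # rotations cannot cancel a genuine multipath component.
    return [
        sum(abs(slot[d]) ** 2 for slot in slot_correlations) / n_slots
        for d in range(n_delays)
    ]
```

For two slots with opposite phase, such as `[[1+0j, 0j, 2j], [-1+0j, 0j, -2j]]`, a coherent sum would cancel every component, whereas the noncoherent average keeps the energy at the first and third delays.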

Related work
Previous research has addressed the improvement of W-CDMA timing synchronization in its different parts and targeting different aspects of the algorithms. However, most of the proposed solutions are based on optimization done at architectural level or at circuit level. Li et al. in [14] present a low power design for the W-CDMA cell search. The work introduces a robust complexity algorithm for synchronization under large frequency and clock errors. The implementation of the algorithm is based on a pipelined search in order to increase the performance of the search [15]. Finally, the implementation on CMOS technology shows an achieved power saving of 51% with an area reduction of 31.9% over traditional solutions.
Korde et al. in [16] propose an improved design for the matched filter utilized in the cell search and the multipath estimation algorithms. In particular, the authors suggest a hierarchical matched filter able to reduce the utilization of hardware resources.
Figure 1 The D-decimated matched filter: the input sequence $R_n$ and the known sequence are decimated by a factor D. Therefore, a fraction 1/D of the MAC operations is pruned from the FIR filter, leading to a reduced computational load.
The solutions presented above propose implementation techniques for the complexity reduction or power reduction of the algorithm. In [17], a preliminary design of a reduced-complexity algorithm for timing synchronization is proposed. In particular, two solutions were evaluated for the design of the matched filter: i) a pre-evaluation of correlation values and ii) the decimation of the sequence in order to lighten the computational complexity of the matched filter. This work focuses on the second approach, since it is more suitable for implementation on programmable hardware accelerators. In fact, the pre-evaluation solution is based on if-then-else statements and could potentially degrade the system performance.

Proposed algorithm
The proposed algorithm is based on the approximate computing paradigm. In particular, the computation of the correlation values is not exact but roughly estimated. In fact, the actual correlation value does not carry any meaningful information for the timing synchronization: the information resides in the fact that a particular value of the correlation function has risen over a given threshold. Through the computation of an approximate correlation function, it is then possible to obtain a reduction of the algorithm complexity and thus, a more energy-efficient solution.
Equation 1 can be approximated by dropping some of the multiply-accumulate (MAC) operations, which reduces the overall computational complexity. To keep the pruning pattern regular, this work removes operations at fixed intervals (e.g. one MAC operation out of every D). Figure 1 presents a schematic view of the algorithm's data flow. The incoming data $R_n$ is decimated by a factor D and then fed to the correspondingly decimated version of the original matched filter. In other words, one sample out of every D is discarded from both the incoming data stream and the known sequence. This prunes a fraction 1/D of the MAC operations from Equation 1. As an example, for a decimation factor D = 2, Equation 1 can be rewritten as

$$C_n = \sum_{i=0}^{L'-1} R_{n-2i} \, \mathrm{Coeff}^{*}_{2i} \qquad (2)$$

where $L' = L/2$. The advantages introduced by this approach are twofold: i) the overall computation of the algorithm is reduced, which lowers the energy spent on the computation of a single correlation value, and ii) the effective sampling rate of the system can be reduced by a fraction 1/D, leading to a further energy saving. Moreover, the reduction of the sample rate could then be paired with circuit-level solutions, such as dynamic voltage and frequency scaling (DVFS), to further enhance the energy and power efficiency [18].
The detection threshold plays a fundamental role in the overall performance of the system: an overly high threshold would lead to miss-detections, while an overly low threshold would increase the false detection rate. Therefore, the detection threshold has to be chosen carefully.
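The pruning pattern can be made concrete with a small Python sketch. The names `kept_indices` and `approx_correlate` are hypothetical, and the choice of which tap to drop within each group of D (here the last one) is an assumption; the source only states that one MAC operation out of every D is removed.

```python
def kept_indices(L, D):
    """Tap indices that survive when one tap out of every D is dropped.

    Assumption: the last tap of each group of D is the one removed.
    For D = 2 this keeps the even indices 0, 2, 4, ...
    """
    if D < 2:
        raise ValueError("D must be at least 2")
    return [i for i in range(L) if (i + 1) % D != 0]


def approx_correlate(received, coeff, n, D):
    """Approximate C_n using only the taps kept by kept_indices."""
    L = len(coeff)
    if n < L - 1 or n >= len(received):
        raise ValueError("n must satisfy L - 1 <= n < len(received)")
    # Same MAC as the exact correlation, but pruned taps are skipped.
    return sum(
        received[n - i] * coeff[i].conjugate() for i in kept_indices(L, D)
    )
```

For L = 256, `kept_indices(256, 2)` keeps 128 taps and `kept_indices(256, 5)` keeps 205 taps, so roughly one half and one fifth of the MAC operations are pruned, respectively.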

Definition of the threshold
In order to identify the threshold, the distribution of the correlation values that are associated with a detection and the distribution of the values that are not associated with a detection were studied. The study of these two distributions, for different SNR levels and for different decimation factors D, gives important information for the definition of the threshold. As an example, Figures 2 and 3 present the distributions obtained for SNR levels of −18.5 and −8 dB, respectively. As shown by the figures, for extremely low SNR levels, the two distributions overlap partially, if not totally. Therefore, no threshold can unequivocally separate the two distributions, and either false detections or miss-detections are expected, depending on how the threshold is set. However, for higher SNR levels, it is possible to separate the two distributions more effectively. Through an empirical study of the two distributions, the threshold for different SNR levels and for different D factors was set such that the probability of detection would be maximized. The threshold was set as the average of the minimum correlation value of the positive-match distribution and the maximum correlation value of the no-match distribution. In the case of overlapping distributions, the threshold set in this way would still minimize the miss-detection error.
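The threshold rule described above (the average of the minimum positive-match value and the maximum no-match value) can be sketched as follows. The function names are hypothetical, and the inputs are assumed to be lists of correlation magnitudes collected empirically for one SNR level and one decimation factor D.

```python
def detection_threshold(match_values, no_match_values):
    """Midpoint between min(match) and max(no match)."""
    if not match_values or not no_match_values:
        raise ValueError("both distributions must be non-empty")
    return (min(match_values) + max(no_match_values)) / 2.0


def distributions_overlap(match_values, no_match_values):
    """True when no threshold can separate the two sets perfectly."""
    return min(match_values) <= max(no_match_values)
```

For well-separated samples such as `[10, 12, 14]` and `[1, 3, 4]`, the threshold falls at 7.0 and no overlap is reported. For `[5, 9]` and `[2, 6]` the sets overlap, and the threshold of 5.5 necessarily leaves some values on the wrong side.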

Experimental results
The performance of the proposed approach was compared to the performance of a traditional matched filter to evaluate the relative performance and to validate the proposed design. The algorithms were tested in Matlab to obtain statistical figures for the algorithm performance. In particular, the analysis considered the implementation of the proposed algorithm for different decimation factors. Finally, on the basis of the statistical information, it is possible to realize an adaptable implementation of the proposed approach able to achieve the best performance at the lowest computational profile.
The proposed design for the threshold identification was tested at different SNR levels in order to determine the behaviour of the algorithm in different operating conditions. In particular, SNR levels from −20 dB to −8 dB with a step size of 1.5 dB were considered. This SNR range was chosen in order to test the proposed algorithm in worst-case scenarios. Moreover, it makes it possible to identify the lowest SNR level at which the performance degrades significantly compared to the reference design. The simulations were iterated N times for each SNR level to obtain significant statistics on the performance. The number of iterations was determined empirically via preliminary simulations, by analysing the convergence point of the number of miss-detections as a function of the number of iterations. Figure 4 shows the number of miss-detections as a function of the number of iterations at an SNR level of −15 dB. For N greater than 1,000, the percentage of miss-detections oscillates around an average value, which means that increasing the number of iterations does not give any further information for the statistical evaluation of the algorithm. Hence, N was fixed to 1,000.
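The simulation sweep and the convergence check can be illustrated with the sketch below. It assumes that both endpoints of the SNR range are included and that each iteration yields a boolean flag marking a miss-detection; the names `snr_levels` and `running_miss_rate` are hypothetical.

```python
def snr_levels(start_db=-20.0, stop_db=-8.0, step_db=1.5):
    """SNR grid in dB, assuming both endpoints are included."""
    count = int(round((stop_db - start_db) / step_db)) + 1
    return [start_db + k * step_db for k in range(count)]


def running_miss_rate(missed):
    """Percentage of miss-detections after each iteration.

    missed: sequence of booleans, True when an iteration missed the peak.
    A curve that flattens out suggests that more iterations add little
    statistical information.
    """
    rates = []
    misses = 0
    for k, was_missed in enumerate(missed, start=1):
        if was_missed:
            misses += 1
        rates.append(100.0 * misses / k)
    return rates
```

With the default arguments, `snr_levels()` returns nine levels from −20 dB to −8 dB, which include the −18.5 dB and −8 dB operating points.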

Performance analysis
For the performance analysis, two parameters were considered: the SNR level and the decimation factor D. In particular, the considered decimation factors D ranged from 2 to 5. Decimation factors larger than D = 5 would provide a limited complexity reduction, which would diminish the benefit of the proposed solution. Figure 5 shows the algorithm performance for different decimation factors as a function of the SNR. Moreover, the figure reports the performance of the original algorithm. The figure shows that, for a decimation factor D = 2, the algorithm performance is highly degraded. This is due to the almost complete overlap of the two distributions for almost all of the SNR levels considered. The opposite holds for larger decimation factors: for D = 4 or D = 5, the algorithm performance is in line with the traditional implementation of the matched filter.

Power consumption
The run-time adaptation of the decimation factor D produces a dynamic workload for the architecture responsible for the algorithm implementation. In this scenario, dynamic power management techniques (e.g. DVFS) can potentially have a large impact on the power consumption, enabling the architecture to always run at the lowest power profile and thus minimizing power and energy consumption. In order to study the power saving obtained by coupling the proposed algorithm with DVFS, the implementation of the proposed algorithm on a reference architecture was considered. Details about the reference architecture can be found in [19].
The reference architecture was characterized at different supply voltages (Vdd) in order to relate the maximum operating frequency and the power consumption to the different Vdd levels. Table 1 highlights the relation between the supply voltage, the maximum operating frequency and the total power consumption. The characterization was obtained after logic synthesis on a 45-nm technology node. The supply voltage, maximum operating frequency and total power consumption values were normalized to their respective maximum values.
It is now possible to relate the algorithm complexity, the required working frequency and the estimated power consumption. Table 2 summarizes the achieved performance of the proposed algorithm in terms of complexity reduction and power consumption. The table shows that the complexity reduction and the operating frequency scale linearly with the decimation factor, while the total power consumption scales super-linearly with the decimation factor D.
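One way to see why the power saving can exceed the complexity reduction is the textbook dynamic-power model, in which power is proportional to the operating frequency times the square of the supply voltage. The sketch below uses that model only as an illustration: it ignores leakage, the voltage ratio in the usage note is hypothetical, and it does not reproduce the synthesis-based characterization of the reference architecture.

```python
def normalized_dynamic_power(f_ratio, v_ratio):
    """Dynamic power relative to the maximum operating point.

    Assumption: P is proportional to f * Vdd^2 (textbook CMOS
    dynamic-power model, leakage ignored). Both ratios are normalized
    to their maximum values, as in the characterization tables.
    """
    if not (0.0 < f_ratio <= 1.0 and 0.0 < v_ratio <= 1.0):
        raise ValueError("ratios must lie in the interval (0, 1]")
    return f_ratio * v_ratio ** 2
```

If the frequency is lowered to 0.8 of its maximum and the supply voltage is kept unchanged, this model gives a relative power of 0.8, a saving equal to the frequency reduction. If the lower frequency also allows a hypothetical voltage ratio of 0.9, the relative power drops to about 0.65, so the saving grows faster than the reduction in workload.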

Conclusions
In this research paper, we have proposed a reduced-complexity algorithm for timing synchronization. The complexity reduction is based on the approximate computing paradigm. Timing synchronization in the W-CDMA system was considered as the case study. The study of different decimation factors D showed how the pruning of MAC operations impacts the overall performance of the system. Decimation factors of 4 and 5 proved to deliver performance in line with the traditional algorithm. At the same time, the computational load was reduced by a fraction 1/D, leading to more computationally efficient solutions. The algorithm performance, together with the power characterization of a reference architecture for the implementation of the proposed algorithm, showed that power consumption is reduced by 45% for a decimation factor D = 5.