# Approximate computing for complexity reduction in timing synchronization

- Roberto Airoldi^{1} (Email author)
- Fabio Campi^{2}
- Jari Nurmi^{1}

**2014**:155

https://doi.org/10.1186/1687-6180-2014-155

© Airoldi et al.; licensee Springer. 2014

**Received:** 28 February 2014

**Accepted:** 8 October 2014

**Published:** 16 October 2014

## Abstract

This paper presents the design and performance evaluation of a reduced-complexity algorithm for timing synchronization. The complexity reduction is obtained through approximate computing, which lightens the computational load of the algorithm with a minimal loss in precision. Timing synchronization for wideband code division multiple access (W-CDMA) systems is used as the case study, and experimental results show that the proposed approach delivers performance similar to that of traditional approaches while cutting the computational complexity of the traditional algorithm by 20%. Furthermore, power estimation on a reference architecture shows that a 20% complexity reduction corresponds to a total power saving of 45%.

## Introduction

The continuous growth in demand for bandwidth and mobility has contributed, over the past decades, to the development of a wide set of communication standards. As a result, multistandard, multimode transceivers have become the focal point of radio architects, and flexible radios were introduced to tackle these new requirements [1]. A single architecture that can modify its behaviour at run-time and connect to different radio systems is appealing both for end users and for the integrated circuit (IC) industry: end users can carry a single device instead of a set of devices, while IC manufacturers can spread the design costs of a single platform over a wider range of applications. However, the introduced flexibility does not come for free.

A higher degree of flexibility is generally obtained through software layers (e.g. software-defined radio (SDR) [2]) or through reconfigurable hardware (e.g. reconfigurable radio systems [3]). These solutions are less power efficient than traditional ASICs. Furthermore, today's and tomorrow's platforms have to rely on the latest available silicon technology in order to reach the computational power required by the latest standards. In ultra-deep-submicron (UDSM) technology nodes, however, power consumption is no longer a secondary constraint: it plays a major role in system reliability [4]. Thus, the combination of power inefficiency at the architectural level and power issues at the circuit level requires optimization at every possible level, from the algorithm design down to the physical implementation of the system.

At the algorithm level, much can be done to improve power efficiency. An algorithm that takes full advantage of the underlying architecture can utilize the resources made available by the hardware more efficiently. In flexible radios, the flexibility of the underlying hardware is generally used to enable the swapping of functionalities, protocol updates and so on. However, the same flexibility can also be exploited at the algorithm level to enable more power-efficient implementations. New solutions at the algorithm level have to be found by analysing the application domain together with the hardware platform.

Kernels that do not require a high level of correctness are present in certain subsets of some applications. As an example, a system might be interested in knowing if a certain variable has risen over a given threshold or not. At the same time, the system might not find any useful information in the knowledge of the actual value of the variable. An accurate computation of the value requires the system to spend a certain amount of energy for guaranteeing the correctness of the computation. Dropping this *redundant control*, the system could provide an approximate value of the variable saving valuable energy. Many studies have shown that approximate computing is useful in a variety of application scenarios [5–9].

This research paper presents the design of a reduced-complexity matched filter for the implementation of the timing synchronization block. The aim of the design was to reduce the computational complexity while maintaining the overall performance at an acceptable level by dynamically adapting the matched-filter performance according to the estimated signal-to-noise-ratio (SNR). The complexity reduction is based on the approximate computing paradigm. The proposed algorithm is evaluated in different working scenarios (noise conditions) in order to validate its functionality and its performance.

## Timing synchronization algorithm

Timing synchronization is one of the most critical kernels at the receiver side. In fact, the overall performance of the system is highly dependent on the synchronization stage. In orthogonal frequency division multiplexing (OFDM) systems, as an example, errors in the timing synchronization lead to inter-symbol interference while frequency offsets cause inter-subcarrier interference [10].

Different communication systems rely on different synchronization algorithms. In any case, the synchronization algorithm is built around two major computational kernels: correlation and autocorrelation functions. These kernels are most often implemented as a matched filter. Depending on the chosen communication systems, the utilization of correlation is preferred over the autocorrelation, or vice versa. Moreover, different protocols might require different lengths of the (auto)correlation sequence.

Programmable architectures for the implementation of these kernels allow the use of different sequence lengths, as well as switching between the computation of correlation and autocorrelation functions, in order to support many different standards. This inherent flexibility can then be exploited at the algorithm level to reduce the complexity of the matched filter. In particular, approximate computing can be used to reduce the amount of computation required to calculate the (auto)correlation value.

As a proof of concept, the design of the reduced-complexity matched filter proposed in this work is based on the specification for the wideband code division multiple access (W-CDMA) system [11]. However, the proposed methods could also be ported to other communication systems.

### Timing synchronization in W-CDMA systems

Timing synchronization for W-CDMA systems can be divided into two main algorithms: 1) *the cell search algorithm* [12] and 2) *the multipath delay estimation algorithm* [13].

#### Cell search algorithm

The cell search algorithm is built on the correlation of the incoming signal with a known sequence:

$$C_{n} = \sum_{i=0}^{L-1} R_{n-i}\,{\text{Coeff}}_{i}^{\ast} \qquad (1)$$

where *C*_{n} is the *n*th sample of the correlation value, *L* is the length of the known sequence (256 in this case), *R*_{n-i} is the (*n*-*i*)th sample of the incoming signal and ${\text{Coeff}}_{i}^{\ast}$ is the complex conjugate of the *i*th sample of the known sequence. Finally, the detection of peaks in the correlation sequence defines the alignment of slots and frame structures.
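As a rough illustration of the correlation described above, the following Python sketch computes each correlation sample as the sum of the incoming samples multiplied by the conjugated coefficients, and then flags the samples whose magnitude exceeds a threshold. It is only a sketch under stated assumptions, not the authors' implementation: the function names, the 8-sample sequence (the real one has 256 samples), the threshold value and the choice of storing the coefficients as the time-reversed known sequence are all hypothetical.

```python
def correlate(received, coeffs):
    """Return C_n = sum_i received[n - i] * conj(coeffs[i]) for every n
    where the whole window fits inside the received samples."""
    length = len(coeffs)
    out = []
    for n in range(length - 1, len(received)):
        acc = 0j
        for i in range(length):
            acc += received[n - i] * coeffs[i].conjugate()
        out.append(acc)
    return out


def detect_peaks(corr, threshold):
    """Indices of the correlation samples whose magnitude exceeds the threshold."""
    return [k for k, c in enumerate(corr) if abs(c) > threshold]


# Hypothetical 8-sample known sequence (illustration only).
known_sequence = [1 + 1j, 1 - 1j, -1 + 1j, 1 + 1j, -1 - 1j, 1 - 1j, 1 + 1j, -1 + 1j]

# Assumption: the filter coefficients are the known sequence in reversed order,
# so that the sum peaks when the window is aligned with the sequence.
coeffs = list(reversed(known_sequence))

# Noise-free received stream with the sequence starting at sample 5.
received = [0j] * 5 + known_sequence + [0j] * 7

corr = correlate(received, coeffs)
peaks = detect_peaks(corr, threshold=15.0)
print(peaks)
```

In this noise-free toy case the only sample above the threshold is the one where the window is aligned with the embedded sequence; with noise, the choice of threshold becomes the critical design parameter.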

#### Multipath estimation algorithm

The multipath estimation algorithm estimates and compensates the multipath components. Multipath estimation is performed in two steps: i) the identification of slot boundaries and ii) the evaluation of the multipath components *via* the computation of a noncoherent average of the correlation function over the following *N* slots from the slot identification point. All these steps rely on Equation 1.
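The noncoherent averaging step can be sketched as follows. This is a minimal Python illustration, assuming that the correlation values are already available as a list and that "noncoherent" means averaging squared magnitudes at the same offset within each slot; the function names, the slot length and the sample values are hypothetical.

```python
def noncoherent_average(corr, slot_len, start, n_slots):
    """Average |C|^2 at each offset within a slot over n_slots consecutive
    slots, starting from the slot boundary at index `start`."""
    profile = [0.0] * slot_len
    for s in range(n_slots):
        base = start + s * slot_len
        for k in range(slot_len):
            profile[k] += abs(corr[base + k]) ** 2
    return [p / n_slots for p in profile]


def strongest_paths(profile, count):
    """Offsets of the `count` strongest entries of the averaged profile."""
    order = sorted(range(len(profile)), key=lambda k: profile[k], reverse=True)
    return order[:count]


# Hypothetical correlation values for 3 slots of 4 samples each. The phase of
# each path changes from slot to slot, so a coherent sum would partly cancel.
corr = [
    2, 0, 1j, 0,     # slot 0
    -2, 0, 1, 0,     # slot 1
    2j, 0, -1j, 0,   # slot 2
]

profile = noncoherent_average(corr, slot_len=4, start=0, n_slots=3)
paths = strongest_paths(profile, count=2)
print(profile, paths)
```

Averaging magnitudes rather than complex values keeps the estimate insensitive to phase rotations between slots, which is the reason for preferring a noncoherent average here.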

#### Related work

Previous research has addressed the improvement of W-CDMA timing synchronization in its different parts and targeting different aspects of the algorithms. However, most of the proposed solutions are based on optimization done at architectural level or at circuit level.

Li et al. in [14] present a low power design for the W-CDMA cell search. The work introduces a robust complexity algorithm for synchronization under large frequency and clock errors. The implementation of the algorithm is based on a pipelined search in order to increase the performance of the search [15]. Finally, the implementation on CMOS technology shows an achieved power saving of 51% with an area reduction of 31.9% over traditional solutions.

Korde et al. in [16] propose an improved design for the matched filter utilized in the cell search and the multipath estimation algorithms. In particular, the authors suggest a hierarchical matched filter able to reduce the utilization of hardware resources.

The solutions presented above propose implementation techniques for the complexity reduction or power reduction of the algorithm. In [17], a preliminary design of a reduced complexity algorithm for timing synchronization is proposed. In particular, two solutions were evaluated for the design of the matched filter: i) the utilization of a pre-evaluation of correlation values and ii) the decimation of the sequence in order to lighten the computational complexity of the matched filter. In this research work, we will focus on the second approach, since it is more suitable for the implementation over programmable hardware accelerators. In fact, the pre-evaluation solution is based on *if-then-else* statements and could potentially degrade the system performance.

## Proposed algorithm

The proposed algorithm is based on the approximate computing paradigm. In particular, the computation of the correlation values is not exact but roughly estimated. In fact, the actual correlation value does not carry any meaningful information for the timing synchronization: the information resides in the fact that a particular value of the correlation function has risen over a given threshold. Through the computation of an approximate correlation function, it is then possible to obtain a reduction of the algorithm complexity and thus, a more energy-efficient solution.

The approximation is obtained by decimating both the incoming data and the known sequence by a given decimation factor (*D*). To better highlight the overall concept, Figure 1 presents a schematic view of the algorithm's data-flow. The incoming data *R*_{n} is decimated by a factor *D* and then fed to the corresponding decimated version of the original matched filter. That is, one sample in every *D* is discarded from both the incoming data stream and the known sequence. This prunes *L*/*D* MAC operations from Equation 1. As an example, for a *D* factor equal to two, Equation 1 can be rewritten as

$$C_{n} = \sum_{i=0}^{L^{\prime}-1} R_{n-2i}\,{\text{Coeff}}_{2i}^{\ast} \qquad (2)$$

where *L*^{′} = *L*/2. The advantages introduced by this approach are twofold: i) the overall computation of the algorithm is reduced, which lowers the energy spent on the computation of a single correlation value, and ii) the actual sampling rate of the system can be reduced by a fraction 1/*D* (that is, to 1 - 1/*D* of the original rate), leading to a further energy saving. Moreover, the reduction of the sample rate can be paired with circuit-level solutions, such as dynamic voltage and frequency scaling (DVFS), to further enhance the energy and power efficiency [18].
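The pruning of MAC operations can be sketched in Python as below. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and the choice of which sample in each group of *D* is dropped (here the last one, so that *D* = 2 keeps the even-indexed taps) is an assumption.

```python
def kept_taps(length, d):
    """Tap indices that survive when one tap in every group of d is dropped.
    Assumption: the last tap of each group is the one discarded."""
    return [i for i in range(length) if i % d != d - 1]


def decimated_correlation(received, coeffs, n, d):
    """Approximate C_n using only the surviving taps."""
    acc = 0j
    for i in kept_taps(len(coeffs), d):
        acc += received[n - i] * coeffs[i].conjugate()
    return acc


# Fraction of MAC operations left for a 256-tap filter.
for d in (2, 3, 4, 5):
    kept = len(kept_taps(256, d))
    print(d, kept, kept / 256)

# Toy check with all-ones data: the result equals the number of kept taps.
received = [1 + 0j] * 8
coeffs = [1 + 0j] * 8
half = decimated_correlation(received, coeffs, n=7, d=2)
three_quarters = decimated_correlation(received, coeffs, n=7, d=4)
```

Because the approximate value is computed from fewer terms, its magnitude is lower than the exact one, which is why the detection threshold has to be defined separately for each decimation factor.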

The definition of the detection threshold plays a fundamental role in the overall performance of the system: a threshold that is too high leads to miss-detections, while a threshold that is too low increases the false detection rate. The detection threshold therefore has to be planned carefully.

### Definition of the threshold

The distribution of the correlation values in the presence of a positive match and in the absence of a match, for different SNR levels and decimation factors *D*, gives important information for the definition of the threshold. As an example, Figures 2 and 3 present the distributions obtained for SNR levels of -18.5 and -8 dB, respectively. As shown by the figures, for extremely low SNR levels, the two distributions overlap partially, if not totally. It is therefore not possible to identify a threshold that unequivocally separates the two distributions, and either false detections or miss-detections are to be expected, depending on how the threshold is set. For higher SNR levels, however, the two distributions can be separated more effectively. Through an empirical study of the two distributions, the threshold for different SNR levels and for different *D* factors was set so as to maximize the probability of detection. The threshold was set as the average of the minimum correlation value of the positive-match distribution and the maximum value of the no-match distribution. In the case of overlapping distributions, a threshold set in this way would still minimize the miss-detection error.
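The threshold rule stated above (the average of the minimum positive-match value and the maximum no-match value) can be written as a short Python sketch. The function names and all sample values are hypothetical and serve only to show the separable and the overlapping case.

```python
def detection_threshold(positive_match, no_match):
    """Midpoint between the smallest positive-match correlation value and
    the largest no-match correlation value."""
    return (min(positive_match) + max(no_match)) / 2.0


def count_errors(positive_match, no_match, threshold):
    """Return (miss-detections, false detections) for a given threshold."""
    misses = sum(1 for v in positive_match if v <= threshold)
    false_detections = sum(1 for v in no_match if v > threshold)
    return misses, false_detections


# Hypothetical correlation magnitudes at a high SNR: the distributions separate.
separable_pos = [180.0, 210.0, 195.0]
separable_neg = [40.0, 95.0, 60.0]
thr_separable = detection_threshold(separable_pos, separable_neg)

# Hypothetical correlation magnitudes at a low SNR: the distributions overlap.
overlap_pos = [80.0, 150.0]
overlap_neg = [30.0, 100.0]
thr_overlap = detection_threshold(overlap_pos, overlap_neg)

print(thr_separable, count_errors(separable_pos, separable_neg, thr_separable))
print(thr_overlap, count_errors(overlap_pos, overlap_neg, thr_overlap))
```

When the two sets are separable the midpoint gives no errors on the observed values; when they overlap, some miss-detections and false detections remain whatever threshold is chosen.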

## Experimental results

The performance of the proposed approach was compared to the performance of a traditional matched filter to evaluate the relative performance and to validate the proposed design. The algorithms were tested in Matlab to obtain statistical figures for the algorithm performance. In particular, the analysis considered the implementation of the proposed algorithm for different decimation factors. Finally, on the basis of the statistical information, it is possible to realize an adaptable implementation of the proposed approach able to achieve the best performance at the lowest computational profile.

Each simulation was repeated *N* times for each SNR level to obtain significant statistics on the performance. The number of iterations was determined empirically *via* preliminary simulations, by analysing the convergence point of the number of miss-detections as a function of the number of iterations. Figure 4 shows the number of miss-detections as a function of the number of iterations at an SNR level of -15 dB. For *N* greater than 1,000, the percentage of miss-detections oscillates around an average value, which means that increasing the number of iterations further does not add information to the statistical evaluation of the algorithm. Hence, *N* was fixed to 1,000.
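The convergence check used to choose the number of iterations can be illustrated with a small Python sketch. The original experiments were run in Matlab; the code below is only a stand-in in which each iteration is replaced by a random draw with a hypothetical miss probability of 10%, and all names are assumptions.

```python
import random


def running_miss_rate(missed):
    """Percentage of miss-detections observed after each iteration."""
    rates = []
    misses = 0
    for count, was_missed in enumerate(missed, start=1):
        if was_missed:
            misses += 1
        rates.append(100.0 * misses / count)
    return rates


# Hypothetical stand-in for the simulation: every iteration is a miss with
# probability 0.1. A real run would call the synchronization chain instead.
rng = random.Random(0)
outcomes = [rng.random() < 0.1 for _ in range(2000)]
rates = running_miss_rate(outcomes)

# Compare how much the estimate moves early on and after 1,000 iterations.
early_spread = max(rates[:100]) - min(rates[:100])
late_spread = max(rates[1000:]) - min(rates[1000:])
print(early_spread, late_spread)
```

Plotting `rates` against the iteration count gives a curve of the same kind as the one used to fix *N*: once the running percentage only oscillates around a stable value, further iterations add little statistical information.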

### Performance analysis

For the performance analysis, two parameters were considered: the SNR level and the decimation factor *D*. In particular, the considered decimation factors *D* ranged from 2 to 5. Decimation factors larger than *D* = 5 would provide a limited complexity reduction of the algorithm which would diminish the benefit of the proposed solution.

For *D* = 2, the algorithm performance is highly degraded. This is due to the almost complete overlap of the two distributions for almost all of the SNR levels considered. The opposite holds for larger decimation factors: for *D* = 4 or *D* = 5, the algorithm performance is in line with the traditional implementation of the matched filter.

### Power consumption

The run-time adaptation of the decimation factor *D* produces a dynamic workload for the architecture responsible for the algorithm implementation. In this scenario, the utilization of dynamical power management techniques (e.g. DVFS) can potentially have a large impact on the power consumption, enabling the architecture to always run at the lowest power profile, minimizing power and energy consumption. In order to study the power saving obtained through the coupling of the proposed algorithm and DVFS, the implementation of the proposed algorithm on a reference architecture was considered. Details about the reference architecture can be found in [19].

**Characterization of the maximum operating frequency and power consumption as a function of supply voltage**

| Normalized supply voltage (Vdd) | Normalized maximum operating frequency | Normalized power consumption |
|---|---|---|
| 0.5 | 0.30 | 0.08 |
| 0.55 | 0.38 | 0.12 |
| 0.6 | 0.47 | 0.16 |
| 0.65 | 0.56 | 0.22 |
| 0.7 | 0.64 | 0.28 |
| 0.75 | 0.71 | 0.36 |
| 0.8 | 0.78 | 0.44 |
| 0.85 | 0.84 | 0.55 |
| 0.9 | 0.90 | 0.67 |
| 0.95 | 0.95 | 0.81 |
| 1.00 | 1.00 | 1.00 |

The following table reports the resulting operating points of the reference architecture for the different decimation factors *D*.

**Proposed algorithm's performance in terms of complexity reduction and power consumption for different decimation factors**

| D factor | Algorithm complexity reduction | Normalized working frequency | Normalized Vdd | Normalized estimated power |
|---|---|---|---|---|
| 2 | 1/2 | 0.50 | 0.65 | 0.22 |
| 3 | 1/3 | 0.67 | 0.75 | 0.36 |
| 4 | 1/4 | 0.75 | 0.8 | 0.44 |
| 5 | 1/5 | 0.80 | 0.85 | 0.55 |
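The mapping from decimation factor to operating point can be reproduced with a short Python sketch. The voltage, frequency and power values are copied from the characterization table; the function names and the selection rule (pick the lowest supply voltage whose maximum frequency still covers the required working frequency of 1 - 1/*D*) are assumptions that happen to match the tabulated results.

```python
# (normalized Vdd, normalized maximum frequency, normalized power),
# copied from the characterization table and sorted by rising Vdd.
OPERATING_POINTS = [
    (0.50, 0.30, 0.08), (0.55, 0.38, 0.12), (0.60, 0.47, 0.16),
    (0.65, 0.56, 0.22), (0.70, 0.64, 0.28), (0.75, 0.71, 0.36),
    (0.80, 0.78, 0.44), (0.85, 0.84, 0.55), (0.90, 0.90, 0.67),
    (0.95, 0.95, 0.81), (1.00, 1.00, 1.00),
]


def required_frequency(d):
    """Normalized working frequency once one sample in every d is dropped."""
    return 1.0 - 1.0 / d


def lowest_power_point(frequency):
    """Lowest-voltage operating point able to sustain the given frequency.
    Assumption: the architecture can be set to any tabulated point."""
    for vdd, f_max, power in OPERATING_POINTS:
        if f_max >= frequency:
            return vdd, power
    raise ValueError("no operating point sustains this frequency")


for d in (2, 3, 4, 5):
    vdd, power = lowest_power_point(required_frequency(d))
    saving = round((1.0 - power) * 100)
    print(d, vdd, power, saving)
```

For *D* = 5 the selected point is a normalized Vdd of 0.85 with a normalized power of 0.55, that is, a 45% power saving for a 20% reduction in complexity.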

## Conclusions

In this research paper, we have proposed a reduced-complexity algorithm for timing synchronization. The complexity reduction is based on the approximate computing paradigm. Timing synchronization in the W-CDMA system was considered as the case study. The study of different decimation factors *D* showed how the pruning of MAC operations impacts the overall performance of the system. Decimation factors of 4 and 5 proved to deliver performance in line with the traditional algorithm. At the same time, the computational load was reduced by a factor of 1/*D*, leading to more computationally efficient solutions. The algorithm performance, together with the power characterization of a reference architecture for the implementation of the proposed algorithm, showed that power consumption is reduced by 45% already for a decimation factor *D* = 5.

## Declarations

### Acknowledgements

The authors would like to thank M.Sc. Emma Jokinen (Aalto University - Espoo, Finland) for the inputs and for the nice discussions on the topic that have led to the results presented in this research work.

This work was funded by the Academy of Finland under contract no. 258506 (DEFT: Design of a Highly-parallel Heterogeneous MP-SoC Architecture for Future Wireless Technologies).

## References

1. Masera G, Baghdadi A, Kienle F, Moy C: Flexible radio design: trends and challenges in digital baseband implementation. *VLSI Des* 2012, 2012: Article ID 549768. doi:10.1155/2012/549768
2. Reed J: *Software Radio - A Modern Approach to Radio Engineering*. Prentice-Hall, Englewood Cliffs, NJ; 2002.
3. Dejonghe A, Bougard B, Pollin S, Craninckx J, Bourdoux A, Van der Perre L, Catthoor F: Green reconfigurable radio systems. *IEEE Signal Process. Mag* 2007, 24(3):90-101.
4. Oates AS: Reliability challenges for the continued scaling of IC technologies. In *Proc. of 2012 IEEE Custom Integrated Circuits Conference (CICC)*. San Jose, CA; 9–12 Sept 2012:1-4.
5. de Kruijf M, Nomura S, Sankaralingam K: Relax: an architectural framework for software recovery of hardware faults. In *Proc. of the 37th Annual International Symposium on Computer Architecture, ISCA*. ACM, New York; 2010:497-508.
6. Li X, Yeung D: Application-level correctness and its impact on fault tolerance. In *Proc. of the IEEE 13th International Symposium on High Performance Computer Architecture, (HPCA)*. Scottsdale, AZ; 10–14 Feb 2007:181-192.
7. Misailovic S, Sidiroglou S, Hoffmann H, Rinard M: Quality of service profiling. In *Proc. of the 32nd ACM/IEEE International Conference on Software Engineering, (ICSE) - Volume 1*. ACM, New York; 2010:25-34.
8. Raghunathan A, Roy K: Approximate computing: energy-efficient computing with good-enough results. In *Proc. of the IEEE 19th International On-Line Testing Symposium (IOLTS)*. Chania; 8–10 July 2013:258-258.
9. Sampson A, Dietl W, Fortuna E, Gnanapragasam D, Ceze L, Grossman D: EnerJ: approximate data types for safe and general low-power computation. In *Proc. of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI ‘11*. ACM, New York; 2011:164-174.
10. Hanzo L, Keller T: *OFDM and MC-CDMA: A Primer*. Wiley, New York; 2006.
11. UMTS. Accessed 15 Oct 2014. http://www.umtsworld.com/technology/wcdma.htm
12. Bahl SK: Cell searching in WCDMA. *IEEE Potentials* 2003, 22(2):16-19.
13. Grayver E, Frigon J-F, Eltawil AM, Tarighat A, Shoarinejad K, Abbasfar A, Cabric D, Daneshrad B: Design and VLSI implementation for a WCDMA multipath searcher. *IEEE Trans. Veh. Tech* 2005, 54(3):889-902. doi:10.1109/TVT.2005.844664
14. Li C-F, Chu Y-S, Sheen W-H: Low-power design for cell search in W-CDMA. In *Proc. of International Symposium on Circuits and Systems ISCAS, vol. 4*. Vancouver, Canada; 23–26 May 2004.
15. Wang Y-PE, Ottosson T: Cell search in W-CDMA. *IEEE J. Sel. Area Comm* 2000, 18(8):1470-1482.
16. Korde MS, Gandhi AS: Improved design for slot synchronization in WCDMA cell search. In *Proc. of 2012 International Conference on Advances in Mobile Network, Communication and Its Applications (MNCAPPS)*. Bangalore; 1–2 Aug 2012:75-78.
17. Airoldi R, Nurmi J: Design of a matched filter for timing synchronization. In *Proc. of the 2013 Conference on Design and Architectures for Signal and Image Processing (DASIP)*. Cagliari; 8–10 Oct 2013:247-251.
18. Ma D, Bondade R: Enabling power-efficient DVFS operations on silicon. *IEEE Circ. Syst. Mag* 2010, 10(1):14-30.
19. Campi F, Airoldi R, Nurmi J: Design of a flexible, energy efficient (Auto)correlator block for timing synchronization. In *Proc. of IEEE Computer Society Annual Symposium on VLSI ISVLSI’14*. Tampa, FL; 9:1-6.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.