FPGA Implementation of an MUD Based on Cascade Filters for a WCDMA System

The VLSI architecture, targeted on FPGAs, of a multiuser detector based on a cascade of adaptive filters for asynchronous WCDMA systems is presented. The algorithm is briefly described. This paper focuses mainly on real-time implementation and on a design methodology that exploits modern programmable-logic technology and overcomes the limitations of commercial tools. The dedicated architecture, based on a regular structure of processors and a special memory structure exploiting the FPGA architecture, maximizes the processing rate. The proposed architecture was validated using synthesized data in UMTS communication scenarios. The performance goal is to maximize the number of users for different types of WCDMA data traffic. This dedicated architecture can be used as an intellectual property (IP) core performing the MUD function in the system-on-programmable-chip (SOPC) of UMTS systems. The targeted FPGA components are the Virtex-II and Virtex-II Pro families of Xilinx.


INTRODUCTION
The third generation (3G) of mobile wireless communication was adopted for high-throughput services and the effective utilization of spectral resources. This work focuses on the Universal Mobile Telecommunications System (UMTS), which adopts the wideband code-division multiple-access (WCDMA) scheme. The desired data throughputs for 3G UMTS systems are 144 kbps for vehicular, 384 kbps for pedestrian, and 2 Mbps for indoor environments [1,2]. Receivers in 3G systems must take into account not only intersymbol interference (ISI) but also, more importantly, multiple-access interference (MAI), which increases radically with the number of users and the data rates. Multiuser detectors (MUDs) are applied to eliminate the MAI and have become essential for the efficient deployment of 3G wireless networks [3]. The algorithmic aspect of MUD has been an important research issue over the last decade (e.g., [3][4][5][6]). The real-time implementation of MUDs is also well documented (e.g., [6][7][8][9]), and rapid prototyping targeted on field-programmable gate arrays (FPGAs) has been proposed [10][11][12]. These works reveal several limitations in practical systems in terms of timing, algorithm, and hardware constraints (e.g., arithmetic complexity, memory access requirements, data flow) [5][6][7]. Moreover, no work has sought to maximize the number of users on a chip (or on a device in the case of FPGAs). Maximizing the number of users makes it possible to increase the capacity of a cell and to support multiantenna processing.
Because minimum-mean-square-error (MMSE)-based receivers allow for a significant gain in performance, the adaptive two-stage linear cascade filter MUD (CF-MUD) based on MMSE receivers proposed in [13] offers a good tradeoff between performance and complexity. This algorithm has low complexity and a regular structure well suited to FPGA implementation. The CF-MUD is based on two blocks, signature and detection, which are briefly described in Section 2. Each block acts as a filter in order to cancel the ISI and MAI. In previous works [14,15], FPGA implementations of the signature block were presented. Based on the CF-MUD algorithm, this paper describes a complete design architecture, including both the signature and detection blocks, targeted on recent FPGA components: the Virtex-II and Virtex-II Pro of Xilinx.
The rest of the paper is organized as follows. Section 2 presents a brief description of the system model and the adaptive MUD algorithm considered in this paper. Section 3 introduces the VLSI architecture of the present MUD targeted on the Virtex-II and Virtex-II Pro components. Section 4 describes the implementation methodology and Section 5 presents the results. Section 6 presents a few conclusions.

DS-CDMA baseband model
In a direct-sequence CDMA (DS-CDMA) baseband system model, we consider $K$ mobile users transmitting symbols from the alphabet $\Xi = \{-1, 1\}$. Each user's symbol is spread by a pseudonoise (PN) sequence of length $N_c$ called the specific signature code. $T$ denotes the symbol period and $T_c$ denotes the chip period, where $N_c = T/T_c$ is an integer. User $k$'s $n$th transmitted symbol is $b_k^{(n)}$. The baseband signal received at the base transceiver station (BTS) is then expressed in terms of the following quantities: $t$ denotes the time; $L_k$ is the number of propagation paths; $h_{k,l}^{(n)}$ and $\tau_{k,l}$ are, respectively, the complex gain and the propagation delay of path $l$ for user $k$; $N_b$ represents the number of transmitted symbols; $A_k$ is the transmitted amplitude of user $k$; $s_k^{(n)}$ is the specific signature of user $k$; and $\eta(t)$ is the additive white Gaussian noise (AWGN) with variance $\sigma_\eta^2$. To increase the performance and capacity of communication systems, the ISI and MAI must be minimized. It is therefore essential to design MUD processing able to cancel these interferences. The following gives a brief description of the CF-MUD [13].
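For orientation, the symbols defined in this subsection fit the received-signal model that is commonly used for multipath DS-CDMA uplinks, written below. This is a sketch of that standard form and not necessarily the exact expression of the original equation: the name $r(t)$ for the continuous-time received signal and the summation limits are assumptions.

```latex
r(t) = \sum_{n=0}^{N_b-1} \sum_{k=1}^{K} A_k \, b_k^{(n)}
       \sum_{l=1}^{L_k} h_{k,l}^{(n)} \, s_k^{(n)}\!\left(t - nT - \tau_{k,l}\right)
       + \eta(t)
```

In this form, the inner sum over $l$ collects the $L_k$ delayed and scaled copies of user $k$'s signature, the sum over $k$ is the source of the MAI, and the overlap of delayed copies across successive values of $n$ is the source of the ISI.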

Cascade filter multiuser detector
The block diagram of the multiuser detector CF-MUD to be implemented on an FPGA is shown in Figure 1 [13]. We can distinguish two blocks: signature and detection. Each block acts as an adaptive filter for canceling the ISI and MAI. The proposed linear adaptive MUD is based on the least-mean-square (LMS) adaptation method. This filter, however, needs training data sequences to adapt the filter coefficients. Unlike the time-division multiple-access (TDMA) scheme used in Global System for Mobile Communications (GSM) systems, UMTS systems do not give access to preknown data, with the exception of pilot bits, for adjusting the filter coefficients. It is important to note that, to ensure convergence, both block filters need more than the pilot bits available in a fast-fading context. Preknown training data sequences $r_{\text{train}}$ are therefore generated internally, based on the channel parameters (amplitudes and delays) obtained from the channel-estimation technique.
The principle of CF-MUD is briefly described in Figure 2. The switch models the training phase and detection phase. The first block of the CF-MUD, the signature block, adapts the signatures of the users without prior knowledge of their PN codes. In the first step, we synchronized the received signal r(n) based on the estimated propagation delays for each user.
The following notations are used: $\hat{x}$ is the estimated value of $x$; $y_k(n)$ is the adaptation output of user $k$; $w_k(n)$ is the vector of filter coefficients of user $k$; $b_k^{\text{train}}(n)$ is the synthetic transmitted training data sequence; $r_{\text{train}}(n)$ is the synthetic received training data vector generated from the $b_k^{\text{train}}(n)$ transmitted through the estimated channel parameters; $\alpha_k(n)$ is the adaptation error of the signature; and $\mu$ is the adaptation step of the adaptive filters in the signature block.
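The signature-block adaptation described above follows the usual complex LMS pattern of filter output, error, and coefficient update. The sketch below is a minimal pure-Python illustration of that pattern, not the authors' implementation: the conjugation convention ($y = w^H r$), the two length-8 spreading codes, the noise level, and the helper names `lms_step`, `train_signature_filter`, and `make_training_set` are all assumptions made for illustration.

```python
import random


def lms_step(w, r, b, mu):
    """One complex LMS iteration: filter output, adaptation error, coefficient update."""
    y = sum(wi.conjugate() * ri for wi, ri in zip(w, r))   # y = w^H r (assumed convention)
    alpha = b - y                                          # adaptation error
    w_next = [wi + mu * alpha.conjugate() * ri for wi, ri in zip(w, r)]
    return w_next, y, alpha


def train_signature_filter(r_train, b_train, mu):
    """Adapt the coefficients of one user over a whole training sequence."""
    w = [0j] * len(r_train[0])
    for r, b in zip(r_train, b_train):
        w, _, _ = lms_step(w, r, b, mu)
    return w


def make_training_set(n_iter, seed=0):
    """Toy synchronous two-user data with hypothetical length-8 codes and light noise."""
    rng = random.Random(seed)
    s1 = [1, 1, 1, 1, -1, -1, -1, -1]   # code of the user of interest (assumption)
    s2 = [1, -1, 1, -1, 1, -1, 1, -1]   # code of the interfering user (assumption)
    r_train, b_train = [], []
    for _ in range(n_iter):
        b1, b2 = rng.choice((-1, 1)), rng.choice((-1, 1))
        r = [complex(b1 * c1 + b2 * c2 + rng.gauss(0, 0.05), rng.gauss(0, 0.05))
             for c1, c2 in zip(s1, s2)]
        r_train.append(r)
        b_train.append(b1)
    return r_train, b_train
```

Choosing the step as a power of two, for example `mu = 2 ** -5`, mirrors the hardware-friendly choice discussed in Section 3, where the multiplication by the adaptation step reduces to a right shift.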
The detection block aims to suppress the residual MAI and ISI based on the data of all users estimated from the output signal of the signature block. From the outputs of all users, we formed a vector $y_T(n)$ at the output of the signature block. In the training phase, the detection filter of each user $k$ (for $k = 1, 2, \ldots, K$) is adapted by an LMS recursion in which $v_{Tk}(n) = [v_{1,k}(n), v_{2,k}(n), \ldots, v_{3K,k}(n)]^T$ is the filter coefficient vector of user $k$, with $\dim(v_{Tk}(n)) = \dim(y_T(n)) = 3K \times 1$; $o_k(n)$ is the adaptation output of user $k$, corresponding to the output of the respective adaptive filter; $\beta_k(n)$ is the adaptation error of detection; and $\nu$ is the adaptation step of the adaptive filters in the detection block.
In the detection phase, the transmitted data of the mobile users are first estimated by the signature block. In the detection block, the transmitted data of the users are then estimated from $y_T(n)$, which gives $o_k(n)$. Finally, the estimated bits $\hat{b}_k(n)$ are found by simply taking the sign of $o_k(n)$. When the adaptation process was completed, we applied (8), (9), and (10) to propagate the signal $r(n)$ through the CF-MUD.

Performance evaluation of CF-MUD
Figure 3 depicts the algorithmic performance, in terms of the block error rate (BLER), of the CF-MUD algorithm compared with the RAKE receiver and the soft multistage parallel interference canceler (Soft-MPIC) in a WCDMA platform [3]. Simulations were run for one antenna, with perfect channel estimation, the Vehicular A channel defined by the International Telecommunication Union (ITU) [16], a mobile speed of 3 km/h, a data rate of 64 kbps, and 15 users. At a target BLER of 10%, we observed a gain of 1.9 dB for the CF-MUD compared with the Soft-MPIC, whereas the RAKE receiver cannot reach a BLER of 10%. No decision feedback has been considered for the CF-MUD and the Soft-MPIC. Although MUD with decision feedback is considered superior to MUD without it, decision feedback creates a serious data dependency that hinders parallelizing the implementation on many devices.
Based on CF-MUD equations (2)-(10), the proposed FPGA-targeted architecture is described in Section 3.
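The training and decision steps of the detection block in Section 2 can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than the authors' code: the layout of the $3K$-element vector (signature outputs of all users at symbols $n-1$, $n$, and $n+1$), the conjugation convention, the toy data with a fixed leakage between two users, and all function names are hypothetical.

```python
import random


def build_y_T(y, n):
    """Stack the K signature outputs at symbols n-1, n, n+1 (assumed layout, 3K entries)."""
    return list(y[n - 1]) + list(y[n]) + list(y[n + 1])


def filter_output(v, y_T):
    """Detection filter output o_k(n) = v^H y_T(n) (assumed convention)."""
    return sum(vi.conjugate() * yi for vi, yi in zip(v, y_T))


def train_detection_filter(y, b_train_k, nu):
    """LMS adaptation of the 3K detection coefficients of one user."""
    v = [0j] * (3 * len(y[0]))
    for n in range(1, len(y) - 1):
        y_T = build_y_T(y, n)
        beta = b_train_k[n] - filter_output(v, y_T)        # adaptation error of detection
        v = [vi + nu * beta.conjugate() * yi for vi, yi in zip(v, y_T)]
    return v


def detect_bit(v, y_T):
    """Hard decision: sign of the real part of the filter output."""
    return 1 if filter_output(v, y_T).real >= 0 else -1


def toy_signature_outputs(n_sym, leak=0.3, seed=0):
    """Two users whose signature outputs still contain residual MAI (hypothetical data)."""
    rng = random.Random(seed)
    bits = [(rng.choice((-1, 1)), rng.choice((-1, 1))) for _ in range(n_sym)]
    y = [(complex(b0 + leak * b1), complex(b1 + leak * b0)) for b0, b1 in bits]
    return y, bits
```

In this toy setting, the trained filter learns to subtract the leaked contribution of the other user, so its output moves from the contaminated signature output toward the clean training bit.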

VLSI ARCHITECTURE TARGETED ON FPGA
The developed architecture should be reconfigurable for several baseband processing UMTS systems characterized by the number of users $K$ and by different communication scenarios at different mobile speeds. Thus, it can be reconfigured while respecting WCDMA, hardware, and algorithmic constraints. The main WCDMA constraints [2] are the data rates, that is, an orthogonal variable spreading factor (OVSF) of 64, 16, 8, or 4 corresponding, respectively, to 12.2 kbps (voice rate), 64 kbps, 144 kbps, and 384 kbps; a time frame of 38400 chips in 10 milliseconds; and a mobile speed of 3 km/h to 100 km/h. The main algorithmic constraints, with respect to MUD performance, are the number of adaptation iterations in the signature filter and the detection filter, the adaptation steps $\mu$ and $\nu$, and the quantization scales needed to respect the arithmetic precision in fixed point.
The main hardware constraints take into account the limitations of the targeted FPGAs in terms of the number of dedicated multipliers, the number of block RAMs (BRAMs), and the memory size of each BRAM [17].
These constraints were also used in our method of resource estimation before synthesis. The architecture must respect real-time constraints bounded by the frame time, to detect all data frames, and by the adaptation time, to adapt all coefficients ($w$ and $v$) depending on the mobile speed.
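The frame-level WCDMA constraints above translate into simple arithmetic. The short sketch below derives the chip rate and the number of symbols per frame for each OVSF from the stated figures of 38400 chips per 10-millisecond frame; the constant and function names are illustrative only.

```python
CHIPS_PER_FRAME = 38400      # chips in one frame (from the WCDMA constraints)
FRAMES_PER_SECOND = 100      # one frame lasts 10 milliseconds


def chip_rate_hz():
    """Chip rate implied by the frame definition."""
    return CHIPS_PER_FRAME * FRAMES_PER_SECOND


def symbols_per_frame(ovsf):
    """Number of symbols a user sends per frame at a given spreading factor."""
    return CHIPS_PER_FRAME // ovsf


if __name__ == "__main__":
    print("chip rate:", chip_rate_hz(), "chips/s")
    for ovsf in (64, 16, 8, 4):
        print("OVSF", ovsf, "->", symbols_per_frame(ovsf), "symbols per frame")
```

A lower OVSF therefore means more symbols to detect within the same 10-millisecond frame, which is why the higher data rates are the harder real-time cases.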
The block diagram of the pipelined architecture is based on two stages of a modular array structure of processing elements (PEs), shown in Figure 4. Figure 5 illustrates the mapping of the CF-MUD algorithm on the array of PEs and the internal memories (inside the FPGA). These PEs consist of optimized cores performing the adaptive filtering defined by (2)-(4), which we call $\mathrm{PE}_{\mathrm{LMS}}$; this includes the straightforward filtering defined by (2), which we call $\mathrm{PE}_{\mathrm{FIR}}$. The regularity of the CF-MUD makes it possible to time multiplex a number of users, that is, a single PE processes several users by time-multiplexed selection. The time multiplexing, that is, the number of users per PE, in the signature and detection blocks is defined by $T_{\mathrm{MUX}1}$ and $T_{\mathrm{MUX}2}$, respectively. Thus, the number of $\mathrm{PE}_{\mathrm{LMS}}$ and $\mathrm{PE}_{\mathrm{FIR}}$ inside each block is the same and is represented by $N_{\mathrm{MUX}1}$ and $N_{\mathrm{MUX}2}$ for the signature and detection blocks, respectively. All PEs operate on normalized fixed-point complex-valued signals and use the same time multiplexing.
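The relation between the number of users, the time multiplexing, and the number of PEs can be made concrete with a small sketch. It assumes that the number of PEs in a block is the number of users divided by the users per PE, rounded up, and that users are assigned to PEs in contiguous groups; both points, as well as the function names, are assumptions for illustration.

```python
def pes_needed(k_users, t_mux):
    """Number of PEs when each PE serves up to t_mux users (ceiling division)."""
    return -(-k_users // t_mux)


def user_schedule(k_users, t_mux):
    """Contiguous assignment of user indices to PEs (hypothetical policy)."""
    return [list(range(pe * t_mux, min((pe + 1) * t_mux, k_users)))
            for pe in range(pes_needed(k_users, t_mux))]


if __name__ == "__main__":
    print(pes_needed(16, 4))
    print(user_schedule(5, 2))
```

A larger time multiplexing reduces the number of PEs, and hence of multipliers, at the price of a longer processing time per PE.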
The data and address paths are independent to permit maximum simultaneous direct access to data and addresses. Two different external SDRAM memories and two different memory buffers (InputBuffer and OutputBuffer) are used to allow independent access to input and output, and thus to maximize the multiple-path access to external input/output. These memory buffers are implemented by the lookup-table (LUT)-based distributed memory of FPGAs. The memory buffers InputBuffer and OutputBuffer are multiport. The buffer InternalBuffer is used to store intermediate results from the signature filter, which are the input to the detection filter. It is implemented by LUT-based distributed memories. The first-in first-out (FIFO) buffers Serial2Parallel and Parallel2Serial are used to minimize the utilization of the input-output (IO) pins of the FPGA and also to minimize the number of external memories. These buffers are implemented by the LUT-based distributed memory of FPGAs as well. The PEs of the architecture use semiglobal internal BRAM-based memories, that is, a certain number of PEs have access to the same memory. This number is defined by the possible time multiplexing determined from the architectural specification step.
We used an advanced scheduling based on time multiplexing, obtained by modifying the conventional methods, that is, As Soon As Possible (ASAP) and As Late As Possible (ALAP). This scheme relies on the fact that ASAP gives low latency, while ALAP gives high latency but uses fewer hardware resources [18]. Modifying these two methods jointly makes it possible to balance the latency while exploiting the particular features of the targeted FPGAs. The constraints of this scheduling are to use only two real dedicated multipliers and a minimum number of multiplexers and other arithmetic operators (adders). This method exploits the symmetric structure of these FPGA components, especially the shared connection between BRAMs and the dedicated multipliers. Using two real multipliers to implement a complex multiplication, which comprises four real multiplications, makes it possible to use this shared connection between the dedicated multiplier and the BRAM. Minimizing the number of multiplexers shortens the critical path of the circuit.
Figure 5: Mapping the CF-MUD on processing elements and internal memories.
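To illustrate how a complex multiplication with four real products can be time multiplexed on two real multipliers, the sketch below splits the work into two passes. The assignment of products to passes is an assumption for illustration and is not taken from the scheduling of the paper; the function name is hypothetical.

```python
def complex_multiply_two_multipliers(a_re, a_im, b_re, b_im):
    """Compute (a_re + j a_im) * (b_re + j b_im) with two multipliers used twice."""
    # Pass 1: both multipliers produce the terms of the real part.
    m1 = a_re * b_re
    m2 = a_im * b_im
    real = m1 - m2          # adder-subtracter in subtract mode
    # Pass 2: the same two multipliers produce the terms of the imaginary part.
    m1 = a_re * b_im
    m2 = a_im * b_re
    imag = m1 + m2          # adder-subtracter in add mode
    return real, imag


if __name__ == "__main__":
    print(complex_multiply_two_multipliers(3, 4, 1, -2))
```

Reusing the two multipliers over two passes halves the multiplier count per PE compared with a fully parallel complex multiplier, which matters when the number of dedicated multipliers is the binding resource.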
The fine-grain pipeline of the PEs, shown in Figure 6(a), uses the dedicated 2-level pipelined multipliers available on the silicon die of Xilinx FPGA devices. To understand the PE functionality, consider the complex-number multiplication and summation of (2), in which the summation runs up to $N_T$, which is $N_c$ for signature filters and $3K$ for detection filters, together with the coefficient update of (4) driven by the error of (3). In these operations, $\Re(x)$ and $\Im(x)$ denote the real and imaginary parts of $x$, and $R_{re}$ and $R_{im}$ represent the accumulation registers for the real and imaginary parts.
EURASIP Journal on Applied Signal Processing
Figure 6(b) illustrates the scheduling and register-transfer logic (RTL) mapping of $\mathrm{PE}_{\mathrm{LMS}}$, including $\mathrm{PE}_{\mathrm{FIR}}$, to implement the complex-number filter using two real-number multipliers, where Ax and Mx (x = 1, 2, 3) are, respectively, the adder and the multiplier units. Unit A1 is an adder-subtracter used for the addition or subtraction in the real part of (2). Unit A3 is a subtracter used to calculate the adaptation error in (3). Saturation is applied at the output of these operational units to maintain the width of the data bus. In this figure, the subscripts "r" and "i" denote the real and the imaginary parts of the variables, respectively. Registers $R_{re}$ ($R_{im}$) and $R_0$ ($R_1$) correspond to $\Re(w_{k,i}(n))$ ($\Im(w_{k,i}(n))$) and $\Re(w_{k,i}(n+1))$ ($\Im(w_{k,i}(n+1))$), respectively. Registers $R_0$ ($R_1$) are used as pipeline registers allowing for two concurrent additions in the multiplier-accumulator (MAC) and the complex multiplications in (2) and (4). Two registers are added before the inputs of the adders Ax to pipeline without hazards. The IO of a PE can be registered or not, which helps the processor to interface with other components of the system. The shift-to-right operation implements, without any multiplier hardware, the multiplication by the adaptation steps $\mu$ and $\nu$, whose values are of the form $2^{-n}$.
The execution time of an adder is one clock cycle ($T_{clk}$) and that of a multiplier is two cycles. For a filter with $N$ complex taps, the throughput in clock cycles is $(2N + 5)$ for the adaptation process and $(2N + 4)$ for the detection process. Thus, the throughputs of $\mathrm{PE}_{\mathrm{LMS}}$ (including the adaptation and detection processes) and $\mathrm{PE}_{\mathrm{FIR}}$ (including the detection process only) are, respectively, $(3N + 9)$ and $(2N + 5)$. As a result, the throughputs of the signature block and the detection block are, respectively, $(3N_c + 9)$, $(2N_c + 5)$ and $(9K + 9)$, $(6K + 5)$.
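The two fixed-point devices mentioned above, scaling by a power-of-two adaptation step through a right shift and saturation at the output of the arithmetic units, can be sketched as follows. The 16-bit word length and the function names are assumptions for illustration; the paper does not state the word length in this passage.

```python
def scale_by_step(x, n):
    """Multiply a two's-complement integer by 2**-n using an arithmetic right shift."""
    return x >> n            # Python's >> on int rounds toward minus infinity


def saturate(x, bits=16):
    """Clamp x to the range of a signed word of the given width (assumed 16 bits)."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))


def scaled_update(coeff, product, n, bits=16):
    """Coefficient update with a shifted correction term and a saturated sum."""
    return saturate(coeff + scale_by_step(product, n), bits)
```

Saturation keeps the bus width constant at the cost of clipping on overflow, which is usually preferable to the sign flip that plain wrap-around would produce in an adaptive loop.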
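The cycle counts quoted above can be tabulated with a few helper functions. The formulas are the ones stated in the text for $\mathrm{PE}_{\mathrm{LMS}}$ and $\mathrm{PE}_{\mathrm{FIR}}$; the example values $N_c = 64$ and $K = 16$ and the function names are assumptions chosen only to show the arithmetic.

```python
def pe_lms_cycles(n_taps):
    """Clock cycles of a PE_LMS for a filter with n_taps complex taps."""
    return 3 * n_taps + 9


def pe_fir_cycles(n_taps):
    """Clock cycles of a PE_FIR for a filter with n_taps complex taps."""
    return 2 * n_taps + 5


def block_cycles(n_c, k_users):
    """(PE_LMS, PE_FIR) cycles per block: N = n_c for signature, N = 3K for detection."""
    return {
        "signature": (pe_lms_cycles(n_c), pe_fir_cycles(n_c)),
        "detection": (pe_lms_cycles(3 * k_users), pe_fir_cycles(3 * k_users)),
    }


if __name__ == "__main__":
    print(block_cycles(n_c=64, k_users=16))
```

The signature block cost depends only on the code length, whereas the detection block cost grows linearly with the number of users, so the detection block becomes the slower stage once $3K$ exceeds $N_c$.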
The coarse-grain pipeline data-flow strategy in the system level of the architecture is detailed in Figures 7 and 8 for the adaptation and detection processes, respectively. The strategy depends on the processing time between signature block, detection block, and the adaptation and detection processes.

IMPLEMENTATION METHODOLOGY
This paper focuses on the hardware (HW) design flow of the MUD based on a library of optimized hard IP cores, for example, the complex-tap FIR filters used as PEs for the adaptive MUD. It is necessary to estimate, from the architectural specifications, the timing performance and the HW resources required by architectures satisfying these constraints. To reach the maximum number of users ($K$) for the two Xilinx device families, a program based on a nonlinear integer-programming model was developed. This nonlinear integer program is solved by the branch-and-bound method [19]. The model makes it possible to estimate the performance requirements and the limitations of the FPGA HW resources. The tool is used to maximize the time multiplexing (number of users in one PE) and the timing performance (number of clock cycles) of the system, while respecting the algorithmic constraints and the HW resource limitations (number of multipliers and BRAMs). It is also necessary to minimize the clock rate to limit power consumption. The program is helpful for choosing a suitable type of architecture, in terms of pipeline strategy, for the algorithmic specification of the MUD. Conversely, the tool can also be used to estimate the necessary HW resources and timing performance.
For the specific architecture of the CF-MUD algorithm developed for these FPGA devices (Virtex-II Pro and Virtex-II), the objective is to maximize the number of users $K_{\mathrm{MAX}}$ subject to nonlinear inequality constraints, (13) and (14), with $T_{\mathrm{MUX}2}$ an integer satisfying the pipeline strategy of the HW architecture.
In these expressions, $N_{\mathrm{MEM}}$ is the number of data per BRAM, $N_{\mathrm{chip}}$ is the number of chips, $N_m$ is the maximum number of dedicated multipliers available on the silicon die of these FPGA components [17], $N_{\mathrm{cycle}}$ is the number of cycles (throughput) needed to solve the CF-MUD on the FPGA (Section 3), and $N_{A1}$ and $N_{A2}$ are the numbers of adaptation iterations in the signature and detection blocks, respectively. We consider the variables $N_{A1}$, $N_{A2}$, OVSF, and $t$ as constraints. The inequalities, defined by the straightforward functions $f(\cdot)$ and $g(\cdot)$ from (13) and (14), are built from the constraints stated in Section 3 and the dedicated FPGA architecture.
Since verification is critical in the design flow, dynamic verification by simulation is used throughout. The results of fixed-point simulations in a high-level language (Matlab) provide a static functional reference for the HW verification of the architecture. The synthesized data are used for the verification in Matlab as well as in the FPGA implementation.
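The flavor of this resource-constrained maximization can be conveyed by a deliberately simplified toy model. It is not the model of (13) and (14) and it does not use branch-and-bound: it scans $K$ linearly and, for each $K$, takes the largest time multiplexing that fits an adaptation cycle budget. Every modeling choice here is an assumption: two multipliers per PE, one PE per group of time-multiplexed users, adaptation cost equal to users per PE times iterations times the $\mathrm{PE}_{\mathrm{LMS}}$ cycle count, no BRAM constraint, and hypothetical function names.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def plan_for(k, n_c, n_a1, n_a2, n_mult, cycle_budget):
    """Return (t_mux1, t_mux2) for k users in the toy model, or None if infeasible."""
    # Largest users-per-PE that still meets the adaptation cycle budget in each block.
    t1 = min(k, cycle_budget // (n_a1 * (3 * n_c + 9)))
    t2 = min(k, cycle_budget // (n_a2 * (9 * k + 9)))
    if t1 < 1 or t2 < 1:
        return None
    multipliers = 2 * (ceil_div(k, t1) + ceil_div(k, t2))   # two per PE (assumption)
    return (t1, t2) if multipliers <= n_mult else None


def max_users(n_c, n_a1, n_a2, n_mult, cycle_budget, k_limit=128):
    """Largest feasible k in 1..k_limit with its time multiplexing, or (0, 0, 0)."""
    best = (0, 0, 0)
    for k in range(1, k_limit + 1):
        plan = plan_for(k, n_c, n_a1, n_a2, n_mult, cycle_budget)
        if plan is not None:
            best = (k, plan[0], plan[1])
    return best


if __name__ == "__main__":
    # Hypothetical budget: 40 ms of adaptation time at an assumed 100 MHz clock.
    print(max_users(n_c=64, n_a1=400, n_a2=400, n_mult=8, cycle_budget=4_000_000))
```

Even this toy version shows the tension that the full model resolves: raising the time multiplexing saves multipliers but consumes the adaptation time budget, so the maximum number of users is set by whichever constraint binds first.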

RESULTS
The HW architecture targets the Virtex-II and Virtex-II Pro components of Xilinx to satisfy different algorithmic and WCDMA specifications in real time.
Tables 1 and 2 summarize the maximum number of simultaneous users ($K_{\mathrm{MAX}}$) that can be processed in monorate on different devices of the Virtex-II and Virtex-II Pro families at different data rates based on the UMTS 3G standard. The data throughputs are fixed by the OVSF parameter, with values of 64, 16, 8, and 4 corresponding, respectively, to 12.2 kbps (voice rate), 64 kbps, 144 kbps, and 384 kbps (the last three throughputs are for data) [2]. We assumed three mobile speeds: slow fading ($T_A$ = 40 milliseconds), medium fading ($T_A$ = 10 milliseconds), and fast fading ($T_A$ = 2 milliseconds), where $T_A$ represents the allowed adaptation time of the CF-MUD coefficients ($w$ and $v$) [20]. Considering the short code of 256 chips, the number of adaptation iterations is $100 \cdot (256/\mathrm{OVSF})$ for each user $k$ of the signature and detection blocks. We used the same number of adaptation iterations for the hardware estimation.
While the allowed adaptation time constraint varies with the mobile speed, the allowed detection time is always limited to 10 milliseconds, which is the duration of a frame. Tables 3 and 4 summarize the utilization ratio of resources on the targeted devices corresponding to the estimated maximum number of users given in Tables 1 and 2, respectively. We observed that the utilization ratio of resources in the fast-fading scenario is low (indicated in gray zones). This is because the shorter adaptation time forces $T_{\mathrm{MUX}1}$ and $T_{\mathrm{MUX}2}$ to be fixed at 1, so only a few resources can be used. However, the number of users can easily be increased by duplicating the same architecture on the device. Hence, $K_{\mathrm{MAX}}$ can easily be increased in fast-moving conditions.
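The adaptation workload implied by these figures is easy to tabulate. The sketch below computes the number of adaptation iterations from the stated rule of 100 times 256/OVSF and divides the allowed adaptation time by it to obtain an average time budget per iteration; that budget is a derived quantity introduced here for illustration, and the names are hypothetical.

```python
SHORT_CODE_CHIPS = 256
ADAPTATION_TIME_MS = {"slow": 40, "medium": 10, "fast": 2}   # allowed T_A per scenario


def adaptation_iterations(ovsf):
    """Number of adaptation iterations per user: 100 * (256 / OVSF)."""
    return 100 * (SHORT_CODE_CHIPS // ovsf)


def time_per_iteration_us(ovsf, fading):
    """Average time available per adaptation iteration, in microseconds."""
    return ADAPTATION_TIME_MS[fading] * 1000 / adaptation_iterations(ovsf)


if __name__ == "__main__":
    for fading in ("slow", "medium", "fast"):
        for ovsf in (64, 16, 8, 4):
            print(fading, ovsf, adaptation_iterations(ovsf), time_per_iteration_us(ovsf, fading))
```

The budget shrinks both with faster fading and with lower OVSF, which is consistent with the fast-fading, high-rate cases being the ones that leave the least room for time multiplexing.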
Note that in these results, the users transmit simultaneously in the same sector. Normally, the number of users should be lower than the value of the OVSF. Thus, users in excess of the OVSF value should be distributed over the other sectors of the BTS. Under these conditions, the number of users per BTS (3 sectors) should be higher than the values indicated in Tables 1 and 2. According to the pipeline strategy of the developed architectures, the total time needed to process a data frame is bounded by the maximum execution time in the signature and detection blocks. In the signature block, the performance is characterized by the adaptation time ($t_{A1}$) and the detection time ($t_{D1}$); in the detection block, by $t_{A2}$ and $t_{D2}$. With the pipeline strategy of the architecture, the processing time in each cascade filter is, respectively, $\max(t_{A1}, t_{D1})$ and $\max(t_{A2}, t_{D2})$, and it must be lower than $T_A$ for adaptation, depending on the slow-, medium-, and fast-fading communication situations. Table 5 summarizes the results of an experimental system for 16 users after placement and routing by the Xilinx physical tool (the ISE foundation) on the Virtex-II Pro component XC2VP30. The results for the data rate in fast-fading conditions are excluded for the system of 16 users because of the limitation of the present architecture in terms of maximum numbers. Again, we can find a slight difference in terms of hardware resources (number of slices) between the results after synthesis in Table 5 and the results before synthesis by our resource-estimator tool in Table 1. This was explained in Section 4 by the absence of a database for FPGA components. We consider only the number of multipliers and BRAMs in our integer nonlinear programming model. Moreover, even with knowledge of the database, resource estimation before synthesis is still difficult [21]. Nevertheless, for the main resources, the numbers of multipliers and BRAMs are exactly the same as in Table 1.

Table 3: Utilization ratio of hardware (%) for $K_{\mathrm{MAX}}$ of Table 1 on different devices of Virtex-II Pro family.

| Device | Slow, OVSF 64 | Slow, OVSF 16 | Slow, OVSF 8 | Slow, OVSF 4 | Medium, OVSF 64 | Medium, OVSF 16 | Medium, OVSF 8 | Medium, OVSF 4 | Fast, OVSF 64 | Fast, OVSF 16 | Fast, OVSF 8 | Fast, OVSF 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| XC2VP2 | 93 | 97 | 98 | 88 | 97 | 88 | 100 | 89 | 79 | 83 | 83 | 39 |
| XC2VP4 | 100 | 100 | 95 | 100 | 100 | 100 | 95 | 100 | 100 | 71 | 57 | 36 |
| XC2VP7 | 96 | 95 | 98 | 95 | 95 | 95 | 95 | 97 | 98 | 95 | 68 | 23 |
| XC2VP20 | 98 | 98 | 98 | 99 | 98 | 99 | 97 | 97 | 100 | 82 | 34 | 11 |
| XC2VP30 | 90 | 100 | 88 | 97 | 100 | 97 | 99 | 94 | 76 | 70 | 22 | 7 |

XC2VP40  89 100 100 99 100 99 92 67 100 78 80 84 93 79 93 85 90 54 67 0 0

Table 4: Utilization ratio of hardware (%) for $K_{\mathrm{MAX}}$ of Table 2 on different devices of Virtex-II family.

| Device | Slow, OVSF 64 | Slow, OVSF 16 | Slow, OVSF 8 | Slow, OVSF 4 | Medium, OVSF 64 | Medium, OVSF 16 | Medium, OVSF 8 | Medium, OVSF 4 | Fast, OVSF 64 | Fast, OVSF 16 | Fast, OVSF 8 | Fast, OVSF 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| XCV80 | 95 | 98 | 91 | 85 | 98 | 85 | 100 | 88 | 86 | 75 | 58 | 58 |
| XCV250 | 98 | 96 | 100 | 90 | 96 | 90 | 97 | 97 | 89 | 83 | 67 | 42 |
| XCV500 | 96 | 98 | 99 | 100 | 98 | 100 | 83 | 88 | 87 | 94 | 100 | 31 |
| XCV1000 | 99 | 98 | 92 | 99 | 98 | 99 | 98 | 100 | 90 | 90 | 75 | 25 |
| XCV1500 | 99 | 100 | 98 | 97 | 100 | 97 | 100 | 100 | 97 | 100 | 62 | 21 |
| XCV2000 | 95 | 98 | 100 | 92 | 98 | 92 | 95 | 98 | 95 | 96 | 54 | 18 |
| XCV3000 | 97 | 99 | 100 | 100 | 99 | 100 | 99 | 100 | 100 | 100 | 31 | 10 |
| XCV4000 | 99 | 100 | 100 | 92 | 100 | 92 | 100 | 80 | 87 | 80 | 25 | 8.3 |
| XCV6000 | 100 | 94 | 100 | 92 | 94 | 92 | 97 | 89 | 72 | 7 | 21 | 6.9 |
| XCV8000 | 100 | 100 | 100 | 79 | 100 | 78 | 98 | 76 | 100 | 57 | 18 | 6.0 |

CONCLUSIONS
The HW architectures of a multiuser detector based on a cascade of adaptive filters (CF-MUD) for WCDMA systems were developed. The CF-MUD, based on FIR filters with an LMS adaptation process, proved to be a good choice for targeting FPGA devices. We have exploited the implementation advantages of the algorithm and the particular features of Xilinx devices. The regularity and recursiveness of the CF-MUD algorithm offer the opportunity to maximize the resource utilization ratio of the FPGA device. With a real-time implementation that takes into account all UMTS constraints, we demonstrated a resource utilization ratio close to 100%, which maximizes the parallelism of the CF-MUD algorithm. These dedicated architectures can later be used as optimized IP cores performing MUD functions. The current HW architectures are purely glue logic. Future work will consist of exploiting software processing in the multirate CF-MUD as a whole, while respecting the constraint specifications of 3G wireless communications.