Gapless DSP applications generally require high throughput to process input streams without missing data points while reliably avoiding memory overflow. In this section, we demonstrate algorithm- and implementation-based optimization methods that help address these multi-faceted implementation constraints. Taking the dataflow graph presented in Section 4 as a starting point, we improve the design by applying a sequence of optimizations. These optimization techniques are described in Section 5.1 through Section 5.3. Experimental results from applying these optimizations are then presented in Section 6.
5.1 Window size optimization
In this section, we discuss optimized, dynamic configuration of the window size parameter \(W_s\), which was introduced in Section 4.1. In our deep jitter measurement system, the window size, along with sorting-related parameters (discussed in Section 5.2) that are directly influenced by \(W_s\), has significant impact on trade-offs among measurement accuracy, execution time performance, and memory requirements.
In jitter measurement systems, the frequencies of the input signals are typically not known at design time and vary dynamically at run-time. A larger window size in general improves the accuracy of signal frequency and TIE estimation, as demonstrated in [16], and also improves throughput. For lower frequencies (larger clock periods), a larger window size is preferred so that each signal frame encapsulates a sufficient number of signal periods.
However, memory requirements increase linearly with the window size. Thus, we initialize our jitter measurement system to support an initial minimum frequency \(f_{\text{init}}\), and we increase the window size dynamically if we encounter signals whose estimated frequencies fall below the currently supported minimum frequency.
More specifically, in our deep jitter measurement system, the window size is dynamically optimized by monitoring the number of high/low signal transitions found in each window. If the number of transitions falls below a threshold \(C_{\text {trt\_num}}\), then the window size for subsequent signal frames is doubled.
In our experiments, we use \(f_{\text{init}} = 130\text{ kHz}\), and we use the empirically determined value of \(C_{\text{trt\_num}} = 32\) transitions per frame. The value of \(C_{\text{trt\_num}}\) can be varied to tune system-level trade-offs: lower threshold values lead to lower memory requirements and faster execution time at the expense of decreased accuracy of gapless signal analysis.
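For concreteness, the following C sketch illustrates this adaptation rule. The function and constant names here are hypothetical (not taken from our implementation), and MAX_WINDOW_SIZE is an assumed cap on the window size that we introduce only to bound memory usage in the sketch.

```c
#include <stddef.h>

#define C_TRT_NUM 32                 /* empirically determined transition threshold */
#define MAX_WINDOW_SIZE (1u << 24)   /* assumed cap on Ws (illustrative only) */

/* Hypothetical sketch of the window size adaptation rule: if the number
 * of high/low signal transitions found in the current frame falls below
 * C_TRT_NUM, the window size Ws used for subsequent frames is doubled. */
size_t adapt_window_size(size_t ws, size_t num_transitions)
{
    if (num_transitions < C_TRT_NUM && 2 * ws <= MAX_WINDOW_SIZE)
        return 2 * ws;   /* double Ws for subsequent signal frames */
    return ws;
}
```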
5.2 Sorting optimization
Sorting operations are involved in two actors of our jitter measurement system, the DVL and RE actors. These operations account for significant portions of the overall computation in a given dataflow graph iteration. We employ bitonic sort [18] in an effort to enhance the efficiency of the sorting process.
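For reference, the comparator network of bitonic sort is sketched below as a serial, in-place C routine for arrays whose length is a power of two. Our GPU implementation follows [18] and maps the compare-and-exchange stages onto parallel work-items; the sketch shows the same access pattern in serial form.

```c
#include <stddef.h>

static void swap_f(float *a, float *b) { float t = *a; *a = *b; *b = t; }

/* Serial sketch of in-place ascending bitonic sort for an array whose
 * length n is a power of two. The outer loop grows bitonic sequences of
 * size k; the inner loops apply compare-and-exchange at distance j. */
void bitonic_sort(float *data, size_t n)
{
    for (size_t k = 2; k <= n; k <<= 1) {
        for (size_t j = k >> 1; j > 0; j >>= 1) {
            for (size_t i = 0; i < n; i++) {
                size_t ixj = i ^ j;          /* partner element index */
                if (ixj > i) {
                    int ascending = ((i & k) == 0);
                    if (( ascending && data[i] > data[ixj]) ||
                        (!ascending && data[i] < data[ixj]))
                        swap_f(&data[i], &data[ixj]);
                }
            }
        }
    }
}
```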
To further improve the efficiency of sorting, we sort only part of the relevant data associated with each signal frame and perform the required analysis on the partially sorted data. This again represents a way to trade off reduced accuracy for improved real-time performance. We configure the optimized sorting process carefully to ensure that the reduction in accuracy stays within a reasonable level.
In the DVL actor, the input data in a given signal frame is sorted to select high and low voltage thresholds. These thresholds are then used to find the high-to-low and low-to-high signal transitions in the given frame. We randomly select a subset of the data samples in each data frame to sort. The size \(S_{\text{DVL}}\) of this subset is determined as
$$ S_{\text{DVL}} = \text{power}(k_{\text{DVL}} \times \text{ceil}(W_{s} / N_{\text{trans}})), \tag{2} $$
where \(k_{\text{DVL}}\) is a positive integer parameter, \(\text{ceil}(x)\) gives the smallest integer that is greater than or equal to the real-valued argument \(x\), \(\text{power}(y)\) gives the smallest power of two that is greater than or equal to the integer argument \(y\), and \(N_{\text{trans}}\) is the number of signal transitions that were detected in the previous frame. In other words, \(S_{\text{DVL}} / W_s\) gives the fraction of available samples that are used in the sorting process.
For example, suppose that \(k_{\text{DVL}} = 4\), \(W_s = 65{,}536\), and \(N_{\text{trans}} = 135\); then:
$$ \begin{aligned} S_{\text{DVL}} &= \text{power}(4 \times \text{ceil}(65536 / 135)) \\ &= \text{power}(4 \times 486) = 2^{11} = 2048. \end{aligned} \tag{3} $$
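The computation in Eq. (2) can be expressed directly in C as follows. The function names next_pow2 and subset_size_dvl are ours (illustrative only), and the integer expression (ws + n_trans - 1) / n_trans implements \(\text{ceil}(W_s / N_{\text{trans}})\) for positive integers.

```c
#include <stdint.h>

/* Smallest power of two >= y, i.e., the power() operation in Eq. (2);
 * assumes 0 < y <= 2^31. */
static uint32_t next_pow2(uint32_t y)
{
    uint32_t p = 1;
    while (p < y)
        p <<= 1;
    return p;
}

/* Size S_DVL of the randomly selected subset of samples sorted in the
 * DVL actor, per Eq. (2). k_dvl is a positive integer parameter (we use
 * 4), ws is the window size, and n_trans is the number of transitions
 * detected in the previous frame (assumed > 0). */
uint32_t subset_size_dvl(uint32_t k_dvl, uint32_t ws, uint32_t n_trans)
{
    uint32_t c = (ws + n_trans - 1) / n_trans;   /* ceil(ws / n_trans) */
    return next_pow2(k_dvl * c);
}
```

With the values from the example above, subset_size_dvl(4, 65536, 135) evaluates to 2048, matching Eq. (3).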
In each firing of the RE actor, a sorting operation is performed as part of the process for deriving a rough clock period estimate. In each signal frame, the differences in pairs of neighboring transition times are sorted, and the 25th percentile of the sorted transition time differences is taken as the rough estimate.
Here, we use a threshold \(C_{\text{RE}}\) to determine the size \(S_{\text{RE}}\) of the subset (of all transition time differences) that is sorted. If \(N_{\text{trans}} > C_{\text{RE}}\), then \(S_{\text{RE}}\) is set to \(C_{\text{RE}}\) for the current frame; otherwise, \(S_{\text{RE}}\) is set to \(N_{\text{trans}}\).
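A simplified sketch of this rough-estimation step is shown below. Here, sort_ascending() stands in for the (bitonic) sorting routine, and the indexing diffs[s_re / 4] is an illustrative way of taking the 25th percentile; the exact index conventions in our implementation may differ slightly.

```c
#include <stddef.h>

#define C_RE 1024   /* cap on the number of transition-time differences sorted */

extern void sort_ascending(double *data, size_t n);   /* e.g., bitonic sort */

/* Rough clock period estimate in the RE actor: sort a capped subset of
 * the neighboring transition-time differences in diffs and take the
 * 25th percentile of the sorted values as the rough estimate. */
double rough_period_estimate(double *diffs, size_t n_trans)
{
    size_t s_re = (n_trans > C_RE) ? C_RE : n_trans;   /* S_RE = min(N_trans, C_RE) */
    sort_ascending(diffs, s_re);
    return diffs[s_re / 4];   /* 25th percentile (illustrative indexing) */
}
```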
In our experiments, we use \(k_{\text{DVL}} = 4\) and \(C_{\text{RE}} = 1{,}024\). Through experimentation, we have determined these values to provide improvements in sorting efficiency without significantly degrading jitter measurement accuracy.
5.3 Throughput optimization
In this section, we focus on further methods that we have applied to optimize the throughput of computationally intensive actors in the proposed deep jitter measurement system. As discussed previously, we targeted our implementation to a hybrid CPU-GPU platform with C and OpenCL as the actor implementation languages for CPU- and GPU-based mapping, respectively.
All of the computationally intensive actors in our jitter measurement system employ GPU acceleration. Specifically, the following actors employ GPU kernels: DAT, DVL, STR, FSM, TRT, RE, RRE, LFT, and TSD. However, some GPU-mapped operations are not fully parallelized [16]. In particular, sorting, prefix sum, and reduction operations significantly limit the performance of several actors. Both the DVL and RE actors involve sorting, the TRT actor includes prefix sum computation, and the RRE and LFT actors include reduction operations.
For the RE and DVL actors, we described in Section 5.2 how we employed approximate computing techniques that trade off an acceptable decrease in accuracy for improved execution time. In addition to these techniques, we employ dynamic configuration of the vectorization degree to further improve processing efficiency.
By the vectorization degree of a kernel, we mean the number of data-parallel instances of the kernel that are launched simultaneously. In OpenCL terminology, the vectorization degree is commonly referred to as the number of global work items. Careful optimization of vectorization degrees can have major performance benefits for GPU acceleration of dataflow graphs [19].
For the sorting operation within the RE actor, an efficient value for the vectorization degree is \(S_{\text{RE}}\). However, as discussed in Section 5.2, the value of \(S_{\text{RE}}\) is determined dynamically. Thus, in our implementation, the vectorization degree of the sorting kernel K is adapted at run-time. After the number of transitions \(N_{\text{trans}}\) is computed on the GPU, its value is communicated to the CPU, which then uses it to configure the vectorization degree of K before executing the kernel. The performance benefit of dynamically optimizing the vectorization degree significantly outweighs the overhead of communicating \(N_{\text{trans}}\) from the GPU to the CPU.
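The host-side pattern is sketched below using standard OpenCL API calls. The buffer, kernel, and variable names are illustrative, and error checking is omitted; the only GPU-to-CPU transfer needed before the launch is the blocking read of \(N_{\text{trans}}\).

```c
#include <CL/cl.h>

/* Illustrative host-side sketch: read back N_trans from the GPU, derive
 * the vectorization degree (global work size) for the sorting kernel,
 * and launch the kernel with that degree. The queue, buffers, and
 * kernel are assumed to be created elsewhere. */
void launch_sort_kernel(cl_command_queue queue, cl_mem ntrans_buf,
                        cl_kernel sort_kernel, cl_mem diffs_buf)
{
    cl_uint n_trans;
    /* Blocking read: transfer N_trans from the GPU to the CPU. */
    clEnqueueReadBuffer(queue, ntrans_buf, CL_TRUE, 0,
                        sizeof(cl_uint), &n_trans, 0, NULL, NULL);

    const cl_uint c_re = 1024;
    size_t global_size = (n_trans > c_re) ? c_re : n_trans;   /* S_RE */

    clSetKernelArg(sort_kernel, 0, sizeof(cl_mem), &diffs_buf);
    clSetKernelArg(sort_kernel, 1, sizeof(cl_uint), &n_trans);

    /* Vectorization degree = number of global work items = S_RE. */
    clEnqueueNDRangeKernel(queue, sort_kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, NULL);
}
```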
The prefix sum operation in the TRT actor and the reduction operations in the RRE, LFT, and TSD actors also represent performance bottlenecks. For these actors, we optimize the prefix sum and reduction implementations in a number of ways. First, we perform interleaved addressing so that active work-items have consecutive indices (IDs). We also implement sequential addressing for memory read and write operations on the GPU to avoid shared memory bank conflicts. Furthermore, we apply loop unrolling (e.g., see [20–22]) for further performance improvement.
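As an illustration of the sequential-addressing pattern (cf. [20–22]), a minimal OpenCL C reduction kernel is sketched below; the kernel name and arguments are ours, not from the reference implementation. At each step, the active work-items have consecutive IDs and access consecutive local-memory locations, which avoids bank conflicts; a second pass (or a short host-side loop) combines the per-work-group partial sums.

```c
/* Illustrative OpenCL C reduction kernel using sequential addressing.
 * Each work-group reduces one tile of the input into a single partial
 * sum; scratch is local memory sized to the work-group. */
__kernel void reduce_sum(__global const float *in,
                         __global float *partial_sums,
                         __local float *scratch,
                         const unsigned int n)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    scratch[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Sequential addressing: halve the stride each step, so active
     * work-items (lid < s) are contiguous and bank-conflict free. */
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partial_sums[get_group_id(0)] = scratch[0];
}
```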