Research Article
A Real-Time Semiautonomous Audio Panning System for Music Mixing
EURASIP Journal on Advances in Signal Processing

A real-time semiautonomous stereo panning system for music mixing has been implemented. The system uses spectral decomposition, constraint rules, and cross-adaptive algorithms to perform real-time placement of sources in a stereo mix. A subjective evaluation test was devised to evaluate its quality against human panning. It was shown that the automatic panning technique performed better than a nonexpert and showed no significant statistical difference to the performance of a professional mixing engineer.


Introduction
Stereo panning aims to transform a set of monaural signals into a two-channel signal in a pseudostereo field [1]. Many methods and panning ratios have been proposed, the most common one being the sine-cosine panning law [2,3]. In stereo panning, the ratio at which a source's power is spread between the left and the right channels determines the perceived position of that source. Over the years the use of panning on music sources has evolved, and some common practices can now be identified. Now that recallable digital systems have become common, it is possible to develop intelligent expert systems capable of aiding the work of the sound engineer. An expert panning system should be capable of creating a satisfactory stereo mix of multichannel audio by using blind signal analysis, without relying on knowledge of original source locations or other visual or contextual aids.
This paper extends the work first presented in [4]. It presents an expert system capable of blindly characterizing multitrack inputs and semiautonomously panning sources with panning results comparable to a human mixing engineer. This was achieved by taking into account technical constraints and common practices for panning, while minimizing human input. Two different approaches are described and subjective evaluation demonstrates that the semi-autonomous panner has equivalent performance to that of a professional mixing engineer.

Panning Constraints and Common Practices
In practice, the placement of sound sources is achieved using a combination of creative choices and technical constraints based on human perception of source localization. It is not the purpose of this paper to emulate the more artistic and subjective decisions in source placement. Rather, we seek to embed the common practices and technical constraints into an algorithm which automatically places sound sources. The idea behind developing an expert semi-autonomous panning machine is to use well-established common rules to devise the spatial positioning of a signal.
(1) When the human expert begins to mix, he or she tends to start from a monaural, all-centered position and gradually moves the pan pots [5]. During this process, all audio signals are running through the mixer at all times. In other words, source placement is performed in real time based on accumulated knowledge of the sound sources and the resultant mix, and there is no interruption to the signal path during the panning process.
(2) Panning is not the result of individual channel decisions; it is the result of an interaction between channels. The audio engineer takes into account the content of all channels, and the interaction between them, in order to devise the correct panning position of every individual channel [6].
(3) The sound engineer attempts to maintain balance across the stereo field [7,8]. This helps keep the overall energy of the mix evenly split over the stereo speakers and maximizes the dynamic use of the stereo channels.
(4) In order to minimize spectral masking, channels with similar spectral content are placed apart from each other [6,9]. This results in a mix where individual sources can be clearly distinguished, and this also helps when the listener uses the movement of his or her head to interpret spatial cues.
(5) Hard panning is uncommon [10]. It has been established that a panning ratio of 8 to 12 dB is more than enough to achieve a full left or full right image [11]. For this reason, the width of the panning positions is restricted.
(6) Low-frequency content should not be panned. There are two main reasons for this. First, it ensures that the low-frequency content remains evenly distributed across speakers [12]. This minimizes audible distortions that may occur in the high-power reproduction of low frequencies. Second, the position of a low-frequency source is often psychoacoustically imperceptible. In general, we cannot correctly localize frequencies lower than 200 Hz [13]. It is thought that this is because the use of Inter-Time Difference as a perceptual cue for localization of low-frequency sources is highly dependent on room acoustics and loudspeaker placement, and Inter-Level Differences are not a useful perceptual cue at low frequencies since the head only provides significant attenuation of high-frequency sources [14].
(7) High-priority sources tend to be kept towards the centre, while lower priority sources are more likely to be panned [7,8]. For instance, the vocalist in a modern pop or rock group (often the lead performer) would often not be panned. This relates to the idea of matching a physical stage setup to the relative positions of the sources.

Implementation
Cross-Adaptive Implementation.
The automatic panner is implemented as a cross-adaptive effect, where the output of each channel is determined from analysis of all input channels [15]. For applications that require real-time signal processing, the signal analysis and feature extraction have been implemented using side chain processing, as depicted in Figure 1. The audio signal flow remains real-time while the required analysis of the input signals is performed in separate instances. The signal analysis involves accumulating a weighted time average of extracted features. Accumulation allows us to quickly converge on an appropriate panning position in the case of a stationary signal or smoothly adjust the panning position as necessary in the case of changing signals. Once the feature extraction within the analysis side chain is completed, the features from each channel are analyzed in order to determine new panning positions for each channel. Control signals are sent to the signal processing side in order to trigger the desired panning commands.
[Figure 1 block labels: side chain processing (feature extraction, accumulation, cross-adaptive feature processing) and signal processing (digital audio processing units).]
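The accumulation step can be illustrated with a minimal sketch. The text does not specify how the time average is weighted, so the exponential weighting, the `alpha` parameter, and the class name `FeatureAccumulator` below are assumptions made purely for illustration.

```python
class FeatureAccumulator:
    """Running weighted time average of one extracted feature.

    Assumption: exponential weighting with smoothing factor `alpha`.
    The paper only states that a weighted time average is accumulated.
    """

    def __init__(self, alpha=0.1):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None  # no feature observed yet

    def update(self, feature):
        """Fold a new feature reading into the running average."""
        if self.value is None:
            # The first reading initializes the average directly.
            self.value = float(feature)
        else:
            # Move a fraction `alpha` of the way towards the new reading.
            self.value += self.alpha * (feature - self.value)
        return self.value
```

With a stationary input the average stays at the input value, and with a changing input it moves smoothly towards the new value, which is the behavior the accumulation stage is described as providing.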

Adaptive Gating.
Because noise on an input microphone channel may trigger undesired readings, the input signals are gated. The threshold of the gate is determined in an adaptive manner. By noise we refer not only to random ambient noise but also to interference due to nearby sources, such as the sound from adjacent instruments that are not meant to be input to a given channel.
Adaptive gating is used to ensure that features are extracted from a channel only when the intended signal is present and significantly stronger than the noise sources. The gating method is based on a method implemented in [16,17] by Dugan. A reference microphone may be placed outside of the usable source microphone area to capture a signal representative of the undesired ambient and interference noise. The reference microphone signal is used to derive an adaptive threshold by opening the gate only if the input signal magnitude is greater than the magnitude of the reference microphone signal. Therefore, the input signal is only passed to the side processing chain when its level exceeds that of the reference microphone signal.
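The gating rule described above can be sketched as follows. Measuring magnitude as the block-wise absolute peak, and the function names `gate_open` and `gated_blocks`, are assumptions for illustration; the text only states that the gate opens when the input magnitude exceeds that of the reference microphone.

```python
def block_peak(block):
    """Absolute peak of a block of samples (0.0 for an empty block)."""
    return max((abs(sample) for sample in block), default=0.0)


def gate_open(input_block, reference_block):
    """Open the gate only when the input is louder than the reference.

    Assumption: level is measured as the absolute peak of each block.
    """
    return block_peak(input_block) > block_peak(reference_block)


def gated_blocks(input_blocks, reference_blocks):
    """Yield only the input blocks that should reach the analysis side chain."""
    for input_block, reference_block in zip(input_blocks, reference_blocks):
        if gate_open(input_block, reference_block):
            yield input_block
```

Because the threshold is taken from the reference signal block by block, it follows the ambient and interference level instead of being fixed in advance.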

Filter Bank Implementation.
The implementation uses a filter bank to perform spectral decomposition of each individual channel. The filter bank does not affect the audio path since it is only used in the analysis section of the algorithm. It was chosen as opposed to other methods of classifying the dominant frequency or frequency range of a signal [18] because it does not require Fourier analysis, and hence is more amenable to a real-time implementation.
For the purpose of finding the optimal spectral decomposition for performing automatic panning, two different eight-band filter banks were designed and tested. The first was a bandpass filter bank with a quasiflat frequency response, which for the purposes of this paper we will call filter bank type A (Figure 2). The second was a filter bank based on lowpass filter decomposition, which we will call filter bank type B (Figure 3). In order to provide an adaptive frequency resolution for each filter bank, the total number of filters, K, is equal to the number of input channels that are meant to be panned. The individual gains of each filter were optimized to achieve a quasiflat frequency response.

Determination of Dominant Frequency Range.
Once the filter bank has been designed, the algorithm uses the band-limited signal at each filter's output to obtain the absolute peak amplitude of each band, measured within a 100 ms window. By comparing these peak amplitudes, the filter with the highest peak is found. An accumulated score is maintained for the number of occurrences of the highest peak in each filter contained within the filter bank. This results in a classifier that determines the dominant filter band for an input channel from the highest accumulated score. The block diagram of the filter bank analysis algorithm is provided in Figure 4. It should be noted that this approach uses digital logic operations of comparison and addition only, which makes it highly attractive for an efficient digital implementation.
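A minimal sketch of this peak-counting classifier is given below. It assumes the filter bank outputs are already available as one block of samples per band for each 100 ms window; the function names and the tie-breaking rule (lowest band index wins) are illustrative assumptions.

```python
def update_scores(scores, band_blocks):
    """Credit the band with the highest absolute peak in this window.

    scores      -- list of accumulated counts, one per filter band
    band_blocks -- list of sample blocks, one per filter band, all
                   covering the same analysis window (100 ms in the text)
    Returns the index of the winning band.
    """
    peaks = [max((abs(s) for s in block), default=0.0) for block in band_blocks]
    # max() returns the first maximal index, so ties go to the lowest band.
    winner = max(range(len(peaks)), key=peaks.__getitem__)
    scores[winner] += 1
    return winner


def dominant_band(scores):
    """Band whose accumulated score is highest."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Only comparisons and additions are involved, in line with the remark that the approach suits an efficient digital implementation.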

Cross-Adaptive Mapping.
Now that each input channel has been analyzed and associated with a filter, it remains to define a mapping which results in the panning position of each output channel. The rules which drive this cross-adaptive mapping are as follows.
The first rule implements the constraint that low-frequency sources should not be panned. Thus, all sources whose accumulated energy is contained in a filter with an upper cutoff frequency below 200 Hz are not panned and remain centered at all times.
The second rule decides the available panning step of each source. This is a positioning rule which uses equidistant spacing of all sources with the same dominant frequency range. Initially, all sources are placed in the centre. The available panning steps are calculated for every accumulated filter, k, where k ranges from 1 to K, based on the number of sources, N_k, whose dominant frequency range resides in that filter. Due to this channel dependency, the algorithm will update itself every time a new input is detected in a new channel, or if the spectral content of an input channel changes drastically over time.
For filters which have not reached maximum accumulation there is no need to calculate the panning step, which makes the algorithm less computationally expensive. If only one source resides in a given kth filter (N_k = 1), the system places the input at the center.
Equation (1) gives the panning space location of the ith source residing in the kth filter, where P(i, k) is the ith available panning step in the kth filter, i ranges from 1 to N_k, and P(i, k) = 0.5 corresponds to a center panning position.
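A piecewise rule consistent with the properties stated in the text (a single source stays at 0.5, sources are equidistant, the extremes are 0 and 1, and successive sources move outwards in pairs) can be sketched as follows. The exact form below is a reconstruction and should be read as an assumption, not as the authors' expression; the function name `panning_step` is hypothetical.

```python
def panning_step(i, n_k):
    """Panning step P(i, k) for the ith of n_k sources sharing a filter.

    Assumed reconstruction:
      P = 1/2                              if n_k == 1
      P = (n_k - i - 1) / (2 (n_k - 1))    if i + n_k is odd
      P = 1 - (n_k - i) / (2 (n_k - 1))    if i + n_k is even
    """
    if not 1 <= i <= n_k:
        raise ValueError("i must satisfy 1 <= i <= n_k")
    if n_k == 1:
        return 0.5
    if (i + n_k) % 2 == 1:
        return (n_k - i - 1) / (2 * (n_k - 1))
    return 1 - (n_k - i) / (2 * (n_k - 1))


# Steps for three and four sources sharing one filter.
three = [panning_step(i, 3) for i in (1, 2, 3)]
four = [panning_step(i, 4) for i in (1, 2, 3, 4)]
```

Under this reconstruction, three sources receive the steps 0.5, 0, and 1, and four sources receive 1/3, 2/3, 0, and 1, which matches the positions reported for the sinusoidal test signals in Figure 6.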
Using this equation, if N_k is odd, the first source, i = 1, is positioned at the center. When N_k is even, the first two sources, i = 1, 2, are positioned either side of the center. In both cases, the next two sources are positioned either side of the previous sources, and so on, such that sources are positioned further away from the centre as i increases. The extreme panning positions are 0 and 1.
However, as mentioned, hard panning is generally not preferred. Our current implementation therefore provides a panning width control, P_W, used to narrow or widen the panning range. The panning width has a valid range from 0 to 0.5, where 0 equates to the widest panning possible and 0.5 equates to no panning. In our current implementation, it defaults to P_W = 0.059. The P_W value is subtracted from all panning positions greater than 0.5 and added to all panning positions smaller than 0.5. In order to prevent sources originally panned left from crossing to the right, or sources originally panned right from crossing to the left, the panning width algorithm ensures that sources in such cases default to the centre position.
Equation (1) provides the panning position for each of the sources with dominant spectral content residing in the kth filter, but it does not say how those sources are ordered from source i = 1 to i = N_k. The common practices mentioned earlier would suggest that certain sources, such as lead vocals, would be less likely to be panned to extremes than others, such as incidental percussion. However, the current implementation of our automatic panner does not have access to such information. Thus, the authors have proposed to use a priority-driven system in which the user can label the channels according to importance. In this sense, it is a semiblind automatic system. All sources are therefore ordered from the highest to the lowest priority. For the N_k sources residing in the kth filter, the first panning step is taken by the highest priority source, the second panning step by the next highest priority source, and so on. The block diagram containing the constrained decision control rule stage of the algorithm is presented in Figure 5.
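The panning width control and the priority ordering can be sketched together. The function names, the dictionary-based channel labels, and the convention that rank 1 is the highest priority are assumptions for illustration; the panning steps are passed in already ordered from i = 1 to i = N_k.

```python
def apply_panning_width(position, width=0.059):
    """Narrow a panning position towards the centre by `width`.

    Positions above 0.5 are reduced and positions below 0.5 are increased;
    a source that would cross the centre is placed at 0.5 instead.
    """
    if not 0.0 <= width <= 0.5:
        raise ValueError("width must be in [0, 0.5]")
    if position > 0.5:
        return max(0.5, position - width)
    if position < 0.5:
        return min(0.5, position + width)
    return 0.5


def assign_positions(priorities, steps, width=0.059):
    """Give the ith panning step to the ith highest-priority channel.

    priorities -- dict mapping channel name to rank (assumed: 1 = highest)
    steps      -- panning steps ordered from i = 1 to i = N_k
    """
    if len(priorities) != len(steps):
        raise ValueError("one panning step is needed per channel")
    ordered = sorted(priorities, key=priorities.get)
    return {
        channel: apply_panning_width(step, width)
        for channel, step in zip(ordered, steps)
    }
```

For example, with steps 0.5, 0, and 1 and a width of 0.1, the highest-priority channel stays at the centre while the other two are placed at roughly 0.1 and 0.9.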

The Panning Processing.
Once the appropriate panning position was determined, a sine-cosine panning law [2] was used to place sources in the final sound mix. An interpolation algorithm has been coded into the panner to avoid rapid changes of signal level. The interpolator has a 22 ms fade-in and fade-out, which ensures a smooth, natural transition when the panning control step is changed.
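A minimal sketch of a sine-cosine panning law with a position ramp is shown below. It assumes that a position of 0 is full left, 0.5 is centre, and 1 is full right, that the 22 ms transition is a linear ramp of the panning position, and a sample rate of 44.1 kHz; none of these details are specified in the text, and the function names are hypothetical.

```python
import math


def sine_cosine_gains(position):
    """Left and right gains for a panning position in [0, 1].

    Assumption: 0 is full left, 0.5 is centre, 1 is full right.
    """
    angle = position * math.pi / 2
    return math.cos(angle), math.sin(angle)


def interpolated_positions(old, new, sample_rate=44100, fade_ms=22.0):
    """Per-sample positions ramping linearly from `old` to `new`.

    Assumption: the 22 ms transition is a linear ramp of the position.
    """
    n = max(1, round(sample_rate * fade_ms / 1000))
    return [old + (new - old) * (k + 1) / n for k in range(n)]


def pan_block(samples, positions):
    """Apply per-sample panning gains to a mono block."""
    left, right = [], []
    for sample, position in zip(samples, positions):
        gain_left, gain_right = sine_cosine_gains(position)
        left.append(gain_left * sample)
        right.append(gain_right * sample)
    return left, right
```

Because the squared gains sum to one for every position, the total power of a source is unchanged as it moves across the stereo field.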
In Figure 6, the result of down-mixing 12 sinusoidal test signals through the automatic panner is shown. It can be seen that both f_1 and f_12 are kept centered and added together because their spectral content is below 200 Hz. The three sinusoids with a frequency of 5 kHz have been evenly spread: f_2 has been allocated to the center due to priority, while f_4 has been sent to the left and f_6 has been sent to the right, in accordance with (1). Because there is no other signal with the same spectral content as f_11, it has been assigned to the center. The four sinusoids with a spectral content of 15 kHz have been evenly spread. Because of priority, f_3 has been assigned a value of 0.33, f_7 has been assigned a value of 0.66, f_9 has been assigned all the way to the left, and f_10 has been assigned all the way to the right, in accordance with (1). Finally, the two sinusoids with a spectral content of 20 kHz have been panned to opposite sides. All results are in accordance with the constraint rules proposed for cross-adaptive mapping.

Test Design.
In order to evaluate the performance of the semiautonomous panner algorithm against human performance, a double blind test was designed. Both autopanning algorithms were tested: the bandpass filter classifier, known as algorithm type A, and the low-pass classifier, known as algorithm type B. Algorithms were randomly tested in a double blind fashion. The control group consisted of three professional human experts and one nonexpert, who had never panned music before. The test material consisted of 12 multitrack songs of different styles of music. Stereo sources were used in the form of two separate monotracks. Where acoustic drums were used, they were recorded with multiple microphones and then premixed down into a stereo mix. Humans and algorithms used the same stereo drum and keyboard tracks as separate left and right monofiles. All 12 songs were panned by the expert human mixers and by the nonexpert human mixer. They were asked to pan the song while listening for the first time. They had the length of the song to determine their definitive panning positions. The same songs were passed through algorithms A and B only once for the entire length of the song. Although the goal was to give the human and machine mixers as close to the same information as possible, human mixers had the advantage of knowing the type of instrument on each track. Therefore, they assigned priority according to this prior knowledge. To compensate for this, a similar priority scheme was chosen for the algorithms. Both A and B algorithms used the same priority scheme. The mixes used during the test contain music freely available under Creative Commons copyright and can be found in [19].
As shown in Figure 7, the test used two questions to measure the perceived overall quality of the panning for each audio comparison. For the first question, "how different is the panning of A compared to B?", a continuous slider with extremities marked "exactly the same" and "completely different" was used. The answer obtained in this question was used as a weighting factor in order to decide the validity of the next question. The second question, "which file, A or B, has better panning?", used a continuous slider with extremes marked "A quality is ideal" and "B quality is ideal". For both of these questions, no visible scale was added in order not to influence the subjects' decisions. The test subjects were also provided with a comment box in which to justify their answers to the previous two questions. During the test it was observed that expert subjects tend to use the name of the instrument to influence their panning decisions. In other words, they would look for the "bass" label to make sure that they kept it centered. This was an encouraging sign that panning amongst professionals follows constraint rules similar to those that were implemented in the algorithms.
The tested population consisted of 20 professional sound mixing engineers, with an average of 6 years of experience in the professional audio market. The tests were performed over headphones, and both the human mixers and the test subjects used exactly the same headphones. The test lasted 82 minutes on average.
Double blind A/B testing was used with all possible permutations of algorithm A, algorithm B, expert and amateur panning. Each tested user answered a total of 32 questions, two of which were control questions, in order to test the subject's ability to identify stereo panning. The first control question consisted of asking the test subjects to rate their preference between a stereo and a monaural signal. During the initial briefing it was stressed to the test subject that stereo is not necessarily better than monaural audio. The second control question compared two stereo signals that had been panned in exactly the same manner.

Result Analysis.
All resulting permutations were classified into the following categories: monaural versus stereo, same stereo versus same stereo file, method A versus method B, method A versus nonexpert mix, method B versus nonexpert mix, method A versus expert mix, and method B versus expert mix.
Results obtained on the question "How different is panning A compared to B?" were used to weight the results obtained for the second question, "Which file, A or B, has better panning quality?". This weighting provides a way to discount incoherent answers such as "I find no difference between files A or B but I find the quality of B to be better compared to A".
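One simple way to realize such a weighting is to multiply each preference rating by the corresponding difference rating. The text does not state the exact weighting used, so the product rule, the slider scales (difference in [0, 1], preference in [-1, 1] with negative values favoring file A), and the function names below are all assumptions for illustration.

```python
def weighted_preference(difference, preference):
    """Scale a preference rating by how different the files were judged.

    difference -- assumed scale: 0 = "exactly the same", 1 = "completely different"
    preference -- assumed scale: -1 = "A quality is ideal", +1 = "B quality is ideal"
    """
    if not 0.0 <= difference <= 1.0:
        raise ValueError("difference must be in [0, 1]")
    if not -1.0 <= preference <= 1.0:
        raise ValueError("preference must be in [-1, 1]")
    return difference * preference


def mean_weighted_preference(answers):
    """Average weighted preference over (difference, preference) pairs."""
    if not answers:
        raise ValueError("at least one answer is required")
    weighted = [weighted_preference(d, q) for d, q in answers]
    return sum(weighted) / len(weighted)
```

Under this rule, an answer that reports no audible difference contributes nothing to the preference score, whatever quality rating accompanies it.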
Answers to the first control question showed that, with at least 95% confidence, the test subjects strongly preferred stereo to monaural mixes. The second control question also confirmed, with at least 95% confidence, that professional audio engineers find no significant difference when asked to compare two identical stereo tracks. The results are summarized in Table 1, and the evaluation results with 95% confidence intervals are depicted in Figure 8.
The remaining tests compared the two panning techniques against each other and against expert and nonexpert mixes. The tested audio engineers preferred the expert mixes to panning method A, but this result could only be given with 80% confidence. On average, non-expert mixes also were preferred to panning method A, but this result could not be considered significant, even with 80% confidence.
In contrast, panning method B was preferred over nonexpert mixes with over 90% confidence. With at least 95% confidence, we can also state that method B was preferred over method A. Yet when method B is compared against expert mixes, there is no significant difference.
The preference for panning method B implies that low-pass spectral decomposition is preferred over band-pass spectral decomposition as a means of signal classification for the purpose of semi-autonomous panning. Furthermore, the lack of any statistical difference between panning method B and expert mixes (in contrast to the significant preference for method B over non-expert mixes, and for expert mixes over method A) leads us to conclude that the semi-autonomous panning method B performs roughly equivalently to an expert mixing engineer.
It was found that the band-pass filter bank, method A, tended to assign input channels to fewer filters than the low-pass filter bank, method B. The distribution of input tracks among filters for an 8-channel song for both methods is depicted in Figure 9. In effect, panning method B is more discriminating as to whether two inputs have overlapping spectral content and hence is less likely to unnecessarily place sources far from each other. This may account for the preference for panning method B over panning method A.
Figure 8: Summarized results for the subjective evaluation. The first two tests were references (comparing stereo against monaural, and comparing identical files), and the remaining questions compared the two proposed auto-panning methods against each other and against expert and non-expert mixes. 95% confidence intervals are provided.
The subjects justified their answers in accordance with the common practices mentioned previously. They relied heavily on manual instrument recognition to determine the appropriate position of each channel. It was also found that any violation of common practice, such as panning the lead vocals, would result in a significantly low measure of panning quality. One of the most interesting findings was that spatial balance seemed to be not only a significant cue used to determine panning quality but was also a distinguishing factor between expert and non-expert mixes. Non-expert mixes were often weighted to one side, whereas almost universally, expert mixes had the average source position in the centre. Both panning methods A and B were devised to perform optimal left to right balancing. Histograms of source positions which demonstrate these behaviors are depicted in Figure 10.
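The notion of a mix being weighted to one side can be made concrete with a small sketch that measures how far the average source position lies from the centre. The unweighted mean, the convention that 0 is left and 1 is right, and the function name `balance_offset` are assumptions for illustration.

```python
def balance_offset(positions):
    """Average source position minus the centre position 0.5.

    Assumption: positions lie in [0, 1] with 0 = left and 1 = right, so a
    negative result means the mix leans left and a positive one leans right.
    """
    if not positions:
        raise ValueError("at least one source position is required")
    return sum(positions) / len(positions) - 0.5
```

A symmetric set of positions such as 0, 0.5, and 1 gives an offset of zero, while a set clustered on one side gives a clearly non-zero offset.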

Conclusions and Future Work
In terms of generating blind stereo panning up-mixes with minimal human interaction, we can conclude that it is possible to generate intelligent expert systems capable of performing better than a non-expert human while having no statistical difference when compared to a human expert. According to the subjective evaluation, low-pass filter-bank accumulative spectral decomposition features seem to perform significantly better than band-pass decompositions. More sophisticated forms of performing source priority identification in an unaided manner need to be investigated. To further automate the panning technique, instrument identification and other feature extraction techniques could be employed to identify those channels with high priority. Better methods of eliminating microphone cross-talk noise need to be researched. Furthermore, in live sound situations, the sound engineer would have visual cues to aid in panning. For instance, the relative positions of the instruments on stage are often used to map sound sources in the stereo field. Video analysis techniques could be used to incorporate this into the panning constraints. Future subjective tests should include visual cues and be performed in real sound reinforcement conditions. Finally, the work presented herein was restricted to stereo panning. In a two- or three-dimensional sound field, there are more degrees of freedom, but rules still apply. For instance, low-frequency sources are often placed towards the ground, while high-frequency sources are often placed above, corresponding to the fact that high-frequency sources emitted near or below the ground would be heavily attenuated [20]. It is the intent of the authors to extend this work to automatic placement of sound sources in a multispeaker, spatial audio environment.