A Real-Time Semiautonomous Audio Panning System for Music Mixing
© E. Perez Gonzalez and J. D. Reiss. 2010
Received: 26 January 2010
Accepted: 23 April 2010
Published: 26 May 2010
A real-time semiautonomous stereo panning system for music mixing has been implemented. The system uses spectral decomposition, constraint rules, and cross-adaptive algorithms to perform real-time placement of sources in a stereo mix. A subjective evaluation test was devised to evaluate its quality against human panning. It was shown that the automatic panning technique performed better than a nonexpert and showed no significant statistical difference to the performance of a professional mixing engineer.
Stereo panning aims to transform a set of monaural signals into a two-channel signal in a pseudostereo field. Many methods and panning ratios have been proposed, the most common being the sine-cosine panning law [2, 3]. In stereo panning, the position of a source is determined by the ratio at which its power is spread between the left and right channels. Over the years the use of panning on music sources has evolved, and some common practices can now be identified.
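As a concrete illustration of the sine-cosine panning law mentioned above, the following minimal sketch splits a mono sample across two channels so that total power stays constant at every pan position (the function name and the 0-to-1 position convention are our own):

```python
import math

def constant_power_pan(sample, position):
    """Split a mono sample across left/right channels using the
    sine-cosine (constant-power) panning law.

    position: 0.0 = full left, 0.5 = centre, 1.0 = full right.
    """
    theta = position * math.pi / 2   # map position onto [0, pi/2]
    left = sample * math.cos(theta)  # full left when theta = 0
    right = sample * math.sin(theta) # full right when theta = pi/2
    return left, right
```

At the centre position each channel receives a gain of cos(π/4) ≈ 0.707 (−3 dB), so the summed power of the two channels equals the power of the source for any position.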
Now that recallable digital systems have become common, it is possible to develop intelligent expert systems capable of aiding the work of the sound engineer. An expert panning system should be capable of creating a satisfactory stereo mix of multichannel audio by using blind signal analysis, without relying on knowledge of original source locations or other visual or contextual aids.
This paper extends the work first presented in . It presents an expert system capable of blindly characterizing multitrack inputs and semiautonomously panning sources with panning results comparable to a human mixing engineer. This was achieved by taking into account technical constraints and common practices for panning, while minimizing human input. Two different approaches are described and subjective evaluation demonstrates that the semi-autonomous panner has equivalent performance to that of a professional mixing engineer.
2. Panning Constraints and Common Practices
When the human expert begins to mix, he or she tends to start from a monaural, all-centered position, and gradually moves the pan pots. During this process, all audio signals are running through the mixer at all times. In other words, source placement is performed in real time based on accumulated knowledge of the sound sources and the resultant mix, and there is no interruption to the signal path during the panning process.
Panning is not the result of individual channel decisions; it is the result of an interaction between channels. The audio engineer takes into account the content of all channels, and the interaction between them, in order to devise the correct panning position of every individual channel .
The sound engineer attempts to maintain balance across the stereo field [7, 8]. This helps maintain the overall energy of the mix evenly split over the stereo speakers and maximizes the dynamic use of the stereo channels.
In order to minimize spectral masking, channels with similar spectral content are placed apart from each other [6, 9]. This results in a mix where individual sources can be clearly distinguished, and this also helps when the listener uses the movement of his or her head to interpret spatial cues.
Hard panning is uncommon. It has been established that a panning ratio of 8 to 12 dB is more than enough to achieve a full left or full right image. For this reason, the width of the panning positions is restricted.
Low-frequency content should not be panned. There are two main reasons for this. First, it ensures that the low-frequency content remains evenly distributed across speakers. This minimizes audible distortions that may occur in the high-power reproduction of low frequencies. Second, the position of a low-frequency source is often psychoacoustically imperceptible. In general, we cannot correctly localize frequencies lower than 200 Hz. It is thought that this is because the use of the Interaural Time Difference as a perceptual cue for localization of low-frequency sources is highly dependent on room acoustics and loudspeaker placement, and Interaural Level Differences are not a useful perceptual cue at low frequencies, since the head only provides significant attenuation of high-frequency sources.
High-priority sources tend to be kept towards the centre, while lower priority sources are more likely to be panned [7, 8]. For instance, the vocalist in a modern pop or rock group (often the lead performer) would often not be panned. This relates to the idea of matching a physical stage setup to the relative positions of the sources.
3.1. Cross-Adaptive Implementation
3.2. Adaptive Gating
Because noise on an input microphone channel may trigger undesired readings, the input signals are gated. The threshold of the gate is determined in an adaptive manner. By noise we refer not only to random ambient noise but also to interference due to nearby sources, such as the sound from adjacent instruments that are not meant to be input to a given channel.
Adaptive gating is used to ensure that features are extracted from a channel only when the intended signal is present and significantly stronger than the noise sources. The gating method is based on a method implemented by Dugan [16, 17]. A reference microphone may be placed outside of the usable source microphone area to capture a signal representative of the undesired ambient and interference noise. The reference microphone signal is used to derive an adaptive threshold: the gate opens only if the input signal magnitude is greater than the magnitude of the reference microphone signal. Therefore the input signal is only passed to the side processing chain when its level exceeds that of the reference microphone signal.
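The adaptive-threshold comparison described above can be sketched as follows. This is a simplified illustration of the idea, not the paper's exact implementation: window lengths, smoothing, and any hold/hysteresis logic are assumptions left out for brevity.

```python
def peak(window):
    """Absolute peak amplitude of a block of samples."""
    return max(abs(s) for s in window)

def gate_open(input_window, reference_window):
    """Dugan-style adaptive gate: pass the input channel to the
    analysis (side) chain only when its peak level exceeds that of a
    reference microphone capturing ambient noise and bleed.
    """
    return peak(input_window) > peak(reference_window)
```

Because the threshold tracks the reference microphone rather than a fixed level, the gate adapts automatically as the ambient noise floor rises and falls.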
3.3. Filter Bank Implementation
The implementation uses a filter bank to perform spectral decomposition of each individual channel. The filter bank does not affect the audio path, since it is only used in the analysis section of the algorithm. It was chosen over other methods of classifying the dominant frequency or frequency range of a signal because it does not require Fourier analysis, and hence is more amenable to a real-time implementation.
3.4. Determination of Dominant Frequency Range
Once the filter bank has been designed, the algorithm uses the band limited signal in each filter's output to obtain the absolute peak amplitude for each filter. The peak amplitude is measured within a 100 ms window. The algorithm uses the spectral output of each filter contained within the filter bank to calculate the peak amplitude of each band. By comparing these peak amplitudes, the filter with the highest peak is found. An accumulated score is maintained for the number of occurrences of the highest peak in each filter contained within the filter bank. This results in a classifier that determines the dominant filter band for an input channel from the highest accumulated score.
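The accumulation scheme described above can be sketched as follows. The sketch assumes the per-band peak amplitudes for each 100 ms window have already been measured at the filter outputs; tie-breaking and gating are omitted for brevity.

```python
def dominant_band(frame_peaks):
    """Classify a channel's dominant frequency range.

    frame_peaks: list of frames, one per 100 ms window; each frame is
    a list of the absolute peak amplitude at each filter's output.
    Returns the index of the band that most often held the highest
    peak, i.e. the band with the highest accumulated score.
    """
    scores = [0] * len(frame_peaks[0])
    for frame in frame_peaks:
        # find which filter band held the highest peak in this window
        best = max(range(len(frame)), key=lambda i: frame[i])
        scores[best] += 1
    # the dominant band is the one with the highest accumulated score
    return max(range(len(scores)), key=lambda i: scores[i])
```

Accumulating argmax counts over many windows makes the classification robust to individual windows in which a transient briefly dominates a different band.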
3.5. Cross-Adaptive Mapping
Now that each input channel has been analyzed and associated with a filter, it remains to define a mapping which results in the panning position of each output channel. The rules which drive this cross-adaptive mapping are as follows.
The first rule implements the constraint that low-frequency sources should not be panned. Thus, all sources with accumulated energy contained in a filter with a high cutoff frequency below 200 Hz are not panned and remain centered at all times.
The second rule decides the available panning step of each source. This is a positioning rule which uses equidistant spacing of all sources sharing the same dominant frequency range. Initially, all sources are placed in the centre. The available panning steps are calculated for every accumulated filter, based on the number of sources whose dominant frequency range resides in that filter. Due to this channel dependency the algorithm updates itself every time a new input is detected on a new channel, or if the spectral content of an input channel undergoes a drastic change over time.
For filters which have not reached maximum accumulation there is no need to calculate the panning step, which makes the algorithm less computationally expensive. If only one source is assigned to a given filter, the system places that input at the center.
Using this equation, if the number of sources sharing a filter is odd, the first source is positioned at the center. When it is even, the first two sources are positioned either side of the center. In both cases, the next two sources are positioned either side of the previous sources, and so on, such that sources are positioned further away from the centre as their number increases. The extreme panning positions are 0 and 1. However, as mentioned, hard panning is generally not preferred, so our current implementation provides a panning width control, used to narrow or widen the panning range. The panning width has a valid range from 0 to 0.5, where 0 equates to the widest panning possible and 0.5 equates to no panning. In our current implementation, it defaults to . The width value is subtracted from all panning positions greater than 0.5 and added to all panning positions smaller than 0.5. In order to avoid sources originally panned left crossing to the right, or sources originally panned right crossing to the left, the panning width algorithm defaults such sources to the centre position.
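The positioning and width rules above can be sketched as follows. The exact spacing formula and source ordering are illustrative assumptions; the paper specifies only equidistant spacing, centre-first placement, and the width behaviour.

```python
def pan_positions(n_sources, width=0.25):
    """Equidistant panning positions for n sources sharing a dominant
    band: 0 = full left, 0.5 = centre, 1 = full right.

    width in [0, 0.5] narrows the stereo image: 0 gives the widest
    panning, 0.5 centres everything.
    """
    if n_sources == 1:
        return [0.5]
    # spread sources evenly between the extremes 0 and 1
    base = [i / (n_sources - 1) for i in range(n_sources)]
    out = []
    for p in base:
        if p > 0.5:
            p = max(p - width, 0.5)  # pull right-side sources inward,
        elif p < 0.5:                # never crossing to the other side
            p = min(p + width, 0.5)
        out.append(p)
    # order so that the first sources sit closest to the centre
    out.sort(key=lambda p: abs(p - 0.5))
    return out
```

For example, three sources with the default width yield one centred source and one source on each side, while a width of 0.5 collapses every source to the centre.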
3.7. The Panning Processing
An interpolation algorithm has been coded into the panner to avoid rapid changes of signal level. The interpolator has a 22 ms fade-in and fade-out, which ensures a smooth natural transition when the panning control step is changed.
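A minimal sketch of such an interpolator is shown below. The paper specifies only the 22 ms fade duration; the linear ramp shape and the per-sample output format are assumptions.

```python
def pan_ramp(old_pos, new_pos, sample_rate=44100, fade_ms=22):
    """Interpolate the pan position over a short fade (22 ms here) to
    avoid audible level jumps when the panning step changes.

    Returns the per-sample pan positions for the transition; each
    value would be fed to the panning law sample by sample.
    """
    n = int(sample_rate * fade_ms / 1000)  # fade length in samples
    return [old_pos + (new_pos - old_pos) * (i + 1) / n for i in range(n)]
```

At 44.1 kHz this produces a 970-sample ramp, short enough to be perceived as a smooth movement rather than a discrete jump.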
4.1. Test Design
In order to evaluate the performance of the semiautonomous panner algorithm against human performance, a double blind test was designed. Both autopanning algorithms were tested: the bandpass filter classifier, known as algorithm type A, and the low-pass classifier, known as algorithm type B. The algorithms were compared in randomized order in a double blind fashion.
The control group consisted of three professional human experts and one nonexpert, who had never panned music before. The test material consisted of 12 multitrack songs in different styles of music. Stereo sources were used in the form of two separate mono tracks. Where acoustic drums were used, they were recorded with multiple microphones and then premixed down into a stereo mix. Humans and algorithms used the same stereo drum and keyboard tracks as separate left and right mono files. All 12 songs were panned by the expert human mixers and by the nonexpert human mixer. They were asked to pan each song while listening to it for the first time, and had the length of the song to determine their definitive panning positions. The same songs were passed through algorithms A and B only once for the entire length of the song. Although the goal was to give the human and machine mixers as close to the same information as possible, the human mixers had the advantage of knowing which type of instrument was on each track, and therefore assigned priority according to this prior knowledge. To compensate, a similar priority scheme was provided to the algorithms, and both algorithms A and B used the same priority scheme. The mixes used during the test contain music freely available under a Creative Commons license and can be located in .
The tested population consisted of 20 professional sound mixing engineers, with an average of six years' experience in the professional audio market. The tests were performed over headphones, and both the human mixers and the test subjects used exactly the same headphones. The test lasted an average of 82 minutes.
Double blind A/B testing was used with all possible permutations of algorithm A, algorithm B, expert and amateur panning. Each tested user answered a total of 32 questions, two of which were control questions, in order to test the subject's ability to identify stereo panning. The first control question consisted of asking the test subjects to rate their preference between a stereo and a monaural signal. During the initial briefing it was stressed to the test subject that stereo is not necessarily better than monaural audio. The second control question compared two stereo signals that had been panned in exactly the same manner.
4.2. Result Analysis
All resulting permutations were classified into the following categories: monaural versus stereo, same stereo versus same stereo file, method A versus method B, method A versus nonexpert mix, method B versus nonexpert mix, method A versus expert mix, and method B versus expert mix.
Results obtained for the question "How different is panning A compared to B?" were used to weight the results obtained for the second question, "Which file, A or B, has better panning quality?". This provides a means of discounting incoherent answers such as "I find no difference between files A and B, but I find the quality of B to be better than that of A".
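One simple way to realize this weighting is sketched below. The linear product and the rating scales are illustrative assumptions, not the paper's exact statistic; the point is only that a preference reported alongside "no difference" contributes nothing.

```python
def weighted_preference(difference_rating, preference):
    """Weight a quality preference by the reported difference
    between the two files.

    difference_rating: 0.0 (no difference heard) .. 1.0 (very different)
    preference: -1.0 (prefers A) .. +1.0 (prefers B)
    Returns a weighted vote; incoherent answers (preference with zero
    reported difference) are discounted to zero.
    """
    return difference_rating * preference
```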
Double blind panning quality evaluation table (the numerical columns, including the number of comparisons, were not recovered; the row labels and notes were):
- Stereo versus Mono
- Stereo versus Stereo (subjects identified them to be the same)
- Human Expert versus Method A
- Method A versus Non-expert (no significant difference between algorithms)
- Method B versus Non-expert
- Method B versus Expert (no significant difference between algorithms)
- Method A versus Method B
The remaining tests compared the two panning techniques against each other and against the expert and nonexpert mixes. The tested audio engineers preferred the expert mixes to panning method A, but this result could only be given with 80% confidence. On average, non-expert mixes were also preferred to panning method A, but this result could not be considered significant, even with 80% confidence.
In contrast, panning method B was preferred over non-expert mixes with over 90% confidence. With at least 95% confidence, we can also state that method B was preferred over method A. Yet when method B is compared against expert mixes, there is no significant difference.
The preference for panning method B implies that low-pass spectral decomposition is preferred over band-pass spectral decomposition as a means of signal classification for the purpose of semi-autonomous panning. Furthermore, the lack of any statistical difference between panning method B and expert mixes (in contrast to the significant preference for method B over non-expert mixes, and for expert mixes over method A) leads us to conclude that the semi-autonomous panning method B performs roughly equivalently to an expert mixing engineer.
5. Conclusions and Future Work
In terms of generating blind stereo panning up-mixes with minimal human interaction, we can conclude that it is possible to create intelligent expert systems capable of performing better than a non-expert human while showing no statistical difference from a human expert. According to the subjective evaluation, low-pass filter-bank accumulative spectral decomposition features seem to perform significantly better than band-pass decompositions.
More sophisticated forms of performing source priority identification in an unaided manner need to be investigated. To further automate the panning technique, instrument identification and other feature extraction techniques could be employed to identify those channels with high priority. Better methods of eliminating microphone cross-talk noise need to be researched. Furthermore, in live sound situations, the sound engineer would have visual cues to aid in panning. For instance, the relative positions of the instruments on stage are often used to map sound sources in the stereo field. Video analysis techniques could be used to incorporate this into the panning constraints. Future subjective tests should include visual cues and be performed in real sound reinforcement conditions.
Finally, the work presented herein was restricted to stereo panning. In a two- or three-dimensional sound field, there are more degrees of freedom, but rules still apply. For instance, low-frequency sources are often placed towards the ground, while high-frequency sources are often placed above, corresponding to the fact that high-frequency sources emitted near or below the ground would be heavily attenuated. It is the intent of the authors to extend this work to automatic placement of sound sources in a multispeaker, spatial audio environment.
- Gerzon MA: Signal processing for simulating realistic stereo images. Proceedings of the 93rd Convention of the Audio Engineering Society, October 1992, San Francisco, Calif, USA.
- Anderson JL: Classic stereo imaging transforms—a review. Submitted to Computer Music Journal, http://www.dxarts.washington.edu/courses/567/08WIN/JL_Anderson_Stereo.pdf
- Griesinger D: Stereo and surround panning in practice. Proceedings of the 112th Audio Engineering Society Convention, May 2002, Munich, Germany.
- Perez Gonzalez E, Reiss J: Automatic mixing: live downmixing stereo panner. Proceedings of the 7th International Conference on Digital Audio Effects (DAFx '07), 2007, Bordeaux, France, 63-68.
- Self D, et al.: Recording consoles. In Audio Engineering: Know It All. Volume 1. 1st edition. Edited by Self D. Newnes/Elsevier, Oxford, UK; 2009:761-807.
- Neiman R: Panning for gold: tutorials. Electronic Musician Magazine 2002, http://emusician.com/tutorials/emusic_panning_gold/
- Izhaki R: Mixing domains and objectives. In Mixing Audio: Concepts, Practices and Tools. 1st edition. Focal Press/Elsevier, Burlington, Vt, USA; 2007:58-71.
- Izhaki R: Panning. In Mixing Audio: Concepts, Practices and Tools. 1st edition. Focal Press/Elsevier, Burlington, Vt, USA; 2007:184-203.
- Bartlett B, Bartlett J: Recorder-mixers and mixing consoles. In Practical Recording Techniques. 3rd edition. Focal Press/Elsevier, Oxford, UK; 2009:259-275.
- Owsinski B: Element two: panorama—placing the sound in the soundfield. In The Mixing Engineer's Handbook. 2nd edition. Mix Books, Vallejo, Calif, USA; 2006:20-24.
- Rumsey F, McCormick T: Mixers. In Sound and Recording: An Introduction. 1st edition. Focal Press/Elsevier, Oxford, UK; 2006:96-153.
- White P: The creative process: pan position. In The Sound on Sound Book of Desktop Digital Sound. 1st edition. MPG Books, UK; 2000:169-170.
- Benjamin E: An experimental verification of localization in two-channel stereo. Proceedings of the 121st Convention of the Audio Engineering Society, 2006, San Francisco, Calif, USA.
- Beament J: The direction-finding system. In How We Hear Music: The Relationship Between Music and the Hearing Mechanism. The Boydell Press, Suffolk, UK; 2001:127-130.
- Verfaille V, Zölzer U, Arfib D: Adaptive Digital Audio Effects (A-DAFx): a new class of sound transformations. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(5):1817-1831.
- Dugan DW: Automatic microphone mixing. Journal of the Audio Engineering Society 1975, 23(6):442-449.
- Dugan DW: Application of automatic mixing techniques to audio consoles. Proceedings of the 87th Convention of the Audio Engineering Society, October 1989, New York, NY, USA, 18-21.
- Sethares WA, Milne AJ, Tiedje S, Prechtl A, Plamondon J: Spectral tools for dynamic tonality and audio morphing. Computer Music Journal 2009, 33(2):71-84. doi:10.1162/comj.2009.33.2.71
- Perez Gonzalez E, Reiss J: Automatic mixing tools for audio and music production. 2010, http://www.elec.qmul.ac.uk/digitalmusic/automaticmixing/
- Gibson D, Peterson G: The Art of Mixing: A Visual Guide to Recording, Engineering and Production. 1st edition. Mix Books/ArtistPro Press, USA; 1997.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.