Research

     

Mapping Audio Concepts to Audio Tools

    We have developed a system where the user can teach the machine an audio concept (such as a "boomy" sound) that she/he has in mind in order to build a simple controller that can manipulate sound in terms of that audio concept (for example, make the sound "more boomy"), bypassing the bottleneck of complex interfaces and individual differences in descriptive terms.

    For this study, we focused on improving a reverberation tool. First, we developed our own reverberator using digital filters, mapping the parameters of the filters to measures of the reverberation effect, so that the reverberator can be controlled through common descriptors such as "reverberation time" or "spectral centroid". In the learning process, a sound is first modified by a series of reverberation settings using the reverberator. The user then listens and rates each modified sound as to how well it fits the audio concept she/he has in mind. The ratings are finally mapped onto the controls of the reverberator and a simple controller is built where the user can manipulate the degree of her/his audio concept on a sound. Several experiments conducted on human subjects showed that the system learns quickly (<3 min), predicts user responses well (mean correlation of 0.75) and meets users' expectations (average human rating of 7.4 out of 10).
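
    As a rough illustration of the last step, the sketch below fits listener ratings to reverberation measures and turns the fit into a one-slider controller. Plain least-squares regression is swapped in here as an assumption; it is not claimed to be the mapping used in the cited papers. The function names, the two measures, and the toy data are hypothetical.

```python
import numpy as np

def learn_concept_weights(settings, ratings):
    """Fit ratings as a linear function of standardized reverberation measures.

    Assumption: ordinary least squares stands in for the actual mapping.
    """
    X = np.asarray(settings, dtype=float)
    y = np.asarray(ratings, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / std                       # put all measures on one scale
    A = np.column_stack([Z, np.ones(len(Z))])  # extra column for the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], mean, std

def controller(slider, weights, mean, std):
    """Map a slider value in [-1, 1] to a setting along the learned direction."""
    direction = weights / np.linalg.norm(weights)
    return mean + slider * direction * std

# Hypothetical data: columns are reverberation time (s) and spectral centroid (Hz).
settings = [[0.5, 1000], [1.0, 2000], [1.5, 1000], [2.0, 2000], [2.5, 1500]]
ratings = [2.0, 4.0, 6.0, 8.0, 10.0]  # this listener's concept tracks reverberation time
weights, mean, std = learn_concept_weights(settings, ratings)
more = controller(+1.0, weights, mean, std)  # "more" of the concept
less = controller(-1.0, weights, mean, std)  # "less" of the concept
```

    In this toy case the slider changes the reverberation time and leaves the spectral centroid essentially untouched, because the ratings depend only on the first measure.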

    Earlier work applied the same approach to an equalizer, and a similar system has been studied for images. Future research includes combining the equalization and reverberation tools, adding new tools such as compression, developing plugins, and building synonym maps from the commonalities between different individuals' concept mappings.

[www] Andrew Todd Sabin, Zafar Rafii, and Bryan Pardo. "Weighting-Function-Based Rapid Mapping of Descriptors to Audio Processing Parameters," Journal of the Audio Engineering Society, Volume 59, Issue 6, pp. 419-430, June, 2011.

[pdf] Zafar Rafii and Bryan Pardo. "Learning to control a Reverberator using Subjective Perceptual Descriptors," 10th International Society for Music Information Retrieval Conference, Kobe, Japan, October 26-30, 2009.

[pdf] Zafar Rafii and Bryan Pardo. "A Digital Reverberator controlled through Measures of the Reverberation," Northwestern University, EECS Department Technical Report, NWU-EECS-09-08, 2009.

*This work was supported by National Science Foundation grant number IIS-0757544.

DUET using the Constant Q Transform

    The Degenerate Unmixing Estimation Technique (DUET) is a Blind Source Separation method which can separate an arbitrary number of unknown sources using a single stereo mixture. DUET builds a two-dimensional histogram from the amplitude ratio and phase difference between channels, where each peak indicates a source and the peak location corresponds to the mixing parameters associated with that source. Provided that the time-frequency bins of the sources do not overlap too much - an assumption that generally holds for speech mixtures - DUET identifies the peaks and partitions the time-frequency representation of the mixture by assigning each bin to the source with the closest mixing parameters. However, when the time-frequency bins of the sources overlap too much, as is often the case for music mixtures analyzed with the Short-Time Fourier Transform, peaks start to fuse in the 2d histogram, so that DUET cannot perform separation effectively.
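
    The histogram construction described above can be sketched as follows. This is a minimal illustration, not the implementation behind the results on this page: the symmetric attenuation (a - 1/a) and the relative delay in samples are the usual DUET parameterization and are assumed here, and the function names, window size, and histogram ranges are hypothetical.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Minimal STFT: Hann-windowed frames, returned as (frequency bins, frames)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def duet_histogram(left, right, n_fft=1024, hop=256, bins=50,
                   max_alpha=1.5, max_delta=3.0, eps=1e-12):
    """Weighted 2d histogram of inter-channel attenuation and delay estimates."""
    L = stft(left, n_fft, hop)[1:]   # drop the DC bin, where the delay is undefined
    R = stft(right, n_fft, hop)[1:]
    omega = 2 * np.pi * np.arange(1, n_fft // 2 + 1)[:, None] / n_fft  # rad/sample
    ratio = (R + eps) / (L + eps)
    a = np.maximum(np.abs(ratio), eps)   # amplitude ratio between channels
    alpha = a - 1.0 / a                  # symmetric attenuation (assumed convention)
    delta = -np.angle(ratio) / omega     # phase difference turned into a delay
    weights = np.abs(L) * np.abs(R)      # let energetic bins dominate the peaks
    return np.histogram2d(alpha.ravel(), delta.ravel(), bins=bins,
                          range=[[-max_alpha, max_alpha], [-max_delta, max_delta]],
                          weights=weights.ravel())

# Toy check: one source, attenuated by 0.8 on the right channel, no delay.
rng = np.random.default_rng(0)
source = rng.standard_normal(8000)
hist, alpha_edges, delta_edges = duet_histogram(source, 0.8 * source)
i, j = np.unravel_index(np.argmax(hist), hist.shape)
alpha_peak = 0.5 * (alpha_edges[i] + alpha_edges[i + 1])
delta_peak = 0.5 * (delta_edges[j] + delta_edges[j + 1])
```

    With several sources, each one contributes its own peak, and every time-frequency bin is then assigned to the peak nearest to its (alpha, delta) pair.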

    We propose to improve peak/source separation in DUET by building the 2d histogram from an alternative time-frequency representation based on the Constant Q Transform (CQT). Unlike the Fourier Transform, the CQT has a logarithmic frequency resolution, which mirrors the human auditory system and matches the geometrically spaced frequencies of the Western music scale; it is therefore better adapted to music mixtures. We also propose other contributions to enhance DUET, including adaptive boundaries for the 2d histogram to improve peak resolution when sources are spatially too close, and Wiener filtering to improve source reconstruction. Experiments on mixtures of piano notes and harmonic sources showed that peak/source separation is overall improved, especially at low octaves (<200 Hz) and for small mixing angles (<pi/6 rad). Experiments on mixtures of female and male speech showed that using the CQT gives equally good results.
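
    To make the resolution argument concrete, the short sketch below computes geometrically spaced CQT center frequencies and compares their spacing to a fixed STFT bin width. The function names are hypothetical, and the STFT settings (44.1 kHz sampling rate, 2048-point window) are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def cqt_center_frequencies(f_min, bins_per_octave, n_bins):
    """Geometrically spaced center frequencies: f_k = f_min * 2**(k / bins_per_octave)."""
    return f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)

def quality_factor(bins_per_octave):
    """Constant ratio Q = f_k / bandwidth_k shared by every CQT bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

# One bin per semitone, starting at A2 (110 Hz): the bins land on A2, Bb2, B2.
freqs = cqt_center_frequencies(110.0, bins_per_octave=12, n_bins=3)
Q = quality_factor(12)
bandwidths = freqs / Q  # bandwidth grows with frequency, so Q stays constant

# Fixed STFT bin spacing for the assumed sampling rate and window length.
sample_rate, n_fft = 44100, 2048
stft_resolution = sample_rate / n_fft
```

    Under these assumptions the STFT bins are about 21.5 Hz apart, while A2 and Bb2 are only about 6.5 Hz apart, so adjacent low notes are hard to tell apart in the STFT but still fall into separate CQT bins.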

    Unlike the classic DUET based on the Fourier Transform, DUET combined with the CQT can resolve adjacent pitches in low octaves as well as in high octaves thanks to the log frequency resolution of the CQT:
[mp3] Mixture of the 3 piano notes A2, Bb2, & B2
[mp3] 1. Original piano note A2
[mp3] 2. Original piano note Bb2
[mp3] 3. Original piano note B2
[mp3] 1. Estimated piano note A2
[mp3] 2. Estimated piano note Bb2
[mp3] 3. Estimated piano note B2

    DUET combined with the CQT and adaptive boundaries helps to improve separation when sources have low pitches (for example here between the two cellos) and/or are spatially too close to each other:
[mp3] Mixture of 4 harmonic sources
[mp3] 1. Original cello 1
[mp3] 2. Original cello 2
[mp3] 3. Original flute
[mp3] 4. Original strings
[mp3] 1. Estimated cello 1
[mp3] 2. Estimated cello 2
[mp3] 3. Estimated flute
[mp3] 4. Estimated strings

[pdf] Zafar Rafii and Bryan Pardo. "Degenerate Unmixing Estimation Technique using the Constant Q Transform," 36th International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 22-27, 2011.

*This work was supported by National Science Foundation grant numbers IIS-0757544 and IIS-0643752.

REpeating Pattern Extraction Technique (REPET)

    Repetition is a core principle in music. Typical musical pieces are generally characterized by an underlying repeating structure over which varying elements are superimposed. This is especially true for popular songs, where a singer typically overlays varying vocals on a repeating accompaniment. Based on this simple observation, we introduce the REpeating Pattern Extraction Technique (REPET), a novel and simple approach for separating the repeating musical "background" from the non-repeating musical "foreground". The basic idea is to identify the repeating frames in the audio, compare them to a repeating model, and extract the repeating patterns. This can be done in 3 stages: (1) identify the period p of the underlying repeating structure in the spectrogram V of a mixture x using the beat spectrum b, (2) build a repeating segment model S from the segmented spectrogram V, (3) derive a repeating spectrogram model W using V and S and build a time-frequency mask M to extract the repeating patterns. The result is a simple but effective music/voice separation system. Note that a binary mask can further be derived from the soft mask by defining some threshold. In that case, the separation performance would increase, but at the price of introducing artifacts in the estimates.
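
    The three stages can be sketched on a toy magnitude spectrogram as follows. The text above does not say how b, S, and W are computed, so the specific choices here are assumptions made for illustration: b is the frequency-averaged autocorrelation of V squared, p is picked with a naive argmax, S is an element-wise median over segments, and W is the element-wise minimum of V and the tiled S. The function names and the 0.5 threshold are hypothetical.

```python
import numpy as np

def beat_spectrum(V):
    """Stage 1a: autocorrelate each frequency row of V**2, then average over frequency."""
    P = V ** 2
    n = P.shape[1]
    # Zero-pad to 2n so the FFT-based autocorrelation is linear, not circular.
    spectrum = np.fft.rfft(P, n=2 * n, axis=1)
    acf = np.fft.irfft(np.abs(spectrum) ** 2, n=2 * n, axis=1)[:, :n]
    b = acf.mean(axis=0)
    return b / b[0]  # normalize so that lag 0 equals 1

def repeating_period(b):
    """Stage 1b: naive period estimate, the strongest lag in the first third of b."""
    max_lag = len(b) // 3  # keep at least three repetitions in the excerpt
    return 1 + int(np.argmax(b[1:max_lag + 1]))

def repeating_segment(V, p):
    """Stage 2: element-wise median over the length-p segments of V (assumed choice)."""
    r = V.shape[1] // p  # number of complete segments
    segments = V[:, :r * p].reshape(V.shape[0], r, p)
    return np.median(segments, axis=1)

def repeating_mask(V, S, eps=1e-12):
    """Stage 3: W = min(V, tiled S), then the soft mask M = W / V, with values in (0, 1]."""
    n = V.shape[1]
    reps = -(-n // S.shape[1])  # ceiling division
    W = np.minimum(V, np.tile(S, (1, reps))[:, :n])
    return (W + eps) / (V + eps)

# Toy spectrogram: a background repeating every 8 frames, plus a short foreground event.
pattern = np.full((6, 8), 0.1)
pattern[:, 0] = 1.0                     # one strong "beat" frame per period
background = np.tile(pattern, (1, 8))
foreground = np.zeros_like(background)
foreground[2:4, 16:24] = 0.5            # non-repeating energy in two frequency rows
V = background + foreground

p = repeating_period(beat_spectrum(V))
M = repeating_mask(V, repeating_segment(V, p))
binary_M = M > 0.5  # hard mask derived from the soft mask by thresholding
```

    In a full system the mask would be applied to the complex spectrogram of the mixture, with M giving the repeating background and 1 - M the non-repeating foreground, before inverting back to the time domain.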

    Unlike other separation approaches, REPET does not depend on particular features, does not rely on complex frameworks, and does not need prior training. Because it is only based on self-similarity, it has the advantage of being simple, fast, and blind, and therefore completely and easily automatable. Evaluation on a dataset of 1,000 song clips showed that this method can be successfully applied for music/voice separation, improving on the performance of the best automatic version of a recent competitive music/voice separation system.

    There are several directions in which we would like to take this promising work. First, we would like to extend REPET to the separation of full songs, by adapting the repeating model over time to handle possible variations within the repeating structure (e.g. a verse followed by the chorus). Then, since the repetitions in the background do not necessarily occur at a fixed period, we would like to use a similarity matrix to improve the separation of REPET. Finally, since music usually involves periodically repeating patterns at different levels, we would like to extend REPET to extract multiple hierarchical repeating structures.

[mp3] Propellerheads - History Repeating (excerpt)
[mp3] Repeating structure ~ background
[mp3] Non-repeating structure ~ vocals

[mp3] The Blues Brothers - Sweet Home Chicago (excerpt)
[mp3] Repeating structure ~ background
[mp3] Non-repeating structure ~ vocals

[mp3] Rebecca Black - Friday (excerpt) (because tomorrow is Saturday...)
[mp3] Repeating structure ~ background
[mp3] Non-repeating structure ~ vocals

[mp3] RJD2 - Ghostwriter (excerpt)
[mp3] Repeating structure ~ background
[mp3] Non-repeating structure ~ voice & trumpet

[pdf] Antoine Liutkus, Zafar Rafii, Roland Badeau, Bryan Pardo, and Gaël Richard. "Adaptive Filtering for Music/Voice Separation Exploiting the Repeating Musical Structure," 37th International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, March 25-30, 2012.

[pdf] Zafar Rafii and Bryan Pardo. "A Simple Music/Voice Separation Method based on the Extraction of the Repeating Musical Structure," 36th International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 22-27, 2011.

[www] Zafar Rafii and Bryan Pardo. "Acoustic Separation System and Method," Re: US Provisional Patent Application Serial No. 61/534,280.

*This work was funded by National Science Foundation grant number IIS-0643752.