Research


We have
developed a system where the user can teach the machine an
audio concept (such as a "boomy" sound) that she/he has in
mind in order to build a simple controller that can
manipulate sound in terms of that audio concept (for
example, make the sound "more boomy"), bypassing the
bottleneck of complex interfaces and individual
differences in descriptive terms.
For this
study, we focused on improving a reverberation tool.
First, we developed our own reverberator using digital
filters, mapping the parameters of the filters to measures
of the reverberation effect, so that the reverberator can
be controlled through common descriptors such as "reverberation time" or "spectral centroid". In the
learning process, a sound is first modified by a series of
reverberation settings using the reverberator. The user
then listens and rates each modified sound as to how well
it fits the audio concept she/he has in mind. The ratings
are finally mapped onto the controls of the reverberator
and a simple controller is built where the user can manipulate
the degree of her/his audio concept on a sound. Several
experiments conducted on human subjects showed that the
system learns quickly (<3 min), predicts user
responses well (mean correlation of 0.75) and meets users'
expectations (average human rating of 7.4 out of 10).
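As a rough illustration of the learning step described above, the sketch below correlates a user's ratings with each reverberation measure across the rated examples and treats the correlations as per-measure weights. The function name `learn_weights`, the choice of Pearson correlation, and the toy data are all assumptions made for illustration; the text above does not specify how ratings are mapped onto the controls.

```python
import numpy as np

def learn_weights(measures, ratings):
    """Pearson correlation between the ratings and each measure column.

    measures: (n_examples, n_measures) reverberation measures per example
    ratings:  (n_examples,) user ratings of those examples
    """
    measures = np.asarray(measures, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    m = measures - measures.mean(axis=0)
    r = ratings - ratings.mean()
    denom = np.sqrt((m ** 2).sum(axis=0) * (r ** 2).sum())
    # Guard against constant columns (zero variance).
    return (m * r[:, None]).sum(axis=0) / np.where(denom > 0, denom, 1.0)

# Hypothetical data: 6 rated examples, 2 measures
# (reverberation time in seconds, spectral centroid in Hz).
measures = np.array([[0.5, 3000], [1.0, 2500], [1.5, 2800],
                     [2.0, 2000], [2.5, 2600], [3.0, 2200]])
ratings = np.array([1, 3, 4, 6, 8, 9])

weights = learn_weights(measures, ratings)
print(weights)
```

In this toy case the first weight is strongly positive and the second is negative, so a one-dimensional controller for the concept would mostly lengthen the reverberation time while lowering the centroid.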
Previous work applied the same approach to an equalizer. A
similar system has also been studied with application to
images.
Future research includes the combination of the
equalization and reverberation tools, the use of new tools
such as compression, the development of plugins, and the
creation of synonym maps based on the commonalities
between different individual concept mappings.
[www] Andrew Todd Sabin, Zafar Rafii,
and Bryan Pardo. "Weighting-Function-Based Rapid Mapping
of Descriptors to Audio Processing Parameters," Journal of the Audio
Engineering Society, Volume 59, Issue 6,
pp. 419-430, June, 2011.
[pdf] Zafar Rafii and Bryan Pardo.
"Learning to control a Reverberator using Subjective
Perceptual Descriptors," 10th International Society for Music
Information Retrieval Conference, Kobe, Japan,
October 26-30, 2009.
[pdf] Zafar Rafii and Bryan Pardo. "A
Digital Reverberator controlled through Measures of the
Reverberation," Northwestern University, EECS Department
Technical Report, NWU-EECS-09-08, 2009.
*This work was supported by
National Science Foundation grant number IIS-0757544.
The
Degenerate Unmixing Estimation Technique (DUET) is a Blind
Source Separation method which can separate an arbitrary
number of unknown sources using a single stereo mixture.
DUET builds a two-dimensional histogram from the amplitude
ratio and phase difference between channels, where each
peak indicates a source with peak location corresponding
to the mixing parameters associated with that source.
Provided that the time-frequency bins of the sources do
not overlap too much - an assumption generally validated
by speech mixtures - DUET identifies the peaks and
partitions the time-frequency representation of the
mixture by assigning each bin to the source with the
closest mixing parameters. However, when time-frequency
bins of the sources overlap too much, as often seen in
music mixtures when using the Short-Time Fourier
Transform, peaks start to fuse in the 2d histogram, so
that DUET cannot perform separation effectively.
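The histogram construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the Hann-windowed STFT, the symmetric attenuation a - 1/a, the |X1 X2| weighting, and the histogram ranges are assumptions, and the test signal is a toy mixture of two pure tones with different gains and no delay.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Hann-windowed STFT, returned as (frequency, time)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def duet_histogram(left, right, n_fft=1024, hop=256, bins=(35, 35),
                   max_alpha=0.7, max_delta=3.6):
    """Weighted 2D histogram of (symmetric attenuation, relative delay)."""
    X1 = stft(left, n_fft, hop)[1:]    # drop DC, where delay is undefined
    X2 = stft(right, n_fft, hop)[1:]
    omega = 2 * np.pi * np.arange(1, n_fft // 2 + 1)[:, None] / n_fft
    eps = 1e-12
    a = (np.abs(X2) + eps) / (np.abs(X1) + eps)   # amplitude ratio
    alpha = a - 1.0 / a                            # symmetric attenuation
    delta = -np.angle(X2 * np.conj(X1)) / omega    # delay in samples
    weights = np.abs(X1 * X2)                      # emphasize energetic bins
    return np.histogram2d(alpha.ravel(), delta.ravel(), bins=bins,
                          range=[[-max_alpha, max_alpha],
                                 [-max_delta, max_delta]],
                          weights=weights.ravel())

# Toy stereo mixture: two tones with different gains, no inter-channel delay.
fs = 8000
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 440 * t)
s2 = np.sin(2 * np.pi * 1000 * t)
left = s1 + s2
right = 1.2 * s1 + 0.8 * s2

hist, a_edges, d_edges = duet_histogram(left, right)
i, j = np.unravel_index(np.argmax(hist), hist.shape)
peak_alpha = 0.5 * (a_edges[i] + a_edges[i + 1])
peak_delta = 0.5 * (d_edges[j] + d_edges[j + 1])
print(peak_alpha, peak_delta)
```

Because the two tones occupy disjoint time-frequency bins, each one produces its own peak. Sources that share many bins would smear the histogram instead, which is the failure mode described above.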
We
propose to improve peak/source separation in DUET by
building the 2d histogram from an alternative
time-frequency representation based on the Constant Q
Transform (CQT). Unlike the Fourier Transform, the CQT has
a logarithmic frequency resolution, which mirrors the human
auditory system and matches the geometrically spaced
frequencies of the Western music scale, and is therefore better
adapted to music mixtures. We also propose other
contributions to enhance DUET, including adaptive
boundaries for the 2d histogram to improve peak resolving
when sources are spatially too close, and Wiener filtering
to improve source reconstruction. Experiments on mixtures
of piano notes and harmonic sources showed that
peak/source separation is overall improved, especially at
low octaves (<200 Hz) and for small mixing angles
(<pi/6 rad). Experiments on mixtures of female and male
speech showed that the use of CQT gives equally good
results.
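To make the resolution argument concrete, the short sketch below computes geometrically spaced CQT center frequencies and compares the spacing of three adjacent low notes with the fixed bin width of a Fourier Transform. The minimum frequency (A1 = 55 Hz), 12 bins per octave, the 44.1 kHz sampling rate, and the 2048-sample window are assumptions chosen for illustration.

```python
import numpy as np

def cqt_frequencies(f_min, n_bins, bins_per_octave=12):
    """Center frequencies f_k = f_min * 2**(k / bins_per_octave)."""
    return f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)

def quality_factor(bins_per_octave=12):
    """Constant ratio of center frequency to bandwidth."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

# Four octaves of semitone-spaced bins starting at A1 (assumed f_min).
freqs = cqt_frequencies(55.0, 48)
a2, bb2, b2 = freqs[12:15]          # A2, Bb2, B2

# Linear resolution of an STFT with an assumed 2048-sample window at 44.1 kHz.
stft_resolution = 44100 / 2048

print(a2, bb2, b2)                  # adjacent notes, a few Hz apart
print(bb2 - a2, stft_resolution)    # note spacing vs. STFT bin width
print(quality_factor())
```

With these assumed settings the notes A2 and Bb2 are closer together than one STFT bin, while the CQT gives each semitone its own bin at every octave.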
Unlike
the classic DUET based on the Fourier Transform, DUET
combined with the CQT can resolve adjacent pitches in low
octaves as well as in high octaves thanks to the log
frequency resolution of the CQT:
[mp3] Mixture of the 3 piano notes
A2, Bb2, & B2
[mp3] 1. Original
piano note A2
[mp3] 2. Original
piano note Bb2
[mp3] 3. Original
piano note B2
[mp3] 1. Estimated
piano note A2
[mp3] 2. Estimated
piano note Bb2
[mp3] 3. Estimated
piano note B2
DUET
combined with the CQT and adaptive boundaries helps to
improve separation when sources have low pitches (for
example here between the two cellos) and/or are spatially
too close to each other:
[mp3] Mixture of 4 harmonic sources
[mp3] 1. Original
cello 1
[mp3] 2. Original
cello 2
[mp3] 3. Original
flute
[mp3] 4. Original
strings
[mp3] 1. Estimated
cello 1
[mp3] 2. Estimated
cello 2
[mp3] 3. Estimated
flute
[mp3] 4. Estimated
strings
[pdf] Zafar Rafii and Bryan Pardo.
"Degenerate Unmixing Estimation Technique using the
Constant Q Transform," 36th
International Conference on Acoustics, Speech and Signal
Processing, Prague, Czech Republic,
May 22-27, 2011.
*This work was supported by
National Science Foundation grant numbers IIS-0757544 and IIS-0643752.
Repetition is a core principle in music. Typical musical pieces are
generally characterized by an underlying repeating structure over which
varying elements are superimposed. This is especially true for popular
songs where a singer typically overlays varying vocals on a repeating
accompaniment. Based on this simple observation, we introduce the
REpeating Pattern Extraction Technique (REPET), a novel and simple
approach for separating the repeating musical "background" from the
non-repeating musical "foreground". The basic idea is to identify the
repeating frames in the audio, compare them to a repeating model, and
extract the repeating patterns. This can be done in 3 stages: (1)
identify the period p of the underlying repeating structure in the
spectrogram V of a mixture x using the beat spectrum b, (2) build a
repeating segment model S from the segmented spectrogram V, (3) derive
a repeating spectrogram model W using V and S and build a
time-frequency mask M to extract the repeating patterns. The result is
a simple but effective music/voice separation system. Note that a
binary mask can further be derived from the soft mask by defining some
threshold. In that case, the separation performance would increase, but at the price of introducing artifacts in the estimates.
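The three stages above can be sketched on a magnitude spectrogram as follows. This is a simplified illustration under several assumptions that the text does not spell out: the beat spectrum is taken as the mean autocorrelation of the rows of V squared, the period is the smallest lag whose beat-spectrum value is within a tolerance of the maximum, S is the element-wise median of the segments, W is the element-wise minimum of the tiled S and V, and M = W / V. The function names and the synthetic spectrogram are hypothetical.

```python
import numpy as np

def beat_spectrum(V):
    """Stage 1a: mean over frequency of the row-wise autocorrelation of V**2."""
    P = V ** 2
    n = P.shape[1]
    F = np.fft.rfft(P, n=2 * n, axis=1)            # zero-pad: linear autocorrelation
    acf = np.fft.irfft(np.abs(F) ** 2, axis=1)[:, :n]
    acf = acf / np.arange(n, 0, -1)                # normalize by overlap length
    b = acf.mean(axis=0)
    return b / b[0]

def find_period(b, min_lag=2, tol=0.05):
    """Stage 1b: smallest lag close to the maximum (simple heuristic)."""
    max_lag = len(b) // 3                          # require about 3 repetitions
    seg = b[min_lag:max_lag + 1]
    close = np.flatnonzero(seg >= (1 - tol) * seg.max())
    return min_lag + int(close[0])

def repet_mask(V, p):
    """Stages 2 and 3: segment model S, spectrogram model W, soft mask M."""
    n_freq, n_time = V.shape
    r = int(np.ceil(n_time / p))
    padded = np.full((n_freq, r * p), np.nan)      # NaN-pad the last segment
    padded[:, :n_time] = V
    S = np.nanmedian(padded.reshape(n_freq, r, p), axis=1)
    W = np.minimum(np.tile(S, (1, r))[:, :n_time], V)
    M = W / np.maximum(V, 1e-12)                   # values in [0, 1]
    return S, W, M

# Synthetic example: a pattern of 8 frames repeated 8 times, plus a sparse
# non-repeating "foreground" added in two frames.
rng = np.random.default_rng(0)
pattern = rng.uniform(0.1, 1.0, size=(64, 8))
V = np.tile(pattern, (1, 8))
V[10:15, [20, 43]] += 1.0

p = find_period(beat_spectrum(V))
S, W, M = repet_mask(V, p)
binary = M > 0.5                                   # optional binary mask
print(p, M.min(), M.max())
```

Applying M to the mixture spectrogram gives the repeating background, and 1 - M gives the non-repeating foreground; the 0.5 threshold for the binary mask is an arbitrary choice for this example.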
Unlike other separation approaches, REPET does not depend on particular features, does not rely on complex frameworks, and does not need prior training. Because it is only based on self-similarity, it has the advantage of being simple, fast, and blind, and therefore completely and easily automatable. Evaluation on a dataset of 1,000 song clips showed that this method can be successfully applied for music/voice separation, improving on the performance of the best automatic version of a recent competitive music/voice separation system.
There are several directions in which we would like to take this promising work. First, we would like to extend REPET to the separation of full songs, by adapting the repeating model over time to handle possible variations within the repeating structure (e.g. a verse followed by the chorus). Then, since the repetitions in the background do not necessarily occur at a fixed period, we would like to use a similarity matrix to improve the separation of REPET. Finally, since music usually involves periodically repeating patterns at different levels, we would like to extend REPET to extract multiple hierarchical repeating structures.
[mp3] Propellerheads - History Repeating (excerpt)
[mp3]
Repeating structure ~ background
[mp3]
Non-repeating structure ~ vocals
[mp3] The Blues Brothers - Sweet Home Chicago (excerpt)
[mp3]
Repeating structure ~ background
[mp3]
Non-repeating structure ~ vocals
[mp3] Rebecca Black - Friday (excerpt) (because tomorrow is Saturday...)
[mp3]
Repeating structure ~ background
[mp3]
Non-repeating structure ~ vocals
[mp3] RJD2 - Ghostwriter (excerpt)
[mp3]
Repeating structure ~ background
[mp3]
Non-repeating structure ~ voice & trumpet
[pdf] Antoine Liutkus, Zafar Rafii, Roland Badeau, Bryan Pardo, and Gaël Richard. “Adaptive Filtering for Music/Voice Separation Exploiting the Repeating Musical Structure,” 37th International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, March 25-30, 2012.
[pdf] Zafar Rafii and Bryan Pardo. "A Simple Music/Voice Separation Method based on the Extraction of the Repeating Musical Structure," 36th International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 22-27, 2011.
[www] Zafar Rafii and Bryan Pardo. “Acoustic Separation System and Method,” Re: US Provisional Patent Application Serial No. 61/534,280.
*This work was funded by National Science Foundation grant number IIS-0643752.