The objective of my research is to develop novel approaches to improve audio analysis, to let people access, manipulate, and enjoy audio more easily. Below is a list of some of the projects I have worked on over the past few years:
- Mapping audio concepts to audio tools: an adaptive reverberation tool
- DUET using CQT: stereo source separation adapted to music signals
- REpeating Pattern Extraction Technique (REPET): source separation by repetition
- Audio fingerprinting for cover identification: match a sample from a live performance to its studio version

Mapping Audio Concepts to Audio Tools

People often think about sound in terms of subjective audio concepts that do not necessarily have a known mapping onto the controls of existing audio tools. For example, a bass player may wish to use a reverberation tool to make a recording of her/his bass sound more "boomy"; unfortunately, there is no "boomy" knob. We developed a system that can quickly learn an audio concept from a user (e.g., a "boomy" effect) and generate a simple audio controller that can manipulate sounds in terms of that audio concept (e.g., make a sound more "boomy"), bypassing the bottleneck of technical knowledge of complex interfaces and individual differences in subjective terms.

For this study, we focused on a reverberation tool. To begin with, we developed a reverberator using digital filters, mapping the parameters of the digital filters to measures of the reverberation effect, so that the reverberator can be controlled through meaningful descriptors such as "reverberation time" or "spectral centroid." In the learning process, a given sound is first modified by a series of reverberation settings using the reverberator. The user then listens and rates each modified sound as to how well it fits the audio concept she/he has in mind. The ratings are finally mapped onto the controls of the reverberator, and a simple controller is built with which the user can manipulate the degree of her/his audio concept on a sound. Several experiments conducted on human subjects showed that the system learns quickly (under 3 minutes), predicts user responses well (mean correlation of 0.75), and meets users' expectations (average human rating of 7.4 out of 10).
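The learning step above can be sketched in a few lines. This is a hypothetical illustration only: it assumes a single reverberator parameter ("reverberation time"), simulated user ratings, and a simple least-squares fit in place of the weighting-function approach described in the paper.

```python
import numpy as np

# Candidate reverberation times (in seconds) used to modify the sound
settings = np.linspace(0.2, 3.0, 15)

# Simulated user ratings of how "boomy" each modified sound is
# (a real system would collect these from listening tests)
rng = np.random.default_rng(0)
ratings = 2.0 * settings + 1.0 + rng.normal(0, 0.1, settings.size)

# Map ratings onto the control with a least-squares linear fit
slope, intercept = np.polyfit(ratings, settings, 1)

def boomy_controller(amount):
    """Return the reverberation time for a desired 'boomy' amount."""
    return slope * amount + intercept

# Turning the "boomy" knob up yields a longer reverberation time
assert boomy_controller(6.0) > boomy_controller(2.0)
```

In the actual system the fitted mapping drives the full set of reverberator controls, not a single parameter.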

A previous study was conducted based on an equalizer. A similar system has also been studied with application to images. Future research includes the combination of equalization and reverberation tools, the use of new tools such as compression, the development of plugins, and the creation of synonym maps based on the commonalities between different individual concept mappings. More information about this project can also be found on the website of the Interactive Audio Lab.

[pdf] Andrew Todd Sabin, Zafar Rafii, and Bryan Pardo. "Weighting-Function-Based Rapid Mapping of Descriptors to Audio Processing Parameters," Journal of the Audio Engineering Society, 59(6):419--430, June 2011.

[pdf] Zafar Rafii and Bryan Pardo. "Learning to control a Reverberator using Subjective Perceptual Descriptors," 10th International Society for Music Information Retrieval, Kobe, Japan, October 26-30 2009. (poster)

[pdf] Zafar Rafii and Bryan Pardo. "A Digital Reverberator controlled through Measures of the Reverberation," Northwestern University, EECS Department Technical Report, NWU-EECS-09-08, 2009.

*This work was supported by National Science Foundation grant number IIS-0757544.

DUET using CQT

The Degenerate Unmixing Estimation Technique (DUET) is a blind source separation method that can separate an arbitrary number of unknown sources using a single stereo mixture. DUET builds a two-dimensional histogram from the amplitude ratio and phase difference between channels, where each peak indicates a source, with peak location corresponding to the mixing parameters associated with that source. Provided that the time-frequency bins of the sources do not overlap too much - an assumption generally validated by speech mixtures - DUET partitions the time-frequency representation of the mixture by assigning each bin to the source with the closest mixing parameters. However, when time-frequency bins of the sources start overlapping too much - as generally seen in music mixtures when using the classic Short-Time Fourier Transform (STFT) - peaks start to fuse in the 2D histogram, so that DUET cannot perform separation effectively.
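The 2D histogram construction can be sketched as follows. This is a simplified, hypothetical sketch: it computes the amplitude ratio (as DUET's symmetric attenuation) and the phase difference at every time-frequency bin of the two channel STFTs, and histograms them; peaks in the result correspond to sources.

```python
import numpy as np

def duet_histogram(X1, X2, bins=50):
    """X1, X2: complex STFTs of the left and right channels."""
    eps = np.finfo(float).eps
    ratio = np.abs(X2) / (np.abs(X1) + eps)   # amplitude ratio per bin
    alpha = ratio - 1.0 / ratio               # symmetric attenuation
    phase = np.angle(X2 / (X1 + eps))         # phase difference per bin
    hist, a_edges, p_edges = np.histogram2d(
        alpha.ravel(), phase.ravel(), bins=bins)
    return hist, a_edges, p_edges

# Toy example: one "source" with amplitude ratio 2 and phase shift 0.5;
# every time-frequency bin then lands in a single histogram cell (one peak)
X1 = np.ones((64, 100), dtype=complex)
X2 = 2.0 * np.exp(1j * 0.5) * X1
hist, _, _ = duet_histogram(X1, X2)
assert hist.max() == hist.sum()
```

With a real mixture of several sources, the histogram shows one peak per source, and each time-frequency bin is assigned to the peak with the closest mixing parameters.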

We proposed to improve peak/source separation in DUET by building the 2D histogram from an alternative time-frequency representation based on the Constant Q Transform (CQT). Unlike the Fourier Transform, the CQT has a logarithmic frequency resolution, mirroring the human auditory system and matching the geometrically spaced frequencies of the Western music scale, and is therefore better adapted to music mixtures. We also proposed other contributions to enhance DUET, such as adaptive boundaries for the 2D histogram to improve peak resolution when sources are spatially too close to each other, and Wiener filtering to improve source reconstruction. Experiments on mixtures of piano notes and harmonic sources showed that peak/source separation is overall improved, especially at low octaves (under 200 Hz) and for small mixing angles (under pi/6 rad). Experiments on mixtures of female and male speech showed that the use of the CQT gives equally good results.
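The geometric frequency spacing of the CQT can be made concrete with a short sketch. This is an illustrative computation only: with b bins per octave, each center frequency is 2^(1/b) times the previous one, so adjacent musical semitones are a constant ratio apart rather than a constant number of hertz apart as in the Fourier Transform.

```python
import numpy as np

def cqt_frequencies(f_min, n_bins, bins_per_octave=12):
    """Geometrically spaced CQT center frequencies starting at f_min."""
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)

# 4 octaves of semitones starting at A1 (55 Hz)
freqs = cqt_frequencies(f_min=55.0, n_bins=48)

# Adjacent bins are a constant *ratio* (a semitone) apart
assert np.allclose(freqs[1:] / freqs[:-1], 2.0 ** (1 / 12))

# One octave up doubles the frequency: A1 (55 Hz) -> A2 (110 Hz)
assert np.isclose(freqs[12], 110.0)
```

This is why the CQT keeps adjacent low pitches (e.g., A2 and Bb2) in separate bins, where a linear-frequency STFT of practical window length would merge them.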

Unlike the classic DUET based on the Fourier Transform, DUET combined with the CQT can resolve adjacent pitches in low octaves as well as in high octaves thanks to the log frequency resolution of the CQT:
[mp3] Mixture of 3 piano notes: A2, Bb2, and B2
[mp3] 1. Original A2      [mp3] 1. Estimated A2
[mp3] 2. Original Bb2    [mp3] 2. Estimated Bb2
[mp3] 3. Original B2      [mp3] 3. Estimated B2

DUET combined with the CQT and adaptive boundaries helps to improve separation when sources have low pitches (for example here between the two cellos) and/or are spatially too close to each other:
[mp3] Mixture of 4 harmonic sources
[mp3] 1. Original cello 1    [mp3] 1. Estimated cello 1
[mp3] 2. Original cello 2    [mp3] 2. Estimated cello 2
[mp3] 3. Original flute       [mp3] 3. Estimated flute
[mp3] 4. Original strings    [mp3] 4. Estimated strings

More information about this project can also be found on the website of the Interactive Audio Lab.

[pdf] Zafar Rafii and Bryan Pardo. "Degenerate Unmixing Estimation Technique using the Constant Q Transform," 36th International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 22-27 2011. (poster)

*This work was supported by National Science Foundation grant numbers IIS-0757544 and IIS-0643752.

REpeating Pattern Extraction Technique (REPET)

This is my thesis work; please see the REPET tab.

Audio Fingerprinting for Cover Identification

Suppose that you are at a music festival checking out an artist, and you would like to quickly learn about the song being played (e.g., title, lyrics, album, etc.). If you have a smartphone, you could record a sample of the live performance and compare it against a database of existing recordings by the artist. Services such as Shazam or SoundHound will not work here: this is not the typical framework for audio fingerprinting or query-by-humming systems, as a live performance is neither identical to its studio version (e.g., variations in instrumentation, key, tempo, etc.) nor is it a hummed or sung melody. We propose an audio fingerprinting system that can deal with live version identification by using image processing techniques. Compact fingerprints are derived using a log-frequency spectrogram and an adaptive thresholding method, and template matching is performed using the Hamming similarity and the Hough Transform.
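The fingerprinting and matching ideas can be sketched as follows. This is a simplified, hypothetical sketch: it binarizes a log-frequency spectrogram with a local-median threshold as a stand-in for the adaptive thresholding method, and compares two binary fingerprints with the Hamming similarity; the Hough Transform alignment step from the published system is omitted.

```python
import numpy as np

def binarize(spectrogram, width=9):
    """Adaptive thresholding: a bin is 1 if it exceeds the local median
    computed over a sliding window of `width` time frames."""
    n_freq, n_time = spectrogram.shape
    fingerprint = np.zeros_like(spectrogram, dtype=np.uint8)
    half = width // 2
    for t in range(n_time):
        lo, hi = max(0, t - half), min(n_time, t + half + 1)
        local_median = np.median(spectrogram[:, lo:hi], axis=1)
        fingerprint[:, t] = spectrogram[:, t] > local_median
    return fingerprint

def hamming_similarity(fp1, fp2):
    """Fraction of matching bits between two equal-sized fingerprints."""
    return np.mean(fp1 == fp2)

# A fingerprint matches itself perfectly
spec = np.random.default_rng(1).random((32, 40))
fp = binarize(spec)
assert hamming_similarity(fp, fp) == 1.0
```

Binarizing against a local threshold keeps the fingerprint compact and makes it robust to the overall level and timbre changes between a live performance and its studio version.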

[pdf] Zafar Rafii, Bob Coover, and Jinyu Han. "An Audio Fingerprinting System for Live Version Identification using Image Processing Techniques," 39th International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, May 4-9 2014. (poster)

*This work was performed during an internship at Gracenote, a leading company in music and video recognition.