Contact Information

Mobile:
1-847-530-7762
Email:
zhiyaoduan00 <at> gmail <dot> com

Research

Score Alignment and Following (Jul. 2009 - present)

Score alignment involves finding the best alignment between an audio performance and the events in a machine-readable music score. Score alignment can be addressed offline or online. An offline algorithm can use the whole performance of a music piece. The online version (also called score following) cannot "look ahead" at future performance events when aligning the current event to the score.

We present a novel online audio-score alignment approach for multi-instrument polyphonic music. This approach uses a 2-dimensional state vector to model the underlying score position and tempo of each time frame of the audio performance. The process model is defined by dynamic equations to transition between states. Two representations of the observed audio frame are proposed, resulting in two observation models: a multi-pitch-based model and a chroma-based model. Particle filtering is used to infer the hidden states from observations. Experiments on 150 music pieces with polyphony from one to four show that the proposed approach outperforms an existing offline global string alignment-based score alignment approach. Results also show that the multi-pitch-based observation model works better than the chroma-based one.
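As a rough illustration of the state-space idea, here is a minimal particle filter sketch over a (score position, tempo) state. It is not the published system: the function name `follow_score`, the noise levels, the tempo range, and especially the observation model (a Gaussian likelihood on a noisy position reading, standing in for the multi-pitch-based or chroma-based models) are all assumptions made for the example.

```python
import numpy as np

def follow_score(observations, dt=0.01, n_particles=500, seed=0):
    """Toy particle filter over (score position in beats, tempo in beats/s).

    `observations` is a sequence of noisy score-position readings, one per
    audio frame. This placeholder replaces the audio-based observation models.
    """
    rng = np.random.default_rng(seed)
    pos = np.zeros(n_particles)                    # every particle starts at the score onset
    tempo = rng.uniform(0.5, 3.0, n_particles)     # assumed plausible tempo range
    estimates = []
    for obs in observations:
        # Process model: tempo drifts slowly, position advances by tempo * dt.
        tempo = np.clip(tempo + rng.normal(0.0, 0.02, n_particles), 0.1, 5.0)
        pos = pos + tempo * dt + rng.normal(0.0, 0.002, n_particles)
        # Observation model (assumed): Gaussian likelihood around each particle.
        w = np.exp(-0.5 * ((obs - pos) / 0.05) ** 2) + 1e-300
        w /= w.sum()
        # Online estimate for this frame: weighted mean of the particles.
        estimates.append((float(np.dot(w, pos)), float(np.dot(w, tempo))))
        # Multinomial resampling keeps the particles in likely regions.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        pos, tempo = pos[idx], tempo[idx]
    return estimates
```

Each frame uses only past and current observations, which is what makes the procedure online: no future frame is consulted when the current estimate is produced.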

We then extend the score follower to an online score-informed source separation system, called Soundprism. In building the source separator, we first refine the score-informed pitches of the current audio frame by maximizing the multi-pitch observation likelihood. Then, the harmonics of each source's fundamental frequency are extracted to reconstruct the source signal. Overlapping harmonics between sources are identified and their energy is distributed in inverse proportion to the square of their respective harmonic numbers. Experiments on both synthetic and human-performed music show that both the score follower and the source separator perform well.
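The rule for overlapping harmonics can be written down directly. The sketch below is only an illustration of that one step; the function name `split_overlapping_energy` and its argument layout are assumptions, not part of Soundprism.

```python
def split_overlapping_energy(total_energy, harmonic_numbers):
    """Share the energy of one overlapped frequency bin among sources.

    harmonic_numbers[i] is the harmonic number with which source i lands on
    the bin. Each source receives a share proportional to 1 / h**2.
    """
    weights = [1.0 / (h * h) for h in harmonic_numbers]
    norm = sum(weights)
    return [total_energy * w / norm for w in weights]

# A bin shared by the 2nd harmonic of one source and the 1st harmonic of another:
shares = split_overlapping_energy(10.0, [2, 1])   # weights 1/4 and 1
```

In this example the source contributing its fundamental receives four times the energy of the source contributing its second harmonic, and the shares add up to the original bin energy.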

Related Papers:
[1] Zhiyao Duan and Bryan Pardo, "A state space model for online polyphonic audio-score alignment," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011. <pdf> <poster> <sound files>
[2] Zhiyao Duan and Bryan Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal of Selected Topics in Signal Process., in press. <pdf> <sound files>

Multi-pitch Estimation and Tracking (Jul. 2006 - present)

Multi-pitch (fundamental frequency, F0) estimation aims to estimate the pitches of polyphonic music audio in each time frame. Multi-pitch tracking connects the pitch estimates in different frames to form a pitch trajectory for each source (instrument or voice). This is one of the most fundamental problems in Music Information Retrieval (MIR). Although pitch detection and tracking techniques for monophonic audio are robust, pitch estimation and tracking for polyphonic audio is still an open problem, where computer algorithms are far behind human ability in both accuracy and robustness.

For Multi-pitch Estimation, we present a maximum likelihood approach for a mixture of harmonic sound sources, where the amplitude spectrum of a time frame is the observation and the F0s are the parameters to be estimated. When defining the likelihood model, previous methods only model spectral peaks, while the proposed method also models non-peak regions (frequencies further than a musical quarter-step from all observed peaks). It is shown that the peak likelihood and the non-peak region likelihood act as a complementary pair. The former helps find F0s that have harmonics that explain peaks, while the latter helps avoid F0s that have harmonics in non-peak regions. Parameters of these models are learned from monophonic and polyphonic training data. This paper proposes an iterative greedy search strategy to estimate F0s one by one, to avoid the combinatorial problem of concurrent F0 estimation. We also propose a polyphony estimation method to terminate the iterative process. Finally, this paper proposes a post-processing method to refine polyphony and F0 estimates using neighboring frames. It is shown that this refinement method eliminates many inconsistent estimation errors. Evaluations are done on ten recorded four-part J. S. Bach chorales. Results show that the proposed method achieves better F0 estimation and polyphony estimation than a state-of-the-art algorithm. We also analyze the relative contributions of different components of the proposed method. A more detailed description of this work can be found here.
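To make the non-peak region idea concrete, the following sketch lists which harmonics of a candidate F0 fall in a non-peak region. It only illustrates the geometric test, not the learned likelihood models; the helper names, the number of harmonics checked, and the reading of a "quarter-step" as 50 cents are assumptions.

```python
import math

def cents(f1, f2):
    """Absolute distance between two frequencies in cents."""
    return abs(1200.0 * math.log2(f1 / f2))

def harmonics_in_nonpeak_region(f0, peaks_hz, n_harmonics=5, quarter_step_cents=50.0):
    """Return harmonic numbers of f0 that lie further than a quarter-step
    (assumed to be 50 cents) from every observed spectral peak."""
    missing = []
    for h in range(1, n_harmonics + 1):
        fh = h * f0
        if all(cents(fh, p) > quarter_step_cents for p in peaks_hz):
            missing.append(h)
    return missing

peaks = [220.0, 440.0, 660.0, 880.0, 1100.0]
print(harmonics_in_nonpeak_region(220.0, peaks))  # [] : every harmonic explains a peak
print(harmonics_in_nonpeak_region(110.0, peaks))  # [1, 3, 5] : odd harmonics have no peak
```

The sub-octave candidate at 110 Hz explains every peak just as well as 220 Hz does, so a peak-only model cannot separate them. Its odd harmonics land in non-peak regions, which is the kind of evidence the non-peak likelihood uses against such candidates.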

Related Papers:
[1] Zhiyao Duan and Changshui Zhang, "A probabilistic approach to multiple fundamental frequency estimation from the amplitude spectrum peaks," Music, Brain and Cognition workshop in the Twenty-first Annual Conference on Neural Information Processing Systems (NIPS), 2007. <pdf> <slides> <poster>
[2] Zhiyao Duan, Bryan Pardo, and Changshui Zhang, "Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions," IEEE Trans. Audio Speech Language Process., vol. 18, no. 8, pp. 2121-2133, 2010. <pdf> <sound files>
[3] Zhiyao Duan, Jinyu Han, and Bryan Pardo, "Harmonically informed multi-pitch tracking," in Proc. International Conference on Music Information Retrieval (ISMIR), 2009. <pdf> <slides>
[4] Zhiyao Duan, Jinyu Han and Bryan Pardo, "Song-level multi-pitch tracking by heavily constrained clustering," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010. <pdf> <slides>

From the Computational Auditory Scene Analysis (CASA) point of view, we proposed an algorithm based on the concept of "partial event and support transfer". A "partial event" is a simplified representation of a partial in the spectrogram and is defined like the note event in MIDI. Each partial event is viewed as an F0 candidate, and gets support from higher frequency partial events, according to their time and harmonicity relations. Therefore, support is transferred from higher frequency events to lower frequency events, and ideally concentrated on F0s.

Related paper:
[1] Zhiyao Duan, Dan Zhang, Changshui Zhang, and Zhenwei Shi, "Multi-pitch estimation based on partial event and support transfer," in Proc. International Conference on Multimedia & Expo (ICME), pp.216-219, 2007. <pdf> <poster> <sound files>

Music Similarity Measure and Recommendation (Jul. 2007 - Apr. 2008)

I worked on this project at Microsoft Research Asia (MSRA).
Current music recommendation mainly relies on music reviews and listeners' feedback. However, numerous songs have never been reviewed or listened to. If the similarity among songs can be calculated directly from raw audio, automatic music recommendation becomes possible and many more songs can be discovered by listeners. One difficulty is that music similarity has many aspects. For example, two songs may have similar genres and instruments but different types of vocals. In this project, we proposed a method to model the similarity between songs in several aspects, including genre, instrument, vocal, tempo, emotion, rhythm, tonality, etc. Each aspect represents an important factor that affects people's judgement of similarity among songs. These aspects can be modeled individually and then combined to calculate the similarity matrix, but the relations among the aspects should also be considered. Finally, we want to employ techniques such as relevance feedback or active learning to adapt the algorithm to each individual listener's interests and similarity judgements.

Related Papers:
[1] Zhiyao Duan, Lie Lu, and Changshui Zhang, "Collective annotation of music from multiple semantic categories," in Proc. International Conference on Music Information Retrieval (ISMIR), 2008. <pdf> <poster>

Tonality Classification (Nov. 2007 - Jan. 2008)

This project was part of the larger project "Music Similarity Measure and Recommendation" that I worked on at Microsoft Research Asia (MSRA).
Traditional tonality mode (major or minor) classification and audio key finding algorithms often rely on detailed key-name annotations of the training songs. However, unlike classical music, whose keys are usually stated explicitly in the titles, annotating keys for large collections of popular music requires expert knowledge and immense labor. In contrast, the mode of each song is much easier to label. With only modes labeled, however, traditional approaches to mode modeling cannot be applied directly, because there is no reference point for transposing chroma features from different keys. This work proposes an approach for tonality classification of popular music without tonic annotations on the training data. We proposed an alignment approach to transpose chroma features within each mode to a reference (but unknown) tonic. Then several methods, including Single Profile Correlation (SPC), Multiple Profile Correlation (MPC) and Support Vector Machine (SVM), were used for mode learning and classification.
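One simple way to picture transposing chroma features to a common tonic is a circular shift search. The sketch below is a stand-in for the alignment step, not the published procedure: the function `align_chroma`, the use of a fixed reference profile, and the dot-product score are all assumptions.

```python
import numpy as np

def align_chroma(chroma, reference):
    """Find the circular shift (in semitones) that best matches a 12-bin
    chroma vector to a reference profile, and return the shifted vector."""
    chroma = np.asarray(chroma, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Score every possible transposition by its correlation with the reference.
    scores = [float(np.dot(np.roll(chroma, -s), reference)) for s in range(12)]
    best = int(np.argmax(scores))
    return best, np.roll(chroma, -best)

# Hypothetical reference profile and a copy transposed up by 3 semitones.
profile = np.array([1.0, 0.0, 0.5, 0.0, 0.8, 0.6, 0.0, 0.9, 0.0, 0.5, 0.0, 0.3])
shift, aligned = align_chroma(np.roll(profile, 3), profile)
```

Once every song in a mode is shifted to the same reference, a single mode profile can be learned without ever knowing the absolute tonic of any training song.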

Related Papers:
[1] Zhiyao Duan, Lie Lu, and Changshui Zhang, "Audio tonality mode classification without tonic annotations," in Proc. International Conference on Multimedia & Expo (ICME), 2008. <pdf> <poster>

Excitation Signal Extraction for Guitar Tones (Apr. 2007 - Jun. 2007)

I worked on this project with Nelson Lee and Prof. Julius Smith at the Center for Computer Research in Music and Acoustics (CCRMA) at Stanford University when I was an exchange student.
It was part of the larger Guitar Tone Analysis and Synthesis project. This work was concerned with extracting excitation signals from recorded plucked-string sounds of an acoustic guitar, for use in tone synthesis. The proposed method was based on removal of spectral peaks, followed by statistical interpolation to reconstruct the excitation spectrum in frequency intervals occluded by partial overtones. Experimental results on synthesized and real tones showed that it outperformed previous methods in removing tonal components from the resulting excitation signal while maintaining a noise-burst-like quality.
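The peak-removal step can be pictured with the following sketch. It is a simplification: plain linear interpolation is swapped in for the statistical interpolation used in the actual work, and the function name `remove_partials`, the fixed `half_width_hz`, and the assumption of exactly harmonic partials are all illustrative.

```python
import numpy as np

def remove_partials(magnitude, freqs_hz, f0_hz, half_width_hz=20.0):
    """Remove spectral peaks at harmonics of f0_hz from a magnitude spectrum
    and fill the occluded bins by linear interpolation (a stand-in for the
    statistical interpolation described above)."""
    magnitude = np.asarray(magnitude, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    # Nearest harmonic of f0 for each bin (at least the first harmonic).
    nearest = np.maximum(np.round(freqs_hz / f0_hz), 1.0) * f0_hz
    occluded = np.abs(freqs_hz - nearest) <= half_width_hz
    cleaned = magnitude.copy()
    # Rebuild occluded bins from the surrounding non-occluded spectrum.
    cleaned[occluded] = np.interp(
        freqs_hz[occluded], freqs_hz[~occluded], magnitude[~occluded]
    )
    return cleaned, occluded
```

The returned mask shows which bins were treated as occluded by partials, so the filled-in regions can be inspected separately from the untouched spectrum.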

Related Papers:
[1] Nelson Lee, Zhiyao Duan, and Julius O. Smith, "Excitation signal extraction for guitar tones," in Proc. International Computer Music Conference (ICMC), 2007. <pdf>

Music Source Separation (May. 2006 - Oct. 2007)

Music signal separation aims to separate the sound streams of different sources (instruments and singing voices) in polyphonic music audio. It is closely related to automatic music transcription, which converts music audio to symbolic representations such as MIDI. On one hand, transcription is much easier if the polyphonic music audio is separated into monophonic streams, because the techniques for monophonic music transcription are much more mature; on the other hand, music transcription results, such as the transcribed pitches, can be used to guide the separation of harmonic sources.

Source separation of musical signals is an appealing but difficult problem, especially in the single-channel case. In this paper, an unsupervised single-channel music source separation algorithm based on average harmonic structure modeling is proposed. Under the assumption of playing in narrow pitch ranges, different harmonic instrumental sources in a piece of music often have different but stable harmonic structures, thus sources can be characterized uniquely by harmonic structure models. Given the number of instrumental sources, the proposed algorithm learns these models directly from the mixed signal by clustering the harmonic structures extracted from different frames. The corresponding sources are then extracted from the mixed signal using the models. Experiments on several mixed signals, including synthesized instrumental sources, real instrumental sources and singing voices, show that this algorithm outperforms the general Nonnegative Matrix Factorization (NMF)-based source separation algorithm, and yields good subjective listening quality. As a side-effect, this algorithm estimates the pitches of the harmonic instrumental sources. The number of concurrent sounds in each frame is also computed, which is a difficult task for general Multi-pitch Estimation (MPE) algorithms. Here are the extensive experimental results (including sound files).

Related Papers:
[1] Zhiyao Duan, Yungang Zhang, Changshui Zhang and Zhenwei Shi, "Unsupervised monaural music source separation by average harmonic structure modeling," IEEE Trans. Audio Speech Language Process., vol. 16, no. 4, pp. 766-778, 2008. <pdf> <sound files>


Last updated June 8, 2011