|

Contact Information
Mobile:
1-847-530-7762
Email:
zhiyaoduan00 <at> gmail <dot> com
|
Research

Score Alignment and Following (Jul. 2009 - present)
 |
Score alignment involves finding
the best alignment between an audio performance and the events
in a machine-readable music score. Score alignment can be addressed
offline or online. An offline algorithm can use the whole performance
of a music piece. The online version (also called score following)
cannot "look ahead" at future performance events when
aligning the current event to the score. |
We present a novel online audio-score alignment approach for
multi-instrument polyphonic music. This approach uses a 2-dimensional
state vector to model the underlying score position and tempo
of each time frame of the audio performance. The process model
is defined by dynamic equations to transition between states.
Two representations of the observed audio frame are proposed,
resulting in two observation models: a multi-pitch-based model and
a chroma-based model. Particle filtering is used to infer the hidden
states from observations. Experiments on 150 music pieces with
polyphony from one to four show the proposed approach outperforms
an existing offline global string alignment-based score alignment
approach. Results also show that the multi-pitch-based observation
model works better than the chroma-based one.
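As a rough illustration of the state-space idea above, here is a minimal bootstrap particle filter over a 2-dimensional state (score position, tempo). It is a sketch under stated assumptions, not the implementation from the paper: the function names, the noise level, and in particular the Gaussian toy likelihood in `run_demo` are hypothetical stand-ins for the multi-pitch-based and chroma-based observation models.

```python
import math
import random


def particle_filter_step(particles, observation_likelihood, dt, rng,
                         tempo_noise=0.05):
    """One predict / weight / resample step of a bootstrap particle filter.

    particles: list of (position_in_beats, tempo_in_beats_per_second).
    observation_likelihood: callable mapping a score position to a
        non-negative likelihood of the current audio frame (hypothetical
        stand-in for a multi-pitch or chroma observation model).
    """
    # Predict: tempo follows a random walk, position advances by tempo * dt.
    predicted = []
    for position, tempo in particles:
        new_tempo = max(1e-3, tempo + rng.gauss(0.0, tempo_noise))
        predicted.append((position + new_tempo * dt, new_tempo))

    # Weight each particle by how well its position explains the frame.
    weights = [observation_likelihood(position) for position, _ in predicted]
    if sum(weights) <= 0.0:
        return predicted  # uninformative frame: keep the prediction

    # Resample in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Mean score position and mean tempo of the particle set."""
    n = len(particles)
    return (sum(p for p, _ in particles) / n,
            sum(t for _, t in particles) / n)


def run_demo(seed=0, n_particles=500, n_frames=50, dt=0.1, true_tempo=2.0):
    """Follow a synthetic performance played at a constant tempo."""
    rng = random.Random(seed)
    particles = [(0.0, rng.uniform(1.0, 3.0)) for _ in range(n_particles)]
    true_position = 0.0
    for _ in range(n_frames):
        true_position += true_tempo * dt

        # Toy assumption: the frame is most likely near the true position.
        def likelihood(position, center=true_position, sigma=0.2):
            return math.exp(-0.5 * ((position - center) / sigma) ** 2)

        particles = particle_filter_step(particles, likelihood, dt, rng)
    return estimate(particles)
```

Because each step uses only the current frame, the filter never looks ahead, which is the constraint that distinguishes score following from offline alignment.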
We then extend the score follower to an online score-informed
source separation system, called Soundprism.
In building the source separator, we first refine the score-informed
pitches of the current audio frame by maximizing the multipitch
observation likelihood. Then, the harmonics of each source's
fundamental frequency are extracted to reconstruct the source
signal. Overlapping harmonics between sources are identified
and their energy is distributed in inverse proportion to the
square of their respective harmonic number. Experiments on both
synthetic and human-performed music show both the score follower
and the source separator perform well.
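The rule for overlapping harmonics can be written down in a few lines. The sketch below only illustrates the stated proportion (energy shared in inverse proportion to the square of each source's harmonic number); the function name and the example frequencies are hypothetical.

```python
def split_overlapping_energy(energy, harmonic_numbers):
    """Split the energy of one shared spectral component among sources.

    harmonic_numbers[i] is the harmonic index (1 = fundamental) of the
    harmonic of source i that falls on the shared frequency. Each source
    receives a share proportional to 1 / harmonic_number**2.
    """
    if any(h < 1 for h in harmonic_numbers):
        raise ValueError("harmonic numbers must be >= 1")
    weights = [1.0 / (h * h) for h in harmonic_numbers]
    total = sum(weights)
    return [energy * w / total for w in weights]


# Example: a source with F0 = 200 Hz and a source with F0 = 300 Hz
# overlap at 600 Hz (3rd harmonic of the first, 2nd of the second).
shares = split_overlapping_energy(1.0, [3, 2])
```

In this example the source contributing its 2nd harmonic receives 9/13 of the energy and the source contributing its 3rd harmonic receives 4/13, so lower harmonics are favored.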
Related Papers:
[1] Zhiyao Duan and Bryan Pardo, "A state
space model for online polyphonic audio-score alignment,"
in Proc. IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2011. <pdf>
<poster>
<sound files>
[2] Zhiyao Duan and Bryan Pardo, "Soundprism:
an online system for score-informed source separation of music
audio," IEEE Journal of Selected Topics in Signal Process.,
in press. <pdf>
<sound files>
|

Multi-pitch Estimation and Tracking (Jul. 2006 - present)
|
Multi-pitch (fundamental frequency, F0) estimation aims to estimate the pitches present in each time frame of polyphonic music audio. Multi-pitch tracking connects the pitch estimates across frames to form a pitch trajectory for each source (instrument or voice). This is one of the most fundamental problems in Music Information Retrieval (MIR). Although pitch detection and tracking techniques for monophonic audio are robust, pitch estimation and tracking for polyphonic audio remains an open problem, where computer algorithms are far behind human ability in both accuracy and robustness. |
For Multi-pitch Estimation, we present a maximum likelihood approach for a mixture
of harmonic sound sources, where the amplitude spectrum of a
time frame is the observation and the F0s are the parameters
to be estimated. When defining the likelihood model, previous
methods only model spectral peaks, while the proposed method
also models non-peak regions (frequencies further than a musical
quarter-step from all observed peaks). It is shown that the peak
likelihood and the non-peak region likelihood act as a complementary
pair. The former helps find F0s that have harmonics
that explain peaks, while the latter helps avoid F0s that have
harmonics in non-peak regions. Parameters of these models are
learned from monophonic and polyphonic training data. This
paper proposes an iterative greedy search strategy to estimate
F0s one by one, to avoid the combinatorial problem of concurrent
F0 estimation. We also propose a polyphony estimation method
to terminate the iterative process. Finally, this paper proposes
a post-processing method to refine polyphony and F0 estimates
using neighboring frames. It is shown that this refinement method
eliminates many inconsistent estimation errors. Evaluations are
done on ten recorded four-part J. S. Bach chorales. Results show
that the proposed method achieves better F0 estimation and
polyphony estimation than a state-of-the-art algorithm.
We also analyze the relative contributions of different
components of the proposed method. A more detailed description of this work can be found here.
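The iterative greedy search can be sketched as follows. This is an illustration under assumptions, not the published method: `greedy_f0_search`, the `min_gain` stopping rule (a simplified stand-in for the polyphony estimation step), and the toy scoring function are all hypothetical. The toy score rewards F0s whose harmonics explain observed peaks and penalizes harmonics that land where no peak exists, loosely mirroring the peak and non-peak-region likelihoods.

```python
def greedy_f0_search(candidates, log_likelihood, max_polyphony, min_gain=0.0):
    """Add F0s one at a time, always taking the candidate that raises the
    likelihood most; stop when the gain is no larger than min_gain."""
    selected = []
    current = log_likelihood(selected)
    while len(selected) < max_polyphony:
        best, best_score = None, current
        for f0 in candidates:
            if f0 in selected:
                continue
            score = log_likelihood(selected + [f0])
            if score > best_score:
                best, best_score = f0, score
        if best is None or best_score - current <= min_gain:
            break
        selected.append(best)
        current = best_score
    return selected


def make_toy_log_likelihood(peaks, n_harmonics=3, tolerance=0.03,
                            miss_penalty=0.5):
    """Toy score (an assumption): +1 per explained peak, minus a penalty
    per predicted harmonic that has no peak nearby."""
    def near(freq, target):
        return abs(freq - target) <= tolerance * target

    def log_likelihood(f0s):
        harmonics = [f0 * h for f0 in f0s for h in range(1, n_harmonics + 1)]
        explained = sum(1 for p in peaks if any(near(p, h) for h in harmonics))
        missing = sum(1 for h in harmonics
                      if not any(near(p, h) for p in peaks))
        return explained - miss_penalty * missing

    return log_likelihood


# Peaks produced by two notes at 200 Hz and 300 Hz.
toy = make_toy_log_likelihood(peaks=[200, 300, 400, 600, 900])
estimated = greedy_f0_search([100, 150, 200, 300, 400], toy, max_polyphony=4)
```

Searching one F0 at a time evaluates on the order of (candidates x polyphony) hypotheses instead of every subset of candidates, which is how the greedy strategy sidesteps the combinatorial cost of joint estimation.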
Related Papers:
[1] Zhiyao Duan and Changshui
Zhang, "A probabilistic approach to multiple fundamental
frequency estimation from the amplitude spectrum peaks,"
Music, Brain and Cognition workshop in the Twenty-first
Annual Conference on Neural Information Processing Systems (NIPS),
2007. <pdf> <slides>
<poster>
[2] Zhiyao Duan, Bryan
Pardo, and Changshui Zhang, "Multiple fundamental frequency
estimation by modeling spectral peaks and non-peak regions,"
IEEE Trans. Audio Speech Language Process., vol. 18, no. 8, pp. 2121-2133, 2010. <pdf>
<sound files>
[3] Zhiyao Duan, Jinyu
Han, and Bryan Pardo, "Harmonically informed multi-pitch
tracking," in Proc. International Conference on Music
Information Retrieval (ISMIR), 2009. <pdf>
<slides>
[4] Zhiyao Duan, Jinyu
Han and Bryan Pardo, "Song-level multi-pitch tracking by
heavily constrained clustering," in Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
2010. <pdf>
<slides>
|
From the Computational Auditory Scene Analysis (CASA) point of view, we proposed an algorithm based on the concept of "partial event and support transfer". A "partial event" is a simplified representation of a partial in the spectrogram and is defined like a note event in MIDI. Each partial event is viewed as an F0 candidate and receives support from higher-frequency partial events according to their time and harmonicity relations. Support is therefore transferred from higher-frequency events to lower-frequency events and, ideally, concentrates on the F0s.
Related paper:
[1] Zhiyao Duan, Dan Zhang, Changshui Zhang, and Zhenwei Shi, "Multi-pitch estimation based on partial event and support transfer," in Proc. International Conference on Multimedia & Expo (ICME), pp. 216-219, 2007. <pdf> <poster> <sound files>
|

Music Similarity Measure and Recommendation (Jul. 2007 - Apr. 2008)
 |
I worked on this project at Microsoft Research Asia (MSRA). |
Current music recommendation relies mainly on music reviews and listeners' feedback. However, numerous songs have never been reviewed or listened to. If the similarity among songs can be calculated directly from raw audio, automatic music recommendation becomes possible and many more songs can be discovered by listeners. One difficulty is that music similarity has many aspects. For example, two songs may have similar genres and instruments but different types of vocals. In this project, we proposed a method to model the similarity between songs in several aspects, including genre, instrument, vocal, tempo, emotion, rhythm, tonality, etc. Each aspect represents an important factor that affects people's judgment of similarity among songs. These aspects can be modeled individually and then combined to calculate the similarity matrix, but the relations among the aspects should also be considered. Finally, we want to employ techniques such as relevance feedback or active learning to adapt the algorithm to each individual listener's interests and similarity judgments.
Related Papers:
[1] Zhiyao Duan, Lie Lu, and Changshui Zhang, "Collective annotation of music from multiple semantic categories," in Proc. International Conference on Music Information Retrieval (ISMIR), 2008. <pdf> <poster>
|

Tonality Classification (Nov. 2007 - Jan. 2008)
 |
This project was part of the larger project "Music Similarity Measure and Recommendation", which I worked on at Microsoft Research Asia (MSRA). |
Traditional algorithms for tonality mode (major or minor) classification or audio key finding often rely on detailed key annotations of the training songs. However, unlike classical music, whose keys are usually stated explicitly in the titles, annotating keys for the large body of popular music requires expert knowledge and immense labor. In contrast, the mode of each song is much easier to label. With only modes labeled, however, traditional approaches to mode modeling cannot be applied directly, because there is no reference point for transposing chroma features of songs in different keys. This work proposes an approach to tonality classification of popular music without tonic annotations on the training data. We proposed an alignment approach that transposes the chroma features within each mode to a reference (but unknown) tonic. Several methods, including Single Profile Correlation (SPC), Multiple Profile Correlation (MPC) and Support Vector Machine (SVM), were then used for mode learning and classification.
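The transposition step can be illustrated with 12-bin chroma vectors, where changing key corresponds to a circular shift. The sketch below is a simplification under assumptions: it aligns one chroma vector to a given reference profile by maximizing Pearson correlation over the 12 shifts, whereas in the approach described above the reference tonic is unknown and has to be found as part of the alignment. The function names and the template values are hypothetical.

```python
import math


def rotate(chroma, shift):
    """Circularly shift a 12-bin chroma vector down by `shift` semitones."""
    shift %= 12
    return chroma[shift:] + chroma[:shift]


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a flat vector carries no tonal information
    return cov / math.sqrt(var_a * var_b)


def align_to_reference(chroma, reference):
    """Return (shift, aligned_chroma) for the shift that best matches."""
    best_shift = max(range(12),
                     key=lambda s: correlation(rotate(chroma, s), reference))
    return best_shift, rotate(chroma, best_shift)


# Hypothetical major-like template, and the same profile 7 semitones higher.
reference = [5, 0, 2, 0, 3, 2, 0, 4, 0, 2, 0, 1]
song = rotate(reference, 12 - 7)
shift, aligned = align_to_reference(song, reference)
```

Once every song in a mode has been shifted to a common tonic, a single profile per mode can be learned and compared against new songs, which is the role played by the profile-correlation and SVM classifiers.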
Related Papers:
[1] Zhiyao Duan, Lie Lu, and Changshui Zhang, "Audio tonality mode classification without tonic annotations," in Proc. International Conference on Multimedia & Expo (ICME), 2008. <pdf> <poster>
|

Excitation Signal Extraction for Guitar Tones (Apr. 2007 - Jun. 2007)
 |
I worked on this project with Nelson Lee and Prof. Julius Smith at the Center for Computer Research in Music and Acoustics (CCRMA) at Stanford University when I was an exchange student. |
It was part of the larger Guitar Tone Analysis and Synthesis project. This work was concerned with extracting excitation signals from recorded plucked-string sounds of an acoustic guitar, for use in tone synthesis. The proposed method was based on removal of spectral peaks, followed by statistical
interpolation to reconstruct the excitation spectrum
in frequency intervals occluded by partial overtones. Experimental results on synthesized and real tones showed that it outperformed previous methods
in removing tonal components in the resulting excitation
signal while maintaining a noise-burst like quality.
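To make the "remove peaks, then fill the gaps" idea concrete, here is a small sketch on a magnitude spectrum stored as a list. It is deliberately simplified: plain linear interpolation is swapped in for the statistical interpolation used in the paper, and the function name, the `half_width` parameter, and the example values are hypothetical.

```python
def remove_peaks(spectrum, peak_bins, half_width=1):
    """Mask bins around each spectral peak and fill them by interpolation.

    spectrum: list of magnitudes, one per frequency bin.
    peak_bins: indices of the partials to remove.
    half_width: number of bins masked on each side of a peak.
    """
    n = len(spectrum)
    masked = [False] * n
    for b in peak_bins:
        for i in range(max(0, b - half_width), min(n, b + half_width + 1)):
            masked[i] = True

    known = [i for i in range(n) if not masked[i]]
    if not known:
        raise ValueError("every bin is masked; nothing to interpolate from")

    out = list(spectrum)
    for i in range(n):
        if not masked[i]:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = spectrum[right]   # gap at the low edge: hold the value
        elif right is None:
            out[i] = spectrum[left]    # gap at the high edge: hold the value
        else:
            t = (i - left) / (right - left)
            out[i] = (1 - t) * spectrum[left] + t * spectrum[right]
    return out


# A partial at bin 2 sitting on a slowly rising noise floor.
residual = remove_peaks([1.0, 2.0, 50.0, 4.0, 5.0], peak_bins=[2], half_width=0)
```

Linear interpolation yields a smooth fill, so it does not by itself preserve the noise-like texture of the excitation; that is presumably one reason a statistical interpolation is used instead.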
Related Papers:
[1] Nelson Lee, Zhiyao Duan, and Julius O. Smith, "Excitation signal extraction for guitar tones," in Proc. International
Computer Music Conference (ICMC), 2007. <pdf>
|

Music Source Separation (May 2006 - Oct. 2007)
 |
Music source separation aims to separate the sound streams of different sources (instruments and singing voices) in polyphonic music audio. It is closely related to automatic music transcription, which converts music audio to symbolic representations such as MIDI. On the one hand, transcription is much easier if the polyphonic audio is first separated into monophonic streams, because techniques for monophonic music transcription are much more mature. On the other hand, transcription results, such as the transcribed pitches, can be used to guide the separation of harmonic sources. |
Source separation of musical signals is an appealing
but difficult problem, especially in the single-channel case. In this
paper, an unsupervised single-channel music source separation
algorithm based on average harmonic structure modeling is proposed.
Under the assumption of playing in narrow pitch ranges,
different harmonic instrumental sources in a piece of music often
have different but stable harmonic structures, thus sources can
be characterized uniquely by harmonic structure models. Given
the number of instrumental sources, the proposed algorithm
learns these models directly from the mixed signal by clustering
the harmonic structures extracted from different frames. The
corresponding sources are then extracted from the mixed signal
using the models. Experiments on several mixed signals, including
synthesized instrumental sources, real instrumental sources and
singing voices, show that this algorithm outperforms the general
Nonnegative Matrix Factorization (NMF)-based source separation
algorithm, and yields good subjective listening quality. As a
side-effect, this algorithm estimates the pitches of the harmonic
instrumental sources. The number of concurrent sounds in each
frame is also computed, which is a difficult task for general
Multi-pitch Estimation (MPE) algorithms. Here are the extensive experimental results (including sound files).
Related Papers:
[1] Zhiyao Duan, Yungang Zhang, Changshui Zhang and Zhenwei Shi, "Unsupervised monaural music source separation by average harmonic structure modeling," IEEE Trans. Audio Speech Language Process., vol. 16, no. 4, pp. 766-778, 2008. <pdf> <sound files>
|
|