ISCSLP Tutorial 2
From cslt Wiki
Prof. Chung-Hsien
- Arousal and valence coordinates
- separate the emotion process into sub-emotions
- available databases:
- database collection:
- acted: Geneva Multimodal Emotion Portrayals (GEMEP)
- induced : eNTERFACE'05 EMOTION Database
- spontaneous: SEMAINE, AFEW
- others: RML, VAM, FAU AIBO, SAVEE, TUM AVIC, IEMOCAP, SEMAINE MHMC
- static vs dynamic modeling
STATIC:
- low level descriptors (LLDs) and functionals
- good at discriminating high- and low-arousal emotions
- temporal information is lost; not suitable for long utterances; cannot detect changes in emotion
DYNAMIC:
- frame as the basis, LLDs are extracted and modeled by GMMs, HMMs, DTW
- temporal information is obtained
- difficult to model context well
- a large number of local features need to be extracted
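As a rough illustration of the static approach described above, the sketch below collapses frame-level LLD contours into utterance-level functionals. The LLD names, the sample values, and the choice of five functionals are assumptions for illustration only; they are not taken from the tutorial.

```python
import statistics

FUNCTIONAL_NAMES = ("mean", "std", "min", "max", "range")

def functionals(lld_track):
    """Collapse one frame-level LLD contour into utterance-level statistics."""
    low, high = min(lld_track), max(lld_track)
    return {
        "mean": statistics.fmean(lld_track),
        "std": statistics.pstdev(lld_track),
        "min": low,
        "max": high,
        "range": high - low,
    }

def static_feature_vector(llds):
    """Map {LLD name: per-frame values} to one fixed-length vector."""
    vector = []
    for name in sorted(llds):  # sorted so the feature order is stable
        stats = functionals(llds[name])
        vector.extend(stats[key] for key in FUNCTIONAL_NAMES)
    return vector

# Hypothetical frame-level values, for illustration only.
utterance = {
    "f0": [110.0, 120.0, 130.0, 120.0],
    "energy": [0.2, 0.4, 0.6, 0.4],
}
features = static_feature_vector(utterance)
```

The vector length depends only on the number of LLDs and functionals, not on the utterance length. Reordering the frames leaves the vector unchanged, which is the loss of temporal information noted for static modeling.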
- Unit choice for dynamic modeling
- technical unit: frame, time slice, equally-divided unit
- meaningful unit: word, syllable, phrases
- emotionally consistent unit: emotion profiles, emotograms
- different aspects of speech tasks take place at different time scales
- feature concatenation or decision fusion to exploit the information from segmented units
- speech features:
- prosodic features: pitch, formants, energy, speaking rate; good for arousal-related emotions
- ZCR, RMS energy, F0, harmonics-to-noise ratio, MFCC
- MFCC
- Teager feature is good for detecting stress
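Two of the simpler LLDs listed above, zero-crossing rate (ZCR) and RMS energy, can be sketched per frame as follows. The frame length, hop size, and the helper name `frame_signal` are hypothetical choices for this example, not values from the tutorial.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping fixed-length frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

# Hypothetical toy signal: a square wave that flips sign every sample.
signal = [1.0 if i % 2 == 0 else -1.0 for i in range(10)]
frames = frame_signal(signal, frame_len=4, hop=2)
llds = [(zero_crossing_rate(f), rms_energy(f)) for f in frames]
```

Each frame yields one (ZCR, RMS) pair; such per-frame values are what the static functionals or the dynamic models then consume.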
- recognition models
- SVM, ANN, HMM, GMM, CART
- Emotion distillation framework
- emotion-specific features derived from the original high-dimensional features
- from speech signals, use SVMs to generate emotograms, then use HMM, n-gram, LDA, or a simple sum to give the emotion output
- Hierarchical classification structure
- first detect high/low arousal
- Fusion based recognition
- Feature level fusion
- decision level fusion
- Model-based fusion: multi-stream HMM
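A minimal sketch of decision-level fusion, assuming each classifier outputs a posterior over the same emotion classes and that fusion is a weighted sum. The classifier names, class labels, scores, and equal weights are all hypothetical; the notes do not say which combination rule the tutorial used.

```python
def fuse_decisions(posteriors, weights):
    """Weighted sum of per-classifier class posteriors, renormalized to sum to 1."""
    classes = list(posteriors[0])
    fused = {
        c: sum(w * p[c] for p, w in zip(posteriors, weights)) for c in classes
    }
    total = sum(fused.values())
    return {c: score / total for c, score in fused.items()}

# Hypothetical outputs of two classifiers trained on different feature sets.
prosody_clf = {"happy": 0.6, "sad": 0.1, "angry": 0.3}
spectral_clf = {"happy": 0.2, "sad": 0.2, "angry": 0.6}

fused = fuse_decisions([prosody_clf, spectral_clf], weights=[0.5, 0.5])
label = max(fused, key=fused.get)
```

Feature-level fusion would instead concatenate the two feature vectors before training a single classifier, so the combination happens before any decision is made.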
- Temporal phase-based modeling
- divide the emotion into onset, apex, offset
- use an HMM to characterize one emotional sub-state instead of the entire emotional state
- 6 states in total: (onset, apex, offset) × (high, low)
- Temporal course modeling
- Structure-based modeling
- three level units: utterance, emotion units, sub emotion units
- use statistical models across the different levels
Hsin-Min Wang
- Music information retrieval (MIR)
- title search
- search by query:
- emotion of songs labelled by persons forms a Gaussian
- represent the acoustic features of a song by a probabilistic histogram vector
- acoustic GMM posterior representation as a feature
- GMM code book constructed in training (VA GMM)
- can put the tag into VA space
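The acoustic GMM posterior representation mentioned above can be sketched as follows: compute each frame's posterior over the GMM components, then average over frames to get one vector per song. This is a simplified illustration assuming a diagonal-covariance GMM given as `(weight, mean, variance)` tuples; the toy codebook and frames are hypothetical, and the mapping into VA space is not shown.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def frame_posterior(x, gmm):
    """Posterior probability of each GMM component given one frame."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(logs)  # subtract the max before exp for numerical stability
    exps = [math.exp(value - peak) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def song_posterior_vector(frames, gmm):
    """Average the frame posteriors into one vector that sums to 1."""
    acc = [0.0] * len(gmm)
    for x in frames:
        for k, p in enumerate(frame_posterior(x, gmm)):
            acc[k] += p
    return [a / len(frames) for a in acc]

# Hypothetical two-component codebook and three feature frames.
codebook = [
    (0.5, [0.0, 0.0], [1.0, 1.0]),
    (0.5, [10.0, 10.0], [1.0, 1.0]),
]
song_frames = [[0.0, 0.0], [0.1, -0.1], [10.0, 10.0]]
vector = song_posterior_vector(song_frames, codebook)
```

With two frames near the first component and one near the second, the vector comes out close to (2/3, 1/3), which is the sense in which it acts like a soft histogram over the codebook.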
- Video to Audio Retrieval
- First predict video emotion
- put audio
- this can be reversed
Emotion variability, by Prof. Vidhyasaharan Sethu:
- GMM supervector based emotion
- t-SNE for visualization in 2-D space
- remove phone variability by phone dependent GMMs
- speaker normalization is important for emotion recognition
- two ways: speaker adaptation & speaker signal
- KL-based estimation on speaker and emotion variability
- speaker normalization by feature warping
- speaker variation modeling with JFA
- Speaker adaptation : speaker library
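Feature warping, listed above as one route to speaker normalization, maps a feature's empirical distribution onto a standard normal by rank. The sketch below is a simplified version that warps over the whole sequence; the usual formulation applies the same mapping inside a sliding window, and the input values here are hypothetical.

```python
from statistics import NormalDist

def warp_feature(values):
    """Replace each value by the standard-normal quantile of its rank."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    order = sorted(range(n), key=lambda i: values[i])
    warped = [0.0] * n
    for rank, index in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        warped[index] = std_normal.inv_cdf((rank + 0.5) / n)
    return warped

# Hypothetical feature track from one speaker, then the same track shifted
# and scaled to mimic a different speaker or channel.
speaker_a = [5.0, 1.0, 3.0]
speaker_b = [2.0 * v + 100.0 for v in speaker_a]

warped_a = warp_feature(speaker_a)
warped_b = warp_feature(speaker_b)
```

Because only ranks matter, any monotonic shift or scaling of the feature yields the same warped values, which is why this removes some speaker-dependent offset.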
Cognitive load, by Julien Epps:
- cognitive load = arousal?
- load measure: analytical measure (number of ++); physical measure: EEG, ECG/HRV, GSR, respiration; task measure: speech, drawing...
- Glottal features
- SDC (shifted delta cepstra): a longer-span MFCC feature, quite similar to delta MFCC but computed with longer shifts
- Future: relationship between cognitive load and arousal; multimodal data; improve discrimination; test under less constrained conditions
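The SDC feature noted above can be sketched as stacking several delta vectors taken at shifted positions. The parameters `d` (delta spread), `p` (shift between blocks), and `k` (number of blocks), their default values, and the toy cepstra are assumptions for illustration; the notes do not record the configuration used in the talk.

```python
def shifted_delta_cepstra(cepstra, d=1, p=3, k=7):
    """Stack k delta vectors, spaced p frames apart, for every valid frame."""
    n_frames = len(cepstra)
    sdc = []
    # Keep only frames where every needed neighbor index is in range.
    for t in range(d, n_frames - (k - 1) * p - d):
        stacked = []
        for i in range(k):
            ahead = cepstra[t + i * p + d]
            behind = cepstra[t + i * p - d]
            stacked.extend(a - b for a, b in zip(ahead, behind))
        sdc.append(stacked)
    return sdc

# Hypothetical 2-dimensional cepstra that grow linearly with time.
toy_cepstra = [[float(t), 2.0 * t] for t in range(30)]
sdc = shifted_delta_cepstra(toy_cepstra)
```

A plain delta feature corresponds to a single block (`k=1`); stacking `k` blocks spaced `p` frames apart is what lets SDC cover a longer time span.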