ISCSLP Tutorial 2

Prof. Chung-Hsien

Arousal-valence coordinates
decompose the emotion process into sub-emotions

available databases:
by collection method:

acted: GEneva Multimodal Emotion Portrayals (GEMEP)
induced : eNTERFACE'05 EMOTION Database
spontaneous: SEMAINE, AFEW

others: RML, VAM, FAU AIBO, SAVEE, TUM AVIC, IEMOCAP, SEMAINE, MHMC

static vs dynamic modeling

STATIC:

low level descriptors (LLDs) and functionals
good at discriminating high-arousal from low-arousal emotions
temporal information is lost; not suitable for long utterances; cannot detect changes in emotion
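
The loss of temporal information can be seen in a minimal sketch. It assumes that "functionals" means utterance-level statistics (mean, standard deviation, extremes) computed over one frame-level LLD track; the F0 values are hypothetical.

```python
import statistics

def functionals(lld_track):
    """Collapse one frame-level LLD track into utterance-level statistics."""
    return {
        "mean": statistics.fmean(lld_track),
        "std": statistics.pstdev(lld_track),
        "min": min(lld_track),
        "max": max(lld_track),
        "range": max(lld_track) - min(lld_track),
    }

# Hypothetical per-frame F0 values (Hz): one rising and one falling contour.
rising = [100.0, 120.0, 140.0, 160.0]
falling = [160.0, 140.0, 120.0, 100.0]

# The two contours move in opposite directions, yet these order-insensitive
# statistics cannot tell them apart.
print(functionals(rising))
print(functionals(falling))
```

Any statistic that ignores frame order behaves this way, which is why a change of emotion inside a long utterance goes unnoticed.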

DYNAMIC:

the frame is the basic unit; LLDs are extracted per frame and modeled with GMMs, HMMs, or DTW
temporal information is retained
difficult to model context well
a large number of local features need to be extracted
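
Of the dynamic models listed above, DTW is the simplest to show in a few lines. The sketch below is a textbook DTW on 1-D sequences with an absolute-difference local cost; it is not taken from the tutorial, and the sequences are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]  # same shape, half speed
other = [3.0, 2.0, 1.0, 2.0, 3.0]                          # inverted shape

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The slowed-down copy aligns at zero cost because DTW may repeat frames, while the inverted contour does not; the frame order matters here, unlike in static functionals.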

Unit choice for dynamic modeling

technical unit: frame, time slice, equally-divided unit
meaningful unit: word, syllable, phrase
emotionally consistent unit: emotion profiles, emotograms
different aspects of speech tasks take place at different time scales

feature concatenation or decision fusion to exploit the information from segmented units

speech features:

prosodic features (pitch, formants, energy, speaking rate): good for arousal-related emotions
ZCR, RMS energy, F0, harmonics-to-noise ratio, MFCC
Teager features are good for detecting stress
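
Three of the descriptors above are simple enough to compute directly. The sketch below uses the standard definitions of zero-crossing rate, RMS energy, and the discrete Teager energy operator, x[n]^2 - x[n-1] * x[n+1]; it assumes that "Teager feature" refers to that operator. The frame is a hypothetical toy signal.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for x, y in zip(frame, frame[1:]) if (x >= 0) != (y >= 0))
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Root-mean-square amplitude of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def teager_energy(frame):
    """Discrete Teager energy operator for every interior sample."""
    return [
        frame[n] ** 2 - frame[n - 1] * frame[n + 1]
        for n in range(1, len(frame) - 1)
    ]

# One period of a unit sine sampled four times per cycle, plus the closing zero.
frame = [0.0, 1.0, 0.0, -1.0, 0.0]

print(zero_crossing_rate(frame))
print(rms_energy(frame))
print(teager_energy(frame))
```

For a pure sinusoid the Teager operator is constant over time, so deviations from a flat profile reflect changes in amplitude or frequency.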

recognition models

SVM, ANN, HMM, GMM, CART

Emotion distillation framework

distill emotion-specific features from the original high-dimensional features
starting from speech signals, SVMs generate emotiongrams, which are then passed to an HMM, n-gram, LDA, or simple-sum model to give the emotion output

Hierarchical classification structure

first detect high vs. low arousal

Fusion based recognition

Feature level fusion
decision level fusion
Model-based fusion: multi-stream HMM
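
The first two fusion strategies differ only in where the streams are combined. A minimal sketch follows, assuming feature-level fusion means concatenating feature vectors before classification and decision-level fusion means a weighted average of per-classifier posteriors; the function names, weights, and scores are hypothetical.

```python
def feature_level_fusion(*feature_vectors):
    """Concatenate the feature vectors of all streams into one input vector."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused

def decision_level_fusion(posteriors, weights):
    """Weighted average of the class posteriors produced by separate classifiers."""
    total = sum(weights)
    return {
        label: sum(w * p[label] for p, w in zip(posteriors, weights)) / total
        for label in posteriors[0]
    }

# Hypothetical outputs of two single-stream classifiers for one utterance.
stream_a = {"happy": 0.6, "sad": 0.4}
stream_b = {"happy": 0.2, "sad": 0.8}

fused = decision_level_fusion([stream_a, stream_b], weights=[1.0, 1.0])
print(max(fused, key=fused.get))
print(feature_level_fusion([0.1, 0.2], [0.3]))
```

Model-based fusion, such as a multi-stream HMM, combines the streams inside the model itself and is not covered by this sketch.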

Temporal phase-based modeling

divide the emotion into onset, apex, offset
use an HMM to characterize one emotional sub-state instead of the entire emotional state
6 states in total: (onset, apex, offset) × (high, low)
Temporal course modeling

Structure-based modeling

three level units: utterance, emotion units, sub emotion units
use statistical models across the different levels

Hsin-Min Wang

Music information retrieval (MIR)

title search
search by query:
the emotion labels that listeners assign to a song form a Gaussian distribution
represent the acoustic features of a song by a probabilistic histogram vector
acoustic GMM posterior representation as a feature
GMM codebook constructed in training (VA GMM)
tags can also be placed in the VA space
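
The acoustic GMM posterior representation can be sketched as follows. The code assumes the clip-level vector is the average of per-frame component posteriors, and for brevity uses 1-D features; the component weights, means, and variances are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmm_posterior_vector(frames, weights, means, variances):
    """Average the per-frame component posteriors into one clip-level vector."""
    k = len(weights)
    acc = [0.0] * k
    for x in frames:
        lik = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        for i in range(k):
            acc[i] += lik[i] / total  # posterior of component i for this frame
    return [a / len(frames) for a in acc]

# Hypothetical 2-component acoustic codebook and four frames of one clip.
weights, means, variances = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]
frames = [0.1, -0.2, 9.8, 0.3]

vector = gmm_posterior_vector(frames, weights, means, variances)
print(vector)
```

The entries sum to one, so the vector acts as a soft histogram over the codebook: three of the four frames sit near the first component, and the vector reflects that proportion.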

Video to Audio Retrieval

First predict video emotion
then retrieve audio with a matching emotion
this can be reversed (audio-to-video retrieval)
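
One way to read this step is as a nearest-neighbour search in the valence-arousal plane. The sketch below assumes that both the video and every audio track have already been mapped to a single (valence, arousal) point and that Euclidean distance is the matching criterion; the track names and coordinates are hypothetical.

```python
import math

def retrieve_audio(video_va, audio_library):
    """Return the track whose (valence, arousal) point is closest to the video's."""
    return min(audio_library, key=lambda name: math.dist(video_va, audio_library[name]))

# Hypothetical predicted (valence, arousal) points.
audio_library = {
    "track_a": (0.8, 0.7),
    "track_b": (-0.6, -0.4),
    "track_c": (-0.5, 0.6),
}
video_va = (0.7, 0.5)

print(retrieve_audio(video_va, audio_library))
```

Swapping the roles of the query and the library gives the reverse direction with the same function.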

Emotion variability, by Prof. Vidhyasaharan Sethu:

GMM supervector based emotion

t-SNE for visualization in 2-D space
remove phone variability by phone dependent GMMs
speaker normalization is important for emotion recognition
two ways: speaker adaptation & speaker signal
KL-based estimation on speaker and emotion variability
speaker normalization by feature warping

speaker variation modeling with JFA
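
Feature warping, mentioned above as a speaker normalization method, maps each feature stream onto a standard normal distribution by rank. The sketch below is a simplification that warps over a whole utterance, whereas feature warping is usually applied over a sliding window; the input values are hypothetical.

```python
from statistics import NormalDist

def warp_feature(values):
    """Map one feature stream onto a standard normal by rank (no sliding window)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    warped = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the target quantile strictly inside (0, 1).
        warped[i] = std_normal.inv_cdf((rank + 0.5) / n)
    return warped

# The same contour produced by two hypothetical speakers with different
# offset and scale.
speaker_a = [1.0, 3.0, 2.0, 5.0, 4.0]
speaker_b = [10.0 * x + 100.0 for x in speaker_a]

print(warp_feature(speaker_a))
print(warp_feature(speaker_b))
```

Because only ranks are used, any monotonic speaker-dependent offset or scaling disappears after warping.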

Speaker adaptation : speaker library

Cognitive load by Julien Epps

cognitive load = arousal?
load measure: analytical measure (number of ++); physical measure: EEG, ECG/HRV, GSR, respiration; task measure: speech, drawing...
Glottal features
SDC (shifted delta cepstra): a longer-span MFCC feature, quite similar to delta MFCC but computed with long shifts
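
The relation between SDC and delta MFCC can be made concrete with a short sketch. It assumes the usual N-d-P-k parameterization: each delta is c[t + d] - c[t - d], and k such deltas taken P frames apart are stacked per frame. Frames near the edges are simply dropped, which is a choice made for this sketch; the input is a hypothetical 1-dimensional cepstral ramp.

```python
def delta(cepstra, t, d):
    """Delta vector at frame t with spread d: c[t + d] - c[t - d]."""
    return [a - b for a, b in zip(cepstra[t + d], cepstra[t - d])]

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted delta cepstra: stack k delta vectors taken P frames apart."""
    # Frames at or beyond this index would read past the end of the sequence.
    last = len(cepstra) - (k - 1) * P - d
    out = []
    for t in range(d, last):
        stacked = []
        for i in range(k):
            stacked.extend(delta(cepstra, t + i * P, d))
        out.append(stacked)
    return out

# Hypothetical 12-frame sequence of 1-dimensional cepstra (a linear ramp).
ramp = [[float(t)] for t in range(12)]

print(sdc(ramp, d=1, P=3, k=3))  # each frame spans 3 blocks, 3 frames apart
print(sdc(ramp, d=1, P=3, k=1))  # k = 1 reduces to ordinary delta features
```

With k = 1 the output is the plain delta feature; larger k extends the temporal context of every frame by (k - 1) * P frames.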

Future: relationship between cognitive load and arousal; multimodal data; improve discrimination; test under less constrained conditions

Retrieved from "http://cslt.org/mediawiki/index.php?title=ISCSLP_Tutorial_2&oldid=11343"