Difference between revisions of "Deep Speech Factorization-2"

From cslt Wiki

Revision as of 00:42, 24 July 2019 (Wednesday)

Introduction

Speech signals involve complex factors, each contributing in an unknown and hidden way. Recently developed deep learning methods provide some interesting tools for discovering these latent factors. These tools include unsupervised models such as VAE and GAN, and supervised learning methods such as multi-task learning and knowledge distillation. They allow us to decipher the secrets of speech signals based on big data rather than on hypotheses.

We expect these tools to lead to an unprecedented breakthrough in speech information processing. Early signs of this breakthrough include:

  • In speaker recognition, speaker factors can be learned from a very short speech segment.
  • In speech synthesis, speaking styles can be learned as latent variables and discovered in an unsupervised way, and speaker factors can be used to change the speaker trait.
  • In speech recognition, learning multiple tasks in a collaborative way has been shown to be successful.

In previous studies (Phase 1), we found that, using cascade learning, speech signals can be factorized into content, speaker and emotion at the frame level. In Phase 2, we will try to answer the following questions:

  • Can we factorize speech signals in an unsupervised way?
  • How can supervised and unsupervised factorizations be integrated?
  • How can we deal with language discrepancy in factorization?
  • How can we discover optimal factorization architectures?

People

Dong Wang, Yunqi Cai, Haoran Sun

Research direction

Basic research

  • Collaborative learning with AutoML
  • VAE/dVAE factorization
  • Supervised VAE for factorization
  • ASR + TTS cycle training
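
As a point of reference for the VAE items above, the two quantities that any standard VAE training objective combines can be sketched as follows. This is only a minimal illustration of the generic VAE from Kingma et al. (see Further reading), not the factorization model used in this project: the function names, the squared-error reconstruction term, and the plain-list vectors are all assumptions made for the example.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def negative_elbo(x, x_recon, mu, log_var):
    """Squared-error reconstruction term (an assumed likelihood) plus the KL term."""
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -0.2], [0.0, -1.0]   # hypothetical encoder outputs
    z = reparameterize(mu, log_var, rng)     # latent sample passed to a decoder
    print(len(z), kl_to_standard_normal(mu, log_var))
```

In a real system the encoder outputs `mu` and `log_var` would come from a neural network applied to speech features, and the reconstruction would come from a decoder network; both are omitted here.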

Applied research

  • Pretraining for ASR, SID, EMD (BERT in speech)
  • Low-resource ASR, TTS
  • Signal compression, cleaning up, etc.


Related publications

  1. Yang Zhang, Lantian Li, and Dong Wang, "VAE-based regularization for deep speaker embedding", Interspeech 2019.
  2. Lantian Li, Yixiang Chen, Ying Shi, Zhiyuan Tang, and Dong Wang, "Deep speaker feature learning for text-independent speaker verification", Interspeech 2017.
  3. Lantian Li, Dong Wang, Yixiang Chen, Ying Shi, Zhiyuan Tang, http://wangd.cslt.org/public/pdf/spkfact.pdf
  4. Lantian Li, Zhiyuan Tang, Dong Wang, "Full-info training for deep speaker feature learning", http://wangd.cslt.org/public/pdf/mlspk.pdf
  5. Zhiyuan Tang, Lantian Li, Dong Wang, Ravi Vipperla, "Collaborative Joint Training with Multi-task Recurrent Model for Speech and Speaker Recognition", IEEE Trans. on Audio, Speech and Language Processing, vol. 25, no. 3, March 2017.
  6. Dong Wang, Lantian Li, Ying Shi, Yixiang Chen, Zhiyuan Tang, "Deep Factorization for Speech Signal", https://arxiv.org/abs/1706.01777

Further reading

ML

  1. Goodfellow et al., "Generative adversarial nets", Advances in Neural Information Processing Systems, 2014
  2. Kingma et al., "Auto-encoding variational Bayes", 2014
  3. Kingma et al., "Improving Variational Inference with Inverse Autoregressive Flow", 2016
  4. Rezende et al., "Variational Inference with Normalizing Flows", 2016
  5. Zhu et al., "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks", 2017
  6. Chen et al., "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets", 2016
  7. Hu et al., "On unifying deep generative models", 2017
  8. Makhzani et al., "Adversarial Autoencoders", 2015

TTS

  1. Wang et al., "Tacotron: A fully end-to-end text-to-speech synthesis model", CoRR, abs/1703.10135, 2017
  2. van den Oord et al., "Parallel WaveNet: Fast high-fidelity speech synthesis", CoRR, abs/1711.10433, 2017
  3. van den Oord et al., "WaveNet: A generative model for raw audio", CoRR, abs/1609.03499, 2016
  4. Kalchbrenner et al., "Efficient Neural Audio Synthesis", 2018 (WaveRNN)