Difference between revisions of "CN-Celeb"

Revision as of 07:29, 31 October 2019 (Thursday)

Introduction

  • CN-Celeb is a large-scale dataset of Chinese celebrities published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.

Members

  • Current: Dong Wang, Yunqi Cai, Lantian Li, Yue Fan, Jiawen Kang
  • Former: Ziya Zhou, Kaicheng Li, Haolin Chen, Sitong Cheng, Pengyuan Zhang

Description

  • Collect audio data of 1,000 Chinese celebrities.
  • Automatically clip videos through a pipeline including face detection, face recognition, speaker validation and speaker diarization.
  • Create a benchmark database for the speaker recognition community.

Basic Methods

  • Environments: TensorFlow, PyTorch, Keras, MXNet.
  • Face detection and tracking: RetinaFace and ArcFace models.
  • Active speaker verification: SyncNet model.
  • Speaker diarization: UIS-RNN model.
  • Double-check by speaker recognition: VGG model.
  • Input: pictures and videos of POIs (Persons of Interest).
  • Output: well-labelled videos of POIs.
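
The stages listed above can be read as a chain of filters over time intervals of a video. The following is a minimal sketch of that idea only, not the project's actual code: it assumes each stage returns the (start, end) intervals, in seconds, where it accepts the POI, and that the pipeline keeps only what every stage accepts. The names `intersect`, `make_stub_stage`, `run_pipeline`, the file name `demo.mp4`, and all interval values are hypothetical; the stub stages stand in for the real RetinaFace/ArcFace, SyncNet, UIS-RNN, and VGG models.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)
Stage = Callable[[str, List[Interval]], List[Interval]]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted lists of non-overlapping intervals."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def make_stub_stage(accepted: List[Interval]) -> Stage:
    """Build a stand-in stage that keeps only candidates inside `accepted`.

    A real stage would run a model on the video instead (assumption).
    """
    def stage(video_path: str, candidates: List[Interval]) -> List[Interval]:
        return intersect(candidates, accepted)
    return stage


def run_pipeline(video_path: str, duration: float,
                 stages: List[Stage]) -> List[Interval]:
    """Pass the whole video through each stage in order, narrowing the intervals."""
    intervals: List[Interval] = [(0.0, duration)]
    for stage in stages:
        intervals = stage(video_path, intervals)
        if not intervals:  # nothing left to check
            break
    return intervals


# Hypothetical stage outputs for a 60-second video.
stages = [
    make_stub_stage([(2.0, 10.0), (15.0, 30.0)]),               # face detection/recognition
    make_stub_stage([(3.0, 8.0), (16.0, 20.0), (25.0, 40.0)]),  # active speaker verification
    make_stub_stage([(0.0, 7.0), (18.0, 28.0)]),                # speaker diarization
    make_stub_stage([(0.0, 60.0)]),                             # speaker-recognition double-check
]
segments = run_pipeline("demo.mp4", 60.0, stages)
print(segments)
```

Treating every stage as an interval filter keeps the stages independent of one another, so a single model can be swapped without touching the rest of the chain.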

GitHub of This Project

celebrity-audio-collection: https://github.com/celebrity-audio-collection/videoprocess

Reports

Stage report v1.0: http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/%E6%96%87%E4%BB%B6:C-STAR.pdf

Download

Publications

Future Plans

  • Expand the database to 10,000 people.
  • Build an LSTM-based model that sits between SyncNet and Speaker_Diarization and learns the relationship between them.

License

  • All the resources contained in the database are free for research institutes and individuals.
  • No commercial use is permitted.

References

  • Deng et al., "RetinaFace: Single-stage Dense Face Localisation in the Wild", 2019. [1] (https://arxiv.org/pdf/1905.00641.pdf)
  • Deng et al., "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", 2018. [2] (https://arxiv.org/abs/1801.07698)
  • Wang et al., "CosFace: Large Margin Cosine Loss for Deep Face Recognition", 2018. [3]
  • Liu et al., "SphereFace: Deep Hypersphere Embedding for Face Recognition", 2017. [4]
  • Zhong et al., "GhostVLAD for set-based face recognition", 2018. [5]
  • Chung et al., "Out of time: automated lip sync in the wild", 2016. [6]
  • Xie et al., "Utterance-level Aggregation For Speaker Recognition In The Wild", 2019. [7]
  • Zhang et al., "Fully Supervised Speaker Diarization", 2018. [8]