CN-Celeb
From cslt Wiki
Revision as of 01:52, 14 November 2019 (Thursday)
Introduction
- CN-Celeb is a large-scale speaker recognition dataset of Chinese celebrities, published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.
Members
- Current: Dong Wang, Yunqi Cai, Lantian Li, Yue Fan, Jiawen Kang
- History: Ziya Zhou, Kaicheng Li, Haolin Chen, Sitong Cheng, Pengyuan Zhang
Description
- Collect audio data from 1,000 Chinese celebrities.
- Automatically clip videos through a pipeline of face detection, face recognition, speaker validation and speaker diarization.
- Create a benchmark database for the speaker recognition community.
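Clipping ultimately turns frame-level decisions into audio segments. The sketch below is a minimal illustration, assuming the upstream stages have already produced one boolean per video frame (True when the POI's face is on screen and judged to be speaking). The function name `frames_to_clips` and its `max_gap` and `min_dur` parameters are hypothetical, and the default values are illustrative rather than the settings used to build CN-Celeb.

```python
from typing import List, Tuple


def frames_to_clips(flags: List[bool], fps: float, min_dur: float = 1.0,
                    max_gap: int = 0) -> List[Tuple[float, float]]:
    """Turn per-frame 'POI visible and speaking' flags into (start, end) clips in seconds.

    Runs of True frames separated by at most `max_gap` False frames are merged;
    clips shorter than `min_dur` seconds are dropped. (Hypothetical helper.)
    """
    runs = []
    start = None      # first frame index of the current run
    last_true = None  # last True frame index of the current run
    for i, flag in enumerate(flags):
        if not flag:
            continue
        if start is None:
            start = i
        elif i - last_true - 1 > max_gap:
            # Gap too long: close the current run and open a new one.
            runs.append((start, last_true))
            start = i
        last_true = i
    if start is not None:
        runs.append((start, last_true))

    clips = []
    for s, e in runs:
        t0, t1 = s / fps, (e + 1) / fps  # end is exclusive: covers frame e fully
        if t1 - t0 >= min_dur:
            clips.append((t0, t1))
    return clips
```

For example, with `fps=1` and `max_gap=1`, the flags `[T, T, F, T, T, T, F, F, T]` yield the clips (0.0, 6.0) and (8.0, 9.0): the single missing frame is bridged, the two-frame gap is not.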
Basic Methods
- Environments: TensorFlow, PyTorch, Keras, MXNet
- Face detection, tracking and recognition: RetinaFace (detection) and ArcFace (recognition) models.
- Active speaker verification: SyncNet model.
- Speaker diarization: UIS-RNN model.
- Double-check with speaker recognition: VGG model.
- Input: pictures and videos of POIs (Persons of Interest).
- Output: well-labelled videos of POIs (Persons of Interest).
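Face recognition, active speaker verification and the speaker-recognition double check each reduce to a score compared against a threshold. The NumPy sketch below assumes the embeddings and SyncNet audio-visual distances have already been computed by the respective models. The SyncNet confidence follows the median-minus-minimum measure described by Chung et al.; the cosine scoring, the averaging of frame embeddings, the function names and all thresholds are illustrative assumptions, not the settings of the CN-Celeb pipeline.

```python
from typing import Tuple

import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def syncnet_confidence(dists: np.ndarray) -> Tuple[int, float]:
    """dists[t, k] is the audio-visual embedding distance at frame t for candidate
    offset k. Returns (index of best offset, confidence), where confidence is the
    median minus the minimum of the frame-averaged distances."""
    mean_dist = dists.mean(axis=0)       # one averaged distance per offset
    best = int(np.argmin(mean_dist))     # offset where audio and lips agree best
    conf = float(np.median(mean_dist) - mean_dist[best])
    return best, conf


def keep_segment(face_embs: np.ndarray, poi_face_ref: np.ndarray,
                 av_dists: np.ndarray,
                 spk_emb: np.ndarray, poi_spk_ref: np.ndarray,
                 face_thr: float = 0.5, sync_thr: float = 3.0,
                 spk_thr: float = 0.5) -> bool:
    """Hypothetical per-segment filter; all thresholds are illustrative."""
    # Face recognition: mean of per-frame face embeddings (e.g. ArcFace outputs)
    # compared with the POI's reference face embedding.
    face_ok = cosine(face_embs.mean(axis=0), poi_face_ref) >= face_thr
    # Active speaker verification: is the visible face actually talking?
    _, conf = syncnet_confidence(av_dists)
    sync_ok = conf >= sync_thr
    # Double check: segment-level speaker embedding (the page lists a VGG model)
    # compared with the POI's voice reference.
    spk_ok = cosine(spk_emb, poi_spk_ref) >= spk_thr
    return face_ok and sync_ok and spk_ok
```

A clear dip in the averaged distance at one offset gives a high confidence; a flat distance profile (no lip-audio agreement at any offset) gives a confidence near zero, so the segment is rejected even if the face matches.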
Reports
Publications
@misc{fan2019cnceleb,
  title={CN-CELEB: a challenging Chinese speaker recognition dataset},
  author={Yue Fan and Jiawen Kang and Lantian Li and Kaicheng Li and Haolin Chen and Sitong Cheng and Pengyuan Zhang and Ziya Zhou and Yunqi Cai and Dong Wang},
  year={2019},
  eprint={1911.01799},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
Source Code
- Collection Pipeline: celebrity-audio-collection
- Baseline Systems: kaldi-cn-celeb
Download
- Local (not recommended)
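wav.tgz: http://cslt.riit.tsinghua.edu.cn/~data/CN-Celeb/wav.tgz
info.txt: http://cslt.riit.tsinghua.edu.cn/~data/CN-Celeb/info.txt
about.html: http://cslt.riit.tsinghua.edu.cn/~data/CN-Celeb/about.html
index.html: http://cslt.riit.tsinghua.edu.cn/~data/CN-Celeb/index.html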
- Public (recommended)
OpenSLR: http://www.openslr.org/82/
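An archive can be fetched and unpacked with a few lines of standard-library Python. This is a minimal sketch: the function names are hypothetical, and the URL is a parameter, for instance the local wav.tgz at http://cslt.riit.tsinghua.edu.cn/~data/CN-Celeb/wav.tgz (the OpenSLR mirror is recommended; take its exact file names from http://www.openslr.org/82/). The path check is a precaution against archives whose members would be written outside the target directory.

```python
import os
import tarfile
import urllib.request


def safe_extract(archive: str, dest: str) -> list:
    """Extract a (possibly compressed) tar archive into dest and return the member
    names. Links and members whose path would escape dest are rejected."""
    dest = os.path.realpath(dest)
    with tarfile.open(archive, "r:*") as tar:  # "r:*" auto-detects gzip/bz2/xz
        members = tar.getmembers()
        for m in members:
            target = os.path.realpath(os.path.join(dest, m.name))
            if m.issym() or m.islnk() or os.path.commonpath([dest, target]) != dest:
                raise ValueError(f"refusing to extract {m.name!r}")
        tar.extractall(dest, members=members)
    return [m.name for m in members]


def fetch_and_extract(url: str, dest: str) -> list:
    """Download url into dest (skipped if the file is already there), then unpack it."""
    os.makedirs(dest, exist_ok=True)
    archive = os.path.join(dest, os.path.basename(url))
    if not os.path.exists(archive):
        urllib.request.urlretrieve(url, archive)
    return safe_extract(archive, dest)
```

Keeping the downloaded archive in `dest` means an interrupted extraction can be rerun without downloading the (large) file again.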
Future Plans
- Expand the database to 10,000 people.
- Build an LSTM-based model that sits between SyncNet and speaker diarization and learns the relationship between the two.
License
- All the resources contained in the database are free for research institutes and individuals.
- No commercial use is permitted.
References
- Deng et al., "RetinaFace: Single-stage Dense Face Localisation in the Wild", 2019. [1]
- Deng et al., "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", 2018. [2]
- Wang et al., "CosFace: Large Margin Cosine Loss for Deep Face Recognition", 2018. [3]
- Liu et al., "SphereFace: Deep Hypersphere Embedding for Face Recognition", 2017. [4]
- Zhong et al., "GhostVLAD for set-based face recognition", 2018. [5]
- Chung et al., "Out of time: automated lip sync in the wild", 2016. [6]
- Xie et al., "Utterance-level Aggregation For Speaker Recognition In The Wild", 2019. [7]
- Zhang et al., "Fully Supervised Speaker Diarization", 2018. [8]