CN-Celeb
From cslt Wiki
Latest revision as of 02:14, 26 November 2024 (Tuesday)
Introduction
- CN-Celeb is a large-scale Chinese celebrity dataset published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.
Members
- Current: Dong Wang, Yunqi Cai, Lantian Li, Yue Fan, Jiawen Kang
- History: Ziya Zhou, Kaicheng Li, Haolin Chen, Sitong Cheng, Pengyuan Zhang
Description
- Collect audio data of 1,000 Chinese celebrities.
- Automatically clip videos through a pipeline including face detection, face recognition, speaker validation and speaker diarization.
- Create a benchmark database for the speaker recognition community.
Basic Methods
- Environments: TensorFlow, PyTorch, Keras, MXNet
- Face detection and tracking: RetinaFace and ArcFace models.
- Active speaker verification: SyncNet model.
- Speaker diarization: UIS-RNN model.
- Double check by speaker recognition: VGG model.
- Input: pictures and videos of POIs (Persons of Interest).
- Output: well-labelled videos of POIs (Persons of Interest).
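The stages listed above can be read as a chain of filters over candidate video segments. The sketch below is only an illustration of that ordering, not the project's actual code (see the collection pipeline repository for that). `Segment`, `run_pipeline`, `keep_if`, `double_check` and the cosine-similarity threshold are all hypothetical names and assumptions introduced here; the real models (RetinaFace/ArcFace, SyncNet, UIS-RNN, VGG) would sit behind the predicate and embedding callables.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass(frozen=True)
class Segment:
    """A candidate clip of a video, with start/end times in seconds (hypothetical type)."""
    video: str
    start: float
    end: float


# A stage receives the surviving segments and returns the ones it keeps.
Stage = Callable[[List[Segment]], List[Segment]]


def run_pipeline(segments: Iterable[Segment], stages: Sequence[Stage]) -> List[Segment]:
    """Apply each stage in order and stop early once nothing survives."""
    current = list(segments)
    for stage in stages:
        current = stage(current)
        if not current:
            break
    return current


def keep_if(predicate: Callable[[Segment], bool]) -> Stage:
    """Wrap a yes/no decision (e.g. 'the POI's face is visible') as a stage."""
    def stage(segs: List[Segment]) -> List[Segment]:
        return [s for s in segs if predicate(s)]
    return stage


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def double_check(embed: Callable[[Segment], Sequence[float]],
                 enrolled: Sequence[float],
                 threshold: float) -> Stage:
    """Assumed form of the final check: keep segments whose speaker embedding
    is close enough (by cosine similarity) to the POI's enrolled embedding."""
    def stage(segs: List[Segment]) -> List[Segment]:
        return [s for s in segs if cosine(embed(s), enrolled) >= threshold]
    return stage
```

In use, one `keep_if` stage per model decision (face match, lip-sync check, diarization result) would be followed by `double_check`, with `embed` backed by the speaker recognition model and `threshold` tuned on held-out data.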
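Reports
- Stage report v1.0: http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/%E6%96%87%E4%BB%B6:C-STAR.pdf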
Publications
@misc{fan2019cnceleb,
  title={CN-CELEB: a challenging Chinese speaker recognition dataset},
  author={Yue Fan and Jiawen Kang and Lantian Li and Kaicheng Li and Haolin Chen and Sitong Cheng and Pengyuan Zhang and Ziya Zhou and Yunqi Cai and Dong Wang},
  year={2019},
  eprint={1911.01799},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}

@misc{li2020cn,
  title={CN-Celeb: multi-genre speaker recognition},
  author={Lantian Li and Ruiqi Liu and Jiawen Kang and Yue Fan and Hao Cui and Yunqi Cai and Ravichander Vipperla and Thomas Fang Zheng and Dong Wang},
  year={2020},
  eprint={2012.12468},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
Source Code
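- Collection Pipeline: celebrity-audio-collection (https://github.com/celebrity-audio-collection/videoprocess)
- Baseline Systems: kaldi-cn-celeb (https://github.com/csltstu/kaldi/tree/cnceleb/egs/cnceleb)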
Download
- Public (recommended)
OpenSLR: http://www.openslr.org/82/
- Local (not recommended)
CSLT@Tsinghua: http://index.cslt.org/~data/CN-Celeb/
Future Plans
- Augment the database to 10,000 people.
- Build an LSTM-based model that connects SyncNet and speaker diarization and learns the relationship between them.
License
- All the resources contained in the database are free for research institutes and individuals.
- No commercial use is permitted.
References
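- Deng et al., "RetinaFace: Single-stage Dense Face Localisation in the Wild", 2019. https://arxiv.org/pdf/1905.00641.pdf
- Deng et al., "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", 2018. https://arxiv.org/abs/1801.07698
- Wang et al., "CosFace: Large Margin Cosine Loss for Deep Face Recognition", 2018. https://arxiv.org/pdf/1801.09414.pdf
- Liu et al., "SphereFace: Deep Hypersphere Embedding for Face Recognition", 2017. https://arxiv.org/pdf/1704.08063.pdf
- Zhong et al., "GhostVLAD for set-based face recognition", 2018. http://www.robots.ox.ac.uk/~vgg/publications/2018/Zhong18b/zhong18b.pdf
- Chung et al., "Out of time: automated lip sync in the wild", 2016. http://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/chung16a.pdf
- Xie et al., "Utterance-level Aggregation For Speaker Recognition In The Wild", 2019. https://arxiv.org/pdf/1902.10107.pdf
- Zhang et al., "Fully Supervised Speaker Diarization", 2018. https://arxiv.org/pdf/1810.04719v1.pdf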