Difference between revisions of "CN-CVS"

Latest revision as of 11:47, 30 October 2022 (Sunday)

Collect audio and video data from more than 2500 Mandarin speakers.
Automatically clip videos through a pipeline that includes shot detection, VAD, face detection, face tracking, and audio-visual synchronization detection.
Manually annotate speaker identity and have humans check data quality.
Create a benchmark database for the video-to-speech synthesis task.
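
The clipping step above can be pictured as keeping only the time spans that every stage accepts. The following is a minimal sketch of that idea, not the project's actual collector code: the function names (`intersect`, `clip_candidates`), the `min_len` threshold, and the per-stage time spans are all hypothetical, and each real detector (shot detection, VAD, face tracking, sync detection) is assumed to have already produced a sorted list of accepted intervals in seconds.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlap of two sorted, non-overlapping interval lists."""
    out: List[Interval] = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def clip_candidates(stage_outputs: List[List[Interval]],
                    min_len: float = 2.0) -> List[Interval]:
    """Keep spans accepted by every stage and at least min_len seconds long.

    min_len is an illustrative threshold, not a value from the dataset.
    """
    if not stage_outputs:
        return []
    kept = stage_outputs[0]
    for spans in stage_outputs[1:]:
        kept = intersect(kept, spans)
    return [(s, e) for s, e in kept if e - s >= min_len]


# Hypothetical per-stage outputs for one video (seconds).
shots = [(0.0, 10.0), (10.0, 25.0)]          # shot detection
speech = [(1.0, 8.0), (9.5, 12.0), (14.0, 20.0)]  # VAD
face_tracks = [(0.0, 7.0), (11.0, 22.0)]     # face detection + tracking
in_sync = [(0.5, 30.0)]                      # audio-visual sync check

clips = clip_candidates([shots, speech, face_tracks, in_sync])
print(clips)
```

Because the shot list is intersected first, no candidate clip crosses a shot boundary; short leftovers are then dropped by the length filter.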

All the resources contained in the database are free for research institutes and individuals.
No commercial usage is permitted.

@@ Line 1: / Line 1: @@
 ===Introduction===
-* Mandarin Visual Speech, a large-scale Chinese Mandarin audio-visual dataset published by Center for Speech and Language Technology (CSLT) at Tsinghua University.
+* CN-CVS, a large-scale Chinese Mandarin audio-visual dataset published by Center for Speech and Language Technology (CSLT) at Tsinghua University.
 ===Members===
@@ Line 32: / Line 32: @@
 ===Source Code===
-* Collection Pipeline: TODO
+* Collection Pipeline: https://github.com/sectum1919/cncvs_data_collector
+* xTS: TODO
+* VCA-GAN: TODO
 ===Download===
 * Public (recommended)
-TODO
+https://cloud.tsinghua.edu.cn/d/83f13126daec49deb8a3/
 * Local (not recommended)
-TODO
+https://cloud.tsinghua.edu.cn/d/83f13126daec49deb8a3/
 ===Future Plans===
+* Extract text transcriptions via OCR, ASR, and human checking
+* Extend baseline to benchmark
 ===License===