Torch speaker

PyTorch Speaker is a speaker recognition research toolkit written in PyTorch.

Introduction to PyTorch Speaker

Project features

No dependency on Kaldi
Fast offline deployment to mobile and embedded devices
Rich data visualization support

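Speaker recognition (SRE), also known as voiceprint recognition (VPR), is a biometric technology that automatically identifies a speaker from the speech parameters (the "voiceprint") in the speech signal that reflect the speaker's physiological and behavioral characteristics. At its core, speaker recognition is a pattern recognition problem. Depending on the scenario and requirements, it can be divided into the following three subtasks: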

Task (Chinese)	Task (English)	Definition (Chinese)	Definition (English)
说话人辨认	Speaker Identification		Identifies the true speaker from a set of candidates.
说话人确认	Speaker Verification		Tests whether an alleged speaker is the true speaker.
说话人追踪	Speaker Diarization		Addresses the problem of "who spoke when": partitions a conversation recording into several speech segments, each of which belongs to a single speaker.

System performance

Network architecture	Parameter count	Loss function	Data augmentation	Training set	Test set	Equal Error Rate	DCF(10^-2)	DCF(10^-3)

Project structure

.
├── config/  # YAML configuration files
├── docs/    # documentation
├── README.md
├── requirements.txt
├── scripts/ # data processing scripts
├── setup.py
├── tools/   # training, inference, and other scripts
└── torch_speaker/ # main implementation of the model pipeline
    ├── backbone/
    ├── data/
    ├── loss/
    ├── score/
    ├── module.py
    └── utils/

Quick installation and getting started

Step	Usage	Notes
Installation	git clone; cd torch_speaker; pip install -r requirements.txt; python setup.py develop
Data preparation/preprocessing	Build datlist.csv with pandas to prepare the data.
Training
Evaluation
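
As a rough illustration of the data preparation step, the sketch below builds a datlist.csv-style file. It is not the project's script: it uses the standard csv module instead of pandas so that it runs without extra dependencies, and the directory layout (one sub-directory per speaker), the function name build_datalist, and the column names speaker, label, and path are all assumptions made for this example.

```python
import csv
import os


def build_datalist(wav_root, csv_path):
    """Scan wav_root/<speaker>/<utterance>.wav and write one CSV row per file.

    The directory layout and the column names are assumptions of this sketch,
    not the format used by the project.
    """
    speakers = sorted(
        d for d in os.listdir(wav_root)
        if os.path.isdir(os.path.join(wav_root, d))
    )
    # Map every speaker name to an integer class label.
    label_of = {spk: idx for idx, spk in enumerate(speakers)}

    rows = []
    for spk in speakers:
        spk_dir = os.path.join(wav_root, spk)
        for name in sorted(os.listdir(spk_dir)):
            if name.endswith(".wav"):
                rows.append({
                    "speaker": spk,
                    "label": label_of[spk],
                    "path": os.path.join(spk_dir, name),
                })

    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["speaker", "label", "path"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

With pandas, the same rows could be passed to a DataFrame and written with to_csv; the list-of-dicts shape was chosen to make that swap straightforward.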

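Acoustic feature extraction

Because torchaudio has some bugs, and because I plan to run further experiments based on wavelet analysis, the acoustic feature extraction code is implemented by hand entirely in PyTorch, with no other third-party dependencies.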

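A few points are worth noting:

Kaldi's feature extraction is offline, whereas a PyTorch implementation can extract features online;
Mel-Spectrogram is also called Fbank or FilterBank. The name Fbank is more common in Kaldi, while Mel-Spectrogram is more common in TTS and VC;
Even with identical configuration parameters, my implementation, Kaldi, librosa, and torchaudio all produce different results.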

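Feature	Pipeline	Code location
Spectrogram	(1) Pre-emphasis, which compensates for the loss in the high-frequency range and preserves vocal tract information; (2) windowing (Hamming window, which reduces the Gibbs phenomenon) followed by a short-time Fourier transform (STFT); (3) magnitude of the complex STFT output, then logarithm (1e-9 is added to avoid log of 0); (4) Instance Norm, which can be regarded as equivalent to CMVN (cepstral mean and variance normalization)	[link]
Mel-Spectrogram	(1) Pre-emphasis, which compensates for the loss in the high-frequency range and preserves vocal tract information; (2) windowing (Hamming window, which reduces the Gibbs phenomenon) followed by a short-time Fourier transform (STFT); (3) magnitude of the complex STFT output, then logarithm (1e-9 is added to avoid log of 0); (4) Mel filtering; (5) Instance Norm, which can be regarded as equivalent to CMVN (cepstral mean and variance normalization)	[link]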
MFCC
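
The Spectrogram row can be sketched in plain Python as follows. This is an illustrative sketch and not the project's PyTorch code: it uses a naive DFT instead of an FFT for readability, and the function names, the pre-emphasis coefficient 0.97, the frame length, and the hop size are assumptions chosen for the example.

```python
import cmath
import math


def pre_emphasis(signal, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; boosts the high-frequency range."""
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def hamming(length):
    """Symmetric Hamming window of the given length."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def log_spectrogram(signal, frame_len=64, hop=32, eps=1e-9):
    """Return a list of frames, each holding frame_len // 2 + 1 log-magnitudes."""
    emphasized = pre_emphasis(signal)
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = [emphasized[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        # Naive DFT of the windowed frame; an FFT would be used in practice.
        for k in range(frame_len // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            # Magnitude, then log; eps keeps log() away from zero.
            spectrum.append(math.log(abs(acc) + eps))
        frames.append(spectrum)
    return frames


def instance_norm(frames, eps=1e-9):
    """Normalize each frequency bin to zero mean and unit variance over time."""
    num_frames, num_bins = len(frames), len(frames[0])
    normed = [[0.0] * num_bins for _ in range(num_frames)]
    for k in range(num_bins):
        column = [frames[t][k] for t in range(num_frames)]
        mean = sum(column) / num_frames
        var = sum((v - mean) ** 2 for v in column) / num_frames
        std = math.sqrt(var + eps)
        for t in range(num_frames):
            normed[t][k] = (column[t] - mean) / std
    return normed
```

Calling instance_norm(log_spectrogram(samples)) on a list of float samples yields one normalized log-spectrum per frame. A Mel-Spectrogram would additionally apply a Mel filter bank before the normalization step.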

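Common backbones and their implementation

The word "backbone" originally refers to a person's spine and, by extension, to a pillar or core. In neural networks, especially in CV, features are usually extracted from the image first (common choices are VGGNet, ResNet, and Google's Inception). This stage is the foundation of the whole CV task, because the downstream tasks (classification, generation, and so on) all build on the extracted image features. Calling this part of the network the backbone is therefore an apt image: it is the pillar that lets the whole model stand.

ResNet and its variants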

TDNN and its variants

Common loss functions

Back-end scoring

Scoring method	Formula	Implementation
cosine
PLDA
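
For the cosine row, a minimal sketch of cosine scoring between two speaker embeddings is given below. The function names cosine_score and verify and the decision threshold of 0.5 are hypothetical; in practice the threshold would be tuned on a development set.

```python
import math


def cosine_score(emb_a, emb_b, eps=1e-12):
    """Cosine similarity between two embeddings given as lists of floats."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    # eps guards against division by zero for an all-zero embedding.
    return dot / max(norm_a * norm_b, eps)


def verify(enroll_emb, test_emb, threshold=0.5):
    """Accept the trial when the score reaches the (hypothetical) threshold."""
    return cosine_score(enroll_emb, test_emb) >= threshold
```

The score lies in [-1, 1]; higher values indicate that the two utterances are more likely to come from the same speaker.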

data loader

Data augmentation

class imbalance sampler

Evaluation metrics

Metric	Calculation
EER
minDCF
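
The two metrics can be sketched as follows. This is a simple threshold sweep written for clarity, not the project's implementation. The function names are hypothetical, and the cost parameters (c_miss = c_fa = 1, with p_target = 0.01 matching the DCF(10^-2) column) are assumptions in the style of the NIST SRE cost function; the minDCF returned here is normalized by the cost of the best trivial system.

```python
def error_rates(scores, labels):
    """Return (thresholds, false_accept_rates, false_reject_rates).

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    A trial is accepted when its score is >= the threshold.
    """
    num_target = sum(labels)
    num_nontarget = len(labels) - num_target
    thresholds = sorted(set(scores))
    thresholds.append(thresholds[-1] + 1.0)  # threshold that rejects everything
    fars, frrs = [], []
    for thr in thresholds:
        false_accepts = sum(
            1 for s, l in zip(scores, labels) if l == 0 and s >= thr
        )
        false_rejects = sum(
            1 for s, l in zip(scores, labels) if l == 1 and s < thr
        )
        fars.append(false_accepts / num_nontarget)
        frrs.append(false_rejects / num_target)
    return thresholds, fars, frrs


def compute_eer(scores, labels):
    """Equal error rate: the point where FAR and FRR are closest."""
    _, fars, frrs = error_rates(scores, labels)
    idx = min(range(len(fars)), key=lambda i: abs(fars[i] - frrs[i]))
    return (fars[idx] + frrs[idx]) / 2.0


def compute_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds."""
    _, fars, frrs = error_rates(scores, labels)
    costs = [
        c_miss * p_target * frr + c_fa * (1.0 - p_target) * far
        for far, frr in zip(fars, frrs)
    ]
    # Cost of the better of "accept everything" and "reject everything".
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return min(costs) / default_cost
```

The sweep is quadratic in the number of trials, which is fine for illustration; large trial lists are usually handled by sorting the scores once and accumulating the error counts.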

Adversarial attacks and defenses

Adversarial attacks

Attack method	Calculation	Implementation
BIM
PGD
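
Both attacks in the table share one iterative update: take a step along the sign of the loss gradient with respect to the input, then project back into an L-infinity ball of radius eps around the original input. BIM starts from the clean input, while PGD is commonly run from a random point inside the ball. The sketch below illustrates this on a toy linear model with a logistic loss, which stands in for a real speaker network; the model, the function names, and all parameter values are assumptions made for the example.

```python
import math
import random


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def loss_and_grad(x, y, w, b):
    """Logistic loss of a toy linear model and its gradient w.r.t. the input.

    y is +1 or -1. The linear model is a stand-in for a real network.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    loss = math.log1p(math.exp(-y * z))
    coeff = -y / (1.0 + math.exp(y * z))  # d loss / d z
    return loss, [coeff * wi for wi in w]


def iterative_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=10,
                     random_start=False, seed=0):
    """BIM when random_start is False, PGD-style when it is True."""
    rng = random.Random(seed)
    if random_start:
        x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
    else:
        x_adv = list(x)
    for _ in range(steps):
        _, grad = loss_and_grad(x_adv, y, w, b)
        # Gradient-sign ascent step on the loss.
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Projection onto the L-infinity ball of radius eps around x.
        x_adv = [
            min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)
        ]
    return x_adv
```

For a real model, loss_and_grad would be replaced by a forward and backward pass through the network, and the input would usually also be clipped to the valid waveform or feature range.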

Adversarial defenses

Utility code and scripts

Utility scripts

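Script	Approach and workflow	Code location
Reading waveforms	Open-source tools read speech data in two main ways: one represented by MATLAB and soundfile, the other by Kaldi and scipy. In this project, the reading strategy differs between the training and evaluation stages.	[link]
Reading hyperparameters	Hyperparameter loading follows the implementation in the nanodet project and uses yacs to read hyperparameters from YAML files.	[link]
Voice Activity Detection (VAD)	VAD is implemented with PyWebrct and run with Python multiprocessing.
Signal-to-noise ratio (SNR) calculation
Accuracy calculation
Interpolation (interpolate)
Documentation generation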
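
As an illustration of the SNR row, the sketch below computes the signal-to-noise ratio in decibels as 10 * log10(P_signal / P_noise). It assumes that a clean reference signal and a time-aligned noisy version of equal length are both available, so that the noise is their difference; the function name snr_db is hypothetical and this is not necessarily how the project's script works.

```python
import math


def snr_db(clean, noisy, eps=1e-12):
    """SNR in dB, taking noise = noisy - clean (signals must be aligned)."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy)) / len(clean)
    # eps keeps the ratio finite when either power is zero.
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))
```

A noise amplitude ten times smaller than the signal amplitude corresponds to roughly 20 dB.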

Visualization

Feature	Preview	Code link
Plot a spectrogram
Plot a 3D spectrogram

MISC

Framework	Description
PyTorch	PyTorch is a Python package that provides two high-level features: tensor computation (like NumPy) with strong GPU acceleration, and deep neural networks built on a tape-based autograd system.
PyTorch Lightning	The goal of PyTorch Lightning is "You do the research. Lightning will do everything else". PyTorch Lightning was started by William Falcon while completing his Ph.D. AI research at NYU CILVR and Facebook AI Research, with the vision of making it a foundational part of everyone’s deep learning research code. The framework was designed for professional and academic researchers working in AI, making state of the art AI research techniques, such as TPU training, trivial.
ONNX	Open Neural Network Exchange (ONNX) is an open ecosystem that empowers AI developers to choose the right tools as their project evolves. ONNX provides an open source format for AI models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.
NCNN	ncnn is a high-performance neural network inference computing framework optimized for mobile platforms. ncnn is deeply considerate about deployment and uses on mobile phones from the beginning of design. ncnn does not have third party dependencies. It is cross-platform, and runs faster than all known open source frameworks on mobile phone CPU. Developers can easily deploy deep learning algorithm models to the mobile platform by using efficient ncnn implementation, create intelligent APPs, and bring the artificial intelligence to your fingertips. ncnn is currently being used in many Tencent applications, such as QQ, Qzone, WeChat, Pitu and so on.
YACS	YACS was created as a lightweight library to define and manage system configurations, such as those commonly found in software designed for scientific experimentation. These "configurations" typically cover concepts like hyperparameters used in training a machine learning model or configurable model hyperparameters, such as the depth of a convolutional neural network.
Sphinx	Sphinx is a tool that makes it easy to create intelligent and beautiful documentation, written by Georg Brandl and licensed under the BSD license.

References