Torch speaker

PyTorch Speaker is a research toolkit for speaker recognition built on PyTorch.

Introduction to PyTorch Speaker

PyTorch Speaker is a research toolkit for speaker recognition, written in PyTorch.

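Features

No dependency on Kaldi and no advanced shell syntax
Supports fast offline deployment on mobile phones and embedded devices
Rich data visualization support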

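Speaker recognition (SRE), also known as voiceprint recognition (VPR), is a biometric technology that automatically identifies a speaker from speech parameters (the "voiceprint") that reflect the speaker's physiological and behavioral characteristics in the speech signal. It is essentially a pattern recognition problem. Depending on the scenario and requirements, speaker recognition can be divided into the three subtasks shown in the figure below:

[Figure: the three subtasks of speaker recognition]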

Task	Chinese name	Definition
Speaker Identification	说话人辨认	Determines which of several candidate speakers produced a given utterance; a "1 vs. N" decision.
Speaker Verification	说话人确认	Determines whether a given utterance was spoken by one specific claimed speaker; a "1 vs. 1" decision.
Speaker Diarization	说话人追踪	Addresses the problem of "who spoke when": partitioning a conversation recording into several speech segments, each of which belongs to a single speaker.

System performance

Network architecture	Parameters	Loss function	Data augmentation	Training set	Test set	EER	DCF(10^-2)	DCF(10^-3)	YAML
resnet34_TSP	10.6M	softmax	No	VoxCeleb 1&2 (7205)

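Project structure

.
├── config/  # YAML configuration files
├── docs/    # documentation
├── README.md
├── requirements.txt
├── scripts/ # data processing and data visualization scripts
├── setup.py 
├── tools/   # training, inference, quantization, and deployment scripts
└── torch_speaker/ # main implementation of the model pipeline
    ├── backbone/
    ├── audio/
    ├── loss/
    ├── score/
    ├── module.py
    └── utils/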

Quick installation and getting started

Step	Usage	Notes
Installation	git clone; cd torch_speaker; pip install -r requirements.txt; python setup.py develop
Data preparation/preprocessing	The data list (datalist.csv) is built with pandas. python3 scripts/build_datalist.py --extension wav --dataset_dir data/train --data_list_path data/train.csv; python3 scripts/format_trials.py --voxceleb1_root $voxceleb1_path --src_trials_path data/voxceleb1_test_v2.txt --dst_trials_path data/trial.lst	Audio is expected in WAV format by default, with a 16 kHz sampling rate.
Training	python3 tools/train.py --config config/${yaml_path}
Evaluation
Export
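
The data-preparation step can be pictured with a minimal sketch. It is not the project's build_datalist.py: it uses the standard csv module instead of pandas, and the column names (speaker, label, path) and the <dataset_dir>/<speaker>/... directory layout are assumptions made for illustration only.

```python
import csv
import tempfile
from pathlib import Path


def build_datalist(dataset_dir, data_list_path, extension="wav"):
    """Write one CSV row per audio file: speaker, integer label, file path."""
    root = Path(dataset_dir)
    files = sorted(root.rglob(f"*.{extension}"))
    # Assumption: files are laid out as <dataset_dir>/<speaker>/.../<utterance>.
    speakers = sorted({f.relative_to(root).parts[0] for f in files})
    label_of = {name: index for index, name in enumerate(speakers)}
    rows = [(f.relative_to(root).parts[0], f) for f in files]
    with open(data_list_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["speaker", "label", "path"])  # hypothetical column names
        for speaker, path in rows:
            writer.writerow([speaker, label_of[speaker], str(path)])
    return len(rows)


# Demo on a throwaway directory with two speakers and three empty files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("id001/a.wav", "id001/b.wav", "id002/c.wav"):
        path = Path(tmp, "train", name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    csv_path = Path(tmp, "train.csv")
    n_rows = build_datalist(Path(tmp, "train"), csv_path)
    with open(csv_path, newline="") as handle:
        table = list(csv.reader(handle))
```

Mapping each speaker name to a contiguous integer label is what a softmax-style classification loss needs as its training target.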

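Acoustic feature extraction

Because the existing library contains some bugs, and because I plan to run further experiments based on wavelet analysis, the acoustic feature extraction code is implemented entirely by hand in PyTorch, with no dependency on other third-party libraries.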


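Note that:

Kaldi's feature extraction is offline, whereas a PyTorch implementation can extract features online;
Mel-Spectrogram is also called Fbank or FilterBank: "Fbank" is the more common name in Kaldi, while "Mel-Spectrogram" is more common in TTS and VC;
The features produced by my implementation, Kaldi, librosa, and torchaudio all differ from one another, even with identical configuration parameters.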

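Feature	Processing steps	Code location
Spectrogram	(1) Pre-emphasis, which compensates for high-frequency attenuation and preserves vocal tract information. (2) Windowing with a Hamming window, which reduces the Gibbs phenomenon, followed by a short-time Fourier transform (STFT). (3) Magnitude of the complex STFT output, then logarithm (1e-9 is added to avoid log(0)). (4) Instance normalization, which can be treated as equivalent to cepstral mean and variance normalization (CMVN).	[link]
Mel-Spectrogram	(1) Pre-emphasis, which compensates for high-frequency attenuation and preserves vocal tract information. (2) Windowing with a Hamming window, which reduces the Gibbs phenomenon, followed by a short-time Fourier transform (STFT). (3) Magnitude of the complex STFT output, then logarithm (1e-9 is added to avoid log(0)). (4) Mel filtering. (5) Instance normalization, which can be treated as equivalent to cepstral mean and variance normalization (CMVN).	[link]
MFCC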
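
The Spectrogram row above can be sketched in a few lines of PyTorch. This is an illustration under stated assumptions, not the project's code: the FFT size, window length, hop length, and pre-emphasis coefficient (512, 400, 160, 0.97) are common choices for 16 kHz speech that the source does not specify, and the function name is hypothetical.

```python
import torch


def log_spectrogram(wav, n_fft=512, win_length=400, hop_length=160,
                    preemph=0.97, eps=1e-9):
    """Log-magnitude spectrogram of a batch of waveforms, shape (batch, samples)."""
    # Pre-emphasis: y[t] = x[t] - preemph * x[t-1], boosting high frequencies.
    wav = torch.cat([wav[:, :1], wav[:, 1:] - preemph * wav[:, :-1]], dim=1)
    # Hamming-windowed short-time Fourier transform.
    window = torch.hamming_window(win_length, device=wav.device)
    stft = torch.stft(wav, n_fft=n_fft, hop_length=hop_length,
                      win_length=win_length, window=window,
                      return_complex=True)
    # Magnitude, then log; eps keeps log() finite for silent bins.
    feat = torch.log(stft.abs() + eps)
    # Instance normalization: per utterance and frequency bin, over time.
    mean = feat.mean(dim=-1, keepdim=True)
    std = feat.std(dim=-1, keepdim=True)
    return (feat - mean) / (std + eps)


torch.manual_seed(0)
wav = torch.randn(2, 16000)   # two 1-second utterances at 16 kHz
feat = log_spectrogram(wav)   # (batch, n_fft // 2 + 1, frames)
```

Because every step is an ordinary tensor operation, the same function can run on the GPU inside the training loop, which is one way to get the online feature extraction mentioned above.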

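Common backbones and their implementations

The word "backbone" originally refers to the spine and, by extension, to a pillar or core. In neural networks, especially in computer vision (CV), features are usually extracted from the image first (common choices are VGGNet, ResNet, and Google's Inception). This stage is the foundation of the whole CV task, because downstream tasks such as classification and generation all build on the extracted image features. Calling this part of the network the backbone is therefore apt: it is the pillar that holds everything else up.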

ResNet and its variants

ResNet is


TDNN and its variants

The ECAPA-TDNN architecture is based on the popular x-vector topology and introduces several enhancements to create more robust speaker embeddings.


The pooling layer uses a channel- and context-dependent attention mechanism, which allows the network to attend to different frames per channel. 1-dimensional Squeeze-Excitation (SE) blocks rescale the channels of the intermediate frame-level feature maps to insert global context information into the locally operating convolutional blocks. Next, the integration of 1-dimensional Res2-blocks improves performance while simultaneously reducing the total parameter count by using grouped convolutions in a hierarchical way.

Finally, Multi-layer Feature Aggregation (MFA) merges complementary information before the statistics pooling by concatenating the final frame-level feature map with the intermediate feature maps of preceding layers.

The network is trained by optimizing the AAM-softmax loss on the speaker identities in the training corpus. The AAM-softmax is a powerful enhancement compared to the regular softmax loss in the context of fine-grained classification and verification problems. It directly optimizes the cosine distance between the speaker embeddings.

The model turned out to work amazingly well for speaker verification and speaker diarization.
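
To make the AAM-softmax description above more concrete, here is a minimal sketch of the loss. The margin and scale values (0.2 and 30) and the function name are illustrative assumptions, and the sketch leaves out the special handling that practical implementations usually add for angles where theta + margin would exceed pi.

```python
import torch
import torch.nn.functional as F


def aam_softmax_loss(embeddings, weight, labels, margin=0.2, scale=30.0):
    """Cross-entropy over scale * cos(theta + margin) for the target class."""
    # Cosine similarity between L2-normalized embeddings and class weights.
    cosine = F.linear(F.normalize(embeddings), F.normalize(weight))
    cosine = cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    theta = torch.acos(cosine)
    # Add the angular margin only to the angle of each sample's true class.
    one_hot = F.one_hot(labels, num_classes=weight.shape[0]).bool()
    logits = torch.where(one_hot, torch.cos(theta + margin), cosine)
    return F.cross_entropy(scale * logits, labels)


torch.manual_seed(0)
emb = torch.randn(8, 32)          # 8 embeddings of dimension 32
w = torch.randn(10, 32)           # one weight vector per training speaker
y = torch.randint(0, 10, (8,))    # speaker labels
loss = aam_softmax_loss(emb, w, y)
loss_no_margin = aam_softmax_loss(emb, w, y, margin=0.0)
```

The margin lowers the target-class logit, so the network has to pull each embedding closer (in angle) to its own speaker's weight vector than plain softmax would require.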

Common pooling layers

Pooling Layer
TSP
TAP
ASP
SAP
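
As a sketch of the two simplest entries, assuming TAP stands for temporal average pooling and TSP for temporal statistics pooling (the source lists only the acronyms), a pooling layer turns a variable-length sequence of frame-level features into one fixed-size utterance-level vector. The function names and tensor sizes below are hypothetical.

```python
import torch


def temporal_average_pooling(x):
    """TAP: mean over time. x has shape (batch, channels, frames)."""
    return x.mean(dim=-1)


def temporal_statistics_pooling(x, eps=1e-5):
    """TSP: concatenated mean and standard deviation over time."""
    mean = x.mean(dim=-1)
    # eps keeps the square root differentiable when the variance is zero.
    std = torch.sqrt(x.var(dim=-1, unbiased=False) + eps)
    return torch.cat([mean, std], dim=1)


torch.manual_seed(0)
frames = torch.randn(4, 256, 120)           # (batch, channels, frames)
tap = temporal_average_pooling(frames)      # shape (4, 256)
tsp = temporal_statistics_pooling(frames)   # shape (4, 512)
```

The attentive variants (ASP, SAP) follow the same pattern but replace the uniform average with learned per-frame weights.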

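Common losses

Loss	Description	Formula	Code
softmax
Triplet Loss	Triplet loss builds triplets consisting of an anchor, a positive, and a negative, where the anchor and positive are different utterances from the same speaker and the negative is an utterance from a different speaker. A large number of labeled triplets are then fed to the network to train the DNN parameters. Its advantage is that it uses the similarity between embeddings directly as the cost function, maximizing the anchor-positive similarity while minimizing the anchor-negative similarity. Once speaker embeddings have been extracted, speaker recognition reduces to a simple similarity computation.
AM-softmax	Speaker recognition systems built with Kaldi mostly train with the softmax loss. However, the softmax loss does not explicitly increase intra-class compactness or inter-class separation, so AM-softmax is used to make the embeddings more discriminative.
AAM-softmax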
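
The triplet loss row can be illustrated with a short, framework-free sketch. Cosine similarity and a margin of 0.2 are assumptions chosen for illustration; the source only says that the embedding similarity is used as the cost.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once sim(a, p) exceeds sim(a, n) by the margin."""
    sim_ap = cosine_similarity(anchor, positive)
    sim_an = cosine_similarity(anchor, negative)
    return max(0.0, sim_an - sim_ap + margin)


anchor = [1.0, 0.0]
positive = [0.9, 0.1]     # same speaker as the anchor
negative = [0.0, 1.0]     # different speaker
easy = triplet_loss(anchor, positive, negative)   # margin already satisfied
hard = triplet_loss(anchor, negative, positive)   # roles swapped, margin violated
```

Only triplets that violate the margin contribute a gradient, which is why implementations typically spend effort on selecting hard triplets.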

Back-end scoring

Scoring method	Description	Formula	Code
cosine
GPLDA
LDA-GPLDA

Training tricks

data loader

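Data augmentation

MUSAN dataset

Class balancing of training data

An imbalanced distribution of utterances across speakers can lead to mediocre model performance, so the data loader needs some preprocessing to balance the speaker classes.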
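
One common way to balance speaker classes, sketched here as an assumption rather than as the project's actual strategy, is to weight each utterance by the inverse of its speaker's utterance count and sample with those weights. The speaker names are hypothetical.

```python
import random
from collections import Counter


def balanced_sample_weights(speaker_labels):
    """Per-utterance weights so that every speaker is drawn with equal probability."""
    counts = Counter(speaker_labels)
    return [1.0 / counts[label] for label in speaker_labels]


labels = ["spk_a"] * 90 + ["spk_b"] * 10        # imbalanced toy data list
weights = balanced_sample_weights(labels)
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=10000)
share_b = drawn.count("spk_b") / len(drawn)     # close to 0.5 instead of 0.1
```

In PyTorch, weights of this form can be passed to torch.utils.data.WeightedRandomSampler, which is then given to the DataLoader as its sampler.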

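Evaluation metrics

ROC curve

For the speaker verification task, a model can be evaluated by plotting its ROC curve. The following definitions are needed first:

True Positive Rate (TPR): the proportion of all positive examples that are correctly identified as positive
False Positive Rate (FPR): the proportion of all negative examples that are incorrectly identified as positive
True Negative Rate (TNR): the proportion of all negative examples that are correctly identified as negative
TPR is also called sensitivity, and TNR is also called specificity.

The ROC (Receiver Operating Characteristic) curve plots FPR on the x-axis and TPR on the y-axis. AUC (Area Under Curve) is the area under the ROC curve: the larger the area and the closer it is to 1, the more accurate the classifier. The point closest to the top-left corner of the plot is the threshold at which both sensitivity and specificity are high. A useful property of the ROC curve is that it stays unchanged when the proportion of positive and negative samples in the test set changes.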
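
A small worked example may help: the sketch below sweeps a decision threshold over a toy set of trial scores, collects (FPR, TPR) points with FPR on the x-axis and TPR on the y-axis, and integrates the curve with the trapezoidal rule. The scores, labels, and function names are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, sweeping from strict to lenient."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # trial scores
labels = [1, 1, 0, 1, 0, 0]               # 1 = target trial, 0 = non-target trial
curve = roc_points(scores, labels)
area = auc(curve)                         # 8 of 9 target/non-target pairs are ordered correctly
```

The AUC equals the fraction of (target, non-target) pairs in which the target trial receives the higher score, which is 8/9 for this toy set.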

Metric	Computation	Code
EER
minDCF
Diarization Error Rate (DER)
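
The first two metrics in the table can be sketched as follows. This is a simplified illustration, not the project's scoring code: EER is taken at the threshold where the false acceptance and false rejection rates are closest (without interpolation), and minDCF uses the normalized detection cost with assumed parameters p_target = 0.01 and unit costs.

```python
def error_rates(scores, labels):
    """Return (threshold, FAR, FRR) triples for every candidate threshold."""
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    rates = []
    # float("inf") adds the "reject everything" operating point.
    for threshold in sorted(set(scores)) + [float("inf")]:
        # A trial is accepted when its score is at or above the threshold.
        fa = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fr = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        rates.append((threshold, fa / n_non, fr / n_tar))
    return rates


def compute_eer(rates):
    """EER: error rate at the threshold where FAR and FRR are closest."""
    _, far, frr = min(rates, key=lambda r: abs(r[1] - r[2]))
    return (far + frr) / 2.0


def compute_min_dcf(rates, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds."""
    dcf = min(c_miss * frr * p_target + c_fa * far * (1.0 - p_target)
              for _, far, frr in rates)
    return dcf / min(c_miss * p_target, c_fa * (1.0 - p_target))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # toy trial scores
labels = [1, 1, 0, 1, 0, 0]               # 1 = target trial, 0 = non-target trial
rates = error_rates(scores, labels)
eer = compute_eer(rates)
min_dcf = compute_min_dcf(rates)
```

With a small p_target, false acceptances dominate the cost, so the minDCF threshold sits higher than the EER threshold.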

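Adversarial attacks and defenses

When an AI model or algorithm is designed without considering the relevant security threats, its decisions are easily influenced by malicious attackers, causing the AI system to misjudge. The most important such threat is the evasion attack, in which the input is modified so that the AI model can no longer recognize it correctly. Research shows that deep learning systems are vulnerable to carefully crafted input samples, known as adversarial examples.

Speaker recognition systems have the same problem, whether they are based on deep neural networks or on traditional statistical models such as i-vector systems.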

Adversarial attacks

Attack method	Computation	Code
BIM
PGD

Adversarial defenses

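Utility code and scripts

Utility scripts

Script	Approach and workflow	Code location
Reading waveforms	Open-source tools read audio data in two main ways: one typified by MATLAB and soundfile, the other typified by Kaldi and SciPy. In this project, the reading strategy also differs between the training and evaluation stages.	[link]
Reading hyperparameters	Hyperparameter loading follows the nanodet project and uses yacs to read hyperparameters from YAML files.	[link]
Voice Activity Detection (VAD)	VAD is implemented with a Python WebRTC VAD binding and parallelized with Python multiprocessing.
Signal-to-noise ratio (SNR) computation
Accuracy computation
Interpolation (interpolate)
Documentation generation
format trials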
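
On the two waveform-reading conventions mentioned in the table: to my understanding, soundfile and MATLAB return floating-point samples scaled to roughly [-1, 1] by default, while scipy.io.wavfile and Kaldi keep 16-bit PCM samples in the integer range [-32768, 32767]. The sketch below, with hypothetical helper names, converts between the two so that features computed from either reader are comparable.

```python
def int16_to_float(samples):
    """Scale 16-bit PCM integers to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def float_to_int16(samples):
    """Scale floats back to 16-bit integers, clipping values that overflow."""
    return [max(-32768, min(32767, int(round(s * 32768.0)))) for s in samples]


pcm = [-32768, 0, 16384]
as_float = int16_to_float(pcm)          # [-1.0, 0.0, 0.5]
round_trip = float_to_int16(as_float)   # [-32768, 0, 16384]
clipped = float_to_int16([1.0])         # [32767], because 32768 does not fit in int16
```

Mixing the two conventions silently changes the scale of the input by a factor of 32768, which matters for any step that is not scale-invariant, such as adding a fixed epsilon before the logarithm.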

Visualization

Function	Preview	Code link
Plot spectrogram	[image]
Plot 3D spectrogram	[image]
Plot ROC curve	[image]
Plot PR curve
Plot confusion matrix

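Model deployment

Deep learning frameworks such as PyTorch and TensorFlow integrate both training and inference. In practice, however, getting the best performance on different kinds of platforms (cloud/edge, CPU/GPU, and so on) requires adapting the model (quantization, knowledge distillation) and using a dedicated inference library (ONNX, TensorRT).

ONNX Runtime is a high-performance inference engine for deploying ONNX models to production. It is optimized for cloud and edge and runs on Linux, Windows, and Mac. It is written in C++ and also provides C, Python, C#, Java, and JavaScript (Node.js) APIs, so it can be used in a variety of environments. ONNX Runtime supports both DNN and traditional ML models, and integrates with accelerators on different hardware (for example, TensorRT on NVIDIA GPUs, OpenVINO on Intel processors, and DirectML on Windows). Using ONNX Runtime lets you benefit from extensive production-grade optimization, testing, and ongoing improvement.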

Coding conventions

MISC

Framework	Description
PyTorch	PyTorch is a Python package that provides two high-level features: tensor computation (like NumPy) with strong GPU acceleration, and deep neural networks built on a tape-based autograd system.
PyTorch Lightning	The goal of PyTorch Lightning is "You do the research. Lightning will do everything else". PyTorch Lightning was started by William Falcon while completing his Ph.D. AI research at NYU CILVR and Facebook AI Research, with the vision of making it a foundational part of everyone’s deep learning research code. The framework was designed for professional and academic researchers working in AI, making state of the art AI research techniques, such as TPU training, trivial.
ONNX	Open Neural Network Exchange (ONNX) is an open ecosystem that empowers AI developers to choose the right tools as their project evolves. ONNX provides an open source format for AI models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.
NCNN	ncnn is a high-performance neural network inference computing framework optimized for mobile platforms. ncnn is deeply considerate about deployment and uses on mobile phones from the beginning of design. ncnn does not have third party dependencies. It is cross-platform, and runs faster than all known open source frameworks on mobile phone CPUs. Developers can easily deploy deep learning algorithm models to the mobile platform by using the efficient ncnn implementation, create intelligent apps, and bring artificial intelligence to your fingertips. ncnn is currently being used in many Tencent applications, such as QQ, Qzone, WeChat, Pitu and so on.
YACS	YACS was created as a lightweight library to define and manage system configurations, such as those commonly found in software designed for scientific experimentation. These "configurations" typically cover concepts like hyperparameters used in training a machine learning model or configurable model hyperparameters, such as the depth of a convolutional neural network.
Sphinx	Sphinx is a tool that makes it easy to create intelligent and beautiful documentation, written by Georg Brandl and licensed under the BSD license.

References