Hi, everyone. Thanks for joining our session. Today I'll be presenting on test-time adaptation toward personalized speech enhancement: zero-shot learning with knowledge distillation. My name is Sunwoo Kim, and I'm advised by Professor Minje Kim at Indiana University. This material is based upon work supported by the National Science Foundation.
Recent advances in deep learning-based speech enhancement models have shown great performance improvements. These models are typically designed to be large and complex, and trained on a large training dataset, so that they generalize well to various test-time conditions, including different speakers, noises, and signal-to-noise ratios of the added noise.
However, these large models do not fit on small devices, and their computational complexity can be too burdensome. Model compression methods can help reduce the size, but compression introduces errors, and consequently there will be some degradation in performance. We can always design small models that easily fit onto the devices, but then these models won't generalize well.
So we present a personalized speech enhancement solution that adapts small models to the specific speakers and acoustic contexts of the test-time environment. We target practical use cases, for example, a family-owned smart assistant device sitting in a living room, where it suffices for the enhancement model to perform well only in this specific test environment. By focusing on a small subset of test-time data, the small models can be updated to improve performance. But personalization would ordinarily require personal data to use as ground-truth targets for the fine-tuning objective function.
And this is non-trivial, as it can be challenging to obtain the required user information due to privacy concerns: people don't want to share their speech, especially clean utterances. Also, even with user compliance, the recordings obtained during the device enrollment phase can be contaminated with existing background noise.
And another potential issue is that the provided recordings could simply not be long enough. So in this study, we propose an alternative solution that achieves personalization via zero-shot learning, meaning it operates without obtaining ground-truth clean speech data. We present this personalization method based on the knowledge distillation framework: it does not ask for private signals from the user, while it can still adapt to the user's speech and recording environment.
Given noisy test-time data, we can directly use a teacher model's outputs as ground-truth targets to personalize a student model. We'll explain in more detail in the next sections, but first we'll mention some related work. Other personalization methods have been explored through various approaches.
One such example is our Interspeech 2021 paper, which utilizes a self-supervised learning strategy that makes use of widely available noisy test-time recordings to make up for limited clean speech sources: pseudo speech enhancement (pseudo-SE). Pseudo-SE is a framework that enables the use of noisy speech. It begins with an already noisy speech signal s̃ and treats it as if it were the clean speech target by mixing it with an additional noise source n.
It can learn some pseudo-SE function, which works to some degree, but it can be far from the actual enhancement problem if the pseudo target s̃ isn't clean enough. To remedy this issue, we proposed a data purification approach that estimates the cleanness of the frames based on an SNR predictor, which is represented in this graph as a probability p.
During the personalization phase, the model is trained to focus more on the cleaner frames, from which the pseudo-SE method learns more powerful self-supervised features. Eventually, for the successfully estimated clean frames, we expect the pseudo-SE process to behave like a supervised enhancement model learned from clean personal utterances.
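To make the purification idea concrete, here is a minimal sketch of one purified pseudo-SE training step. This is our simplification rather than the paper's exact objective; the `model` argument, the tensor shapes, and the mean-squared-error loss are all assumptions for illustration.

```python
import torch.nn.functional as F

def purified_pseudo_se_loss(model, s_tilde, n, p):
    """One purified pseudo-SE step (illustrative sketch).

    s_tilde: noisy speech used as the pseudo clean target, (batch, frames, freq)
    n:       additional noise mixed in, same shape
    p:       estimated per-frame cleanness probabilities, (batch, frames)
    """
    x = s_tilde + n                    # further-corrupted pseudo input
    s_hat = model(x)                   # model's estimate of the pseudo target
    # Per-frame error against the pseudo target s_tilde
    err = F.mse_loss(s_hat, s_tilde, reduction="none").mean(dim=-1)
    # Data purification: weight frames by estimated cleanness, so training
    # focuses on frames that are likely close to clean speech.
    return (p * err).sum() / p.sum()
```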
Another example is personalization through an ensemble of specialist sub-networks, or local experts. First, all of the sub-networks are independently trained to specialize on targeted latent spaces, such as speaker identity, SNR levels denoting signal degradation, gender, and so on.
During test time, an auxiliary network selects the best specialist for the test-time input signal, thereby reducing computational complexity. Some drawbacks of this approach are that the total memory occupancy is not reduced and that the adaptation is limited to the predefined sub-problems.
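A rough sketch of how such a system dispatches inputs at test time; the names `gating_net` and `experts` are hypothetical, not from the original work:

```python
import torch

def enhance_with_expert(x, gating_net, experts):
    """Pick and run the single best specialist for the input (sketch).

    Only one expert runs per input, which keeps the compute low, but all
    experts must remain in memory, hence the unchanged memory footprint.
    """
    with torch.no_grad():
        scores = gating_net(x)       # one suitability score per expert
        k = int(scores.argmax())     # index of the best-matching specialist
    return experts[k](x)
```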
Now we'll present our proposed method. First, we pre-train a large teacher model T using a large-scale speech corpus and noise dataset in a typical, fully supervised learning framework. Once it's trained, we freeze the teacher model, and its parameters are no longer updated after this step. The teacher model is designed to be large so that it generalizes very well, but again, it can be too big.
Next, we also pre-train a small, efficient student model S on the same dataset. Due to its small architecture, we expect it to make mistakes and not generalize well. Ideally, we would like to fix its mistakes and improve this compact student model, which is tricky because we do not have the ground-truth labels to fine-tune on during test time.
So for test-time adaptation, to complement these missing labels, we use the teacher model's estimates as pseudo targets to fine-tune the student. It can be expected that the larger teacher model will outperform the student models on the test signals due to its large computational capacity, and working under the zero-shot condition, we assume that having these synthesized pseudo targets is better than nothing. It's entirely possible for the teacher's estimates to be imperfect, and they can contain noisy artifacts, but these are still better than the student's estimates. So we assume that the student will learn from these imperfect targets and be able to adapt to its test-time conditions to improve its enhancement performance.
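A minimal sketch of this test-time adaptation loop, assuming models callable on noisy inputs; the optimizer, loss, and hyperparameters are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def personalize(student, teacher, noisy_loader, epochs=10, lr=1e-4):
    """Fine-tune the student on the frozen teacher's estimates (sketch)."""
    teacher.eval()
    for prm in teacher.parameters():         # the teacher stays frozen
        prm.requires_grad_(False)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x in noisy_loader:               # noisy test-time recordings only
            with torch.no_grad():
                pseudo_target = teacher(x)   # teacher output as pseudo ground truth
            loss = F.mse_loss(student(x), pseudo_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

Note that no clean speech ever enters this loop; the only supervision signal is the teacher's own enhanced output.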
And if the teacher's predictions are near perfect, our knowledge distillation framework becomes similar to supervised personalization. For the actual use case, only the student model is ultimately used for inference on the device. We envision the teacher model being placed externally on a cloud server, where the actual fine-tuning operations are conducted.
So the large memory and computation of the teacher model is not going to be burdensome. The parameters of the updated student model can then be transferred to the user device, and given its small size, the transfer is not burdensome either. Another use case is where the teacher model also resides on the device, and the fine-tuning procedure occurs during the device's idle time.
We'll now describe the experimental setup, starting with the model descriptions. Most of our enhancement models are based on the uni-directional gated recurrent unit (GRU) architecture. Among a wide array of model candidates, say temporal convolutional networks or densely connected recurrent networks, we deliberately chose RNNs for their simple yet effective architecture, their success in sequential and temporal processing tasks, and their much lower computational complexity compared to convolutional architectures. Our models are designed to take frequency-domain STFT inputs, and a dense layer transforms the GRU's output into ideal ratio masks. For the student models, the GRU architectures are fixed with two hidden layers, and we varied the hidden units from 32 to 1024. As for the teacher model, we use a 3×1024 GRU architecture, chosen to be large enough to outperform the students. In addition to the larger architecture, we also employ ConvTasNet as an alternate teacher model.
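Our reading of that description, as a small PyTorch sketch; the class name, layer names, and the 513-bin STFT size are our assumptions:

```python
import torch
import torch.nn as nn

class GRUMaskNet(nn.Module):
    """Mask-based GRU enhancer sketch: magnitude STFT frames in,
    ideal-ratio-mask estimates in [0, 1] out."""
    def __init__(self, n_freq=513, hidden=1024, layers=2):
        super().__init__()
        self.gru = nn.GRU(n_freq, hidden, num_layers=layers, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)     # dense layer -> mask

    def forward(self, x):                         # x: (batch, time, n_freq)
        h, _ = self.gru(x)
        return torch.sigmoid(self.mask(h))        # one mask value per TF bin

# Enhancement multiplies the noisy magnitude spectrogram by this mask and
# resynthesizes the waveform with the noisy phase.
```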
The model sizes and computational complexity are presented in the table below. For the GRU models, you can see that increasing the hidden units and number of layers quickly increases the complexity of the model. We can also see that the ConvTasNet teacher model has substantially fewer parameters than the GRU model, but due to its extensive convolutional operations, its computational complexity is much higher.
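As a back-of-the-envelope check using the sketch above (exact numbers depend on the FFT size and implementation details):

```python
def count_params(model):
    return sum(p.numel() for p in model.parameters())

small = GRUMaskNet(hidden=128, layers=2)    # roughly 0.4 M parameters
large = GRUMaskNet(hidden=1024, layers=2)   # roughly 11.6 M parameters
print(count_params(small), count_params(large))
```

With a 513-bin input, the 2×128 model comes out around 28 times smaller than the 2×1024 model, consistent with the roughly 30× gap we point out later.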
And next we'll describe the datasets used for the pre-training and the test-time fine-tuning stages. For pre-training, we use clean speech recordings from the Librispeech corpus and noise recordings from the MUSAN dataset. We use Librispeech's train-clean-100 for training and dev-clean for validation.
We split MUSAN's free-sound subset into training and validation partitions at an 80/20 ratio. The noisy mixtures are obtained by adding the noise to speech signals at random input SNR levels, uniformly chosen between -5 and 10 decibels. For test-time fine-tuning, we use 30 speakers from Librispeech's test-clean subset and noise from the WHAM! corpus.
From these sets, we create 30 unique test environments for fine-tuning. Here, we define an environment by assigning a unique noise location to each speaker, and we create mixtures for each environment by using the clean speech signals from one designated speaker and adding noises from one designated noise location.
For each test environment, we then split the clean speech and noise datasets into separate sets for fine-tuning, validation, and testing. These partitions are approximately five minutes of clean speech for fine-tuning, one minute for validation, and another minute for testing; the noise datasets are prepared similarly. We emphasize this to show that we use unseen samples to test the final performance of the fine-tuned personalization system.
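For reference, mixing a noise signal into speech at a target SNR can be sketched as follows; this is the standard recipe, assuming equal-length 1-D float arrays, not necessarily our exact preprocessing code:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Pre-training draws the input SNR uniformly from [-5, 10] dB:
snr_db = np.random.uniform(-5.0, 10.0)
```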
So here we present the results and discussion. The box plot shows the enhancement performance of various models in environments synthesized at the 0 dB input SNR level. The results are shown for pre-trained and fine-tuned student models, as well as the teacher models for reference. First, we'll introduce the notation.
The pre-trained student model is denoted with script S. The teacher models are script T, with subscripts denoting the network architecture: GRU for the 3×1024 GRU architecture and CTN for ConvTasNet. The fine-tuned student models are denoted with a tilde to differentiate them from their pre-trained versions.
And they have subscripts to indicate what they learned from: so S̃ with subscript GRU is a student model fine-tuned on the GRU teacher's outputs, and likewise for the ConvTasNet. From the box plot, we can see that our proposed personalization framework consistently improves all pre-trained student models across various model sizes.
There is distinct improvement from the fine-tuned models. And we can observe that the personalized models learned from the ConvTasNet teacher always outperform their corresponding ones fine-tuned using the GRU teacher. This showcases that the quality of the teacher model's performance is related to the performance of fine-tuning, and that the structural discrepancy between student and teacher, GRU versus ConvTasNet, is not an issue.
We'll also point out that the smaller fine-tuned student models can outperform a larger generalist. For instance, the fine-tuned 2×128 model learned from the ConvTasNet teacher outperforms the 2×1024 generalist. And from the table of model complexity that we've seen before, the 2×128 model is about 30 times lower in the number of MACs and parameters compared to the 2×1024 version.
And this shows that, instead of scaling up generalist architectures to produce better generalization, it is more advantageous to personalize models. This verifies that personalized speech enhancement is a model compression method in its own right, and it can be a preferred direction over applying generic compression methods.
To conclude, in this study we proposed a knowledge-distillation-based zero-shot learning framework for personalized speech enhancement.
And in this framework, we utilize the teacher's estimates as targets, which otherwise do not exist during test time. From our experiments, we showed that the student models' performance greatly improves on a specific test-time speaker and their acoustic environment. Our personalized student models give superior performance to large generalist models, demonstrating that our framework is another mode of model compression that does not sacrifice model performance. Our framework is simple and works end to end, so we expect that it can provide improvements with different data or loss functions and even be applicable to other domains.
Thank you for listening. For a demo and source code, please refer to our group's website. Thank you.