Hi, everyone. Thanks for joining our session. Today I'll be presenting on test-time adaptation toward personalized speech enhancement: zero-shot learning with knowledge distillation. My name is Sunwoo Kim, and I'm advised by Professor Minje Kim at Indiana University. This material is based upon work supported by the National Science Foundation.
Recent advances in deep learning-based speech enhancement models have shown great performance improvements. These models are typically designed to be large and complex, and trained on a large training dataset, so that they generalize well to various test-time conditions, including different speakers, noises, and signal-to-noise ratios of the added noise.
However, these large models do not fit on small devices, and their computational complexity can be too burdensome. Model compression methods can help reduce the size, but compression introduces errors, and consequently there will be some degradation in performance. We can always design small models that easily fit onto the devices, but then these models won't generalize well.
So we present a personalized speech enhancement solution that adapts small models to the specific speakers and acoustic contexts of the test-time environment. We target practical use cases, for example, a family-owned smart assistant device sitting in a living room, where it suffices for the enhancement model to perform well only in this specific test environment. By focusing on a small subset of test-time data, the small models can be updated to improve performance. But personalization would ordinarily require personal data to use as ground-truth targets for the fine-tuning objective function.
And this is non-trivial, as it can be challenging to obtain the required user information due to privacy concerns: people don't want to share their speech, especially clean utterances. Also, even with user compliance, the recordings obtained during the device enrollment phase can be contaminated with existing background noise.
And another potential issue is that the provided recordings could simply not be long enough. So in this study, we propose an alternative solution that achieves personalization via zero-shot learning, meaning it operates without obtaining ground-truth clean speech data. We present this personalization method based on the knowledge distillation framework: it does not ask for private signals from the user, while it can still adapt to the user's speech and recording environment.
Given noisy test-time data, we can directly use a teacher model's outputs as ground-truth targets to personalize a student model. We'll explain in more detail in the next sections, but first we'll mention some related work. Other personalization methods have been explored through various approaches.
One such example is our Interspeech 2021 paper, which utilizes a self-supervised learning strategy that makes use of widely available noisy test-time recordings to make up for limited clean speech sources: pseudo speech enhancement (pseudo-SE). Pseudo-SE is a framework that enables the use of noisy speech. It begins with an already noisy speech signal s̃ and treats it as if it were the clean speech target by mixing it with an additional noise source n.
It can learn some pseudo-SE function, which works to some degree, but it can be far from the actual enhancement problem if the pseudo target s̃ isn't clean enough. To remedy this issue, we proposed a data purification approach that estimates the cleanness of the frames based on an SNR predictor, which is represented in this graph as a probability p.
During the personalization phase, the model is trained to focus more on the cleaner frames, from which the pseudo-SE method learns more powerful self-supervised features. Eventually, for the successfully estimated clean frames, we expect the pseudo-SE process to behave like a supervised enhancement model learned from clean personal utterances.
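To make the purification idea concrete, here is a minimal sketch of one purified pseudo-SE training step. This is our simplification rather than the paper's exact objective; the `model` argument, the tensor shapes, and the mean-squared-error loss are all assumptions for illustration.

```python
import torch.nn.functional as F

def purified_pseudo_se_loss(model, s_tilde, n, p):
    """One purified pseudo-SE step (illustrative sketch).

    s_tilde: noisy speech used as the pseudo clean target, (batch, frames, freq)
    n:       additional noise mixed in, same shape
    p:       estimated per-frame cleanness probabilities, (batch, frames)
    """
    x = s_tilde + n                    # further-corrupted pseudo input
    s_hat = model(x)                   # model's estimate of the pseudo target
    # Per-frame error against the pseudo target s_tilde
    err = F.mse_loss(s_hat, s_tilde, reduction="none").mean(dim=-1)
    # Data purification: weight frames by estimated cleanness, so training
    # focuses on frames that are likely close to clean speech.
    return (p * err).sum() / p.sum()
```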
Another example is personalization through an ensemble of specialist sub-networks, or local experts. First, all of the sub-networks are independently trained to specialize on targeted latent spaces, such as speaker identity, SNR levels denoting signal degradation, gender, and so on.
During test time, an auxiliary network selects the best specialist for the test-time input signal, thereby reducing computational complexity. Some drawbacks of this approach are that the total memory occupancy is not reduced and that the adaptation is limited to the predefined sub-problems.
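A rough sketch of how such a system dispatches inputs at test time; the names `gating_net` and `experts` are hypothetical, not from the original work:

```python
import torch

def enhance_with_expert(x, gating_net, experts):
    """Pick and run the single best specialist for the input (sketch).

    Only one expert runs per input, which keeps the compute low, but all
    experts must remain in memory, hence the unchanged memory footprint.
    """
    with torch.no_grad():
        scores = gating_net(x)       # one suitability score per expert
        k = int(scores.argmax())     # index of the best-matching specialist
    return experts[k](x)
```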
Now we'll present our proposed method. First, we pre-train a large teacher model T using a large-scale speech corpus and noise dataset in a typical, fully supervised learning framework. Once it's trained, we freeze the teacher model, and its parameters are no longer updated after this step. The teacher model is designed to be large so that it generalizes very well, but again, it can be too big.
Next, we also pre-train a small, efficient student model S on the same dataset. Due to its small architecture, we expect it to make mistakes and not generalize well. Ideally, we would like to fix its mistakes and improve this compact student model, which is tricky because we do not have the ground-truth labels to fine-tune on during test time.
So for test-time adaptation, to complement these missing labels, we use the teacher model's estimates as pseudo targets to fine-tune the student. It can be expected that the larger teacher model will outperform the student models on the test signals due to its large computational capacity, and working under the zero-shot condition, we assume that having these synthesized pseudo targets is better than nothing. It's entirely possible for the teacher's estimates to be imperfect, and they can contain noisy artifacts, but these are still better than the student's estimates. So we assume that the student will learn from these imperfect targets and be able to adapt to its test-time conditions to improve its enhancement performance.
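A minimal sketch of this test-time adaptation loop, assuming models callable on noisy inputs; the optimizer, loss, and hyperparameters are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def personalize(student, teacher, noisy_loader, epochs=10, lr=1e-4):
    """Fine-tune the student on the frozen teacher's estimates (sketch)."""
    teacher.eval()
    for prm in teacher.parameters():         # the teacher stays frozen
        prm.requires_grad_(False)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x in noisy_loader:               # noisy test-time recordings only
            with torch.no_grad():
                pseudo_target = teacher(x)   # teacher output as pseudo ground truth
            loss = F.mse_loss(student(x), pseudo_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

Note that no clean speech ever enters this loop; the only supervision signal is the teacher's own enhanced output.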
And if the teacher's predictions are near perfect, our knowledge distillation framework becomes similar to supervised personalization. For the actual use case, only the student model is ultimately used for inference on the device. We envision the teacher model being placed externally on a cloud server, where the actual fine-tuning operations are conducted.
So the large memory and computation of the teacher model is not going to be burdensome. The parameters of the updated student model can then be transferred to the user device, and given its small size, the transfer is not burdensome either. Another use case is where the teacher model also resides on the device, and the fine-tuning procedure occurs during the device's idle time.
We'll now describe the experimental setup, starting with the model descriptions. Most of our enhancement models are based on the uni-directional gated recurrent unit (GRU) architecture. Among a wide array of model candidates, say temporal convolutional networks or densely connected recurrent networks, we deliberately chose RNNs for their simple yet effective architecture, their success in sequential and temporal processing tasks, and their much lower computational complexity compared to convolutional architectures. Our models are designed to take frequency-domain STFT inputs, and a dense layer transforms the GRU's output into ideal ratio masks. For the student models, the GRU architectures are fixed with two hidden layers, and we varied the hidden units from 32 to 1024. As for the teacher model, we use a 3×1024 GRU architecture, chosen to be large enough to outperform the students. In addition to the larger architecture, we also employ ConvTasNet as an alternate teacher model.
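Our reading of that description, as a small PyTorch sketch; the class name, layer names, and the 513-bin STFT size are our assumptions:

```python
import torch
import torch.nn as nn

class GRUMaskNet(nn.Module):
    """Mask-based GRU enhancer sketch: magnitude STFT frames in,
    ideal-ratio-mask estimates in [0, 1] out."""
    def __init__(self, n_freq=513, hidden=1024, layers=2):
        super().__init__()
        self.gru = nn.GRU(n_freq, hidden, num_layers=layers, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)     # dense layer -> mask

    def forward(self, x):                         # x: (batch, time, n_freq)
        h, _ = self.gru(x)
        return torch.sigmoid(self.mask(h))        # one mask value per TF bin

# Enhancement multiplies the noisy magnitude spectrogram by this mask and
# resynthesizes the waveform with the noisy phase.
```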
The model sizes and computational complexity are presented in the table below. For the GRU models, you can see that increasing the hidden units and number of layers quickly increases the complexity of the model. We can also see that the ConvTasNet teacher model has substantially fewer parameters than the GRU model, but due to its extensive convolutional operations, its computational complexity is much higher.
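As a back-of-the-envelope check using the sketch above (exact numbers depend on the FFT size and implementation details):

```python
def count_params(model):
    return sum(p.numel() for p in model.parameters())

small = GRUMaskNet(hidden=128, layers=2)    # roughly 0.4 M parameters
large = GRUMaskNet(hidden=1024, layers=2)   # roughly 11.6 M parameters
print(count_params(small), count_params(large))
```

With a 513-bin input, the 2×128 model comes out around 28 times smaller than the 2×1024 model, consistent with the roughly 30× gap we point out later.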
And next we'll describe the datasets used for the pre-training and the test-time fine-tuning stages. For pre-training, we use clean speech recordings from the Librispeech corpus and noise recordings from the MUSAN dataset. We use Librispeech's train-clean-100 for training and dev-clean for validation.
We split MUSAN's free-sound subset into training and validation partitions at an 80/20 ratio. The noisy mixtures are obtained by adding the noise to speech signals at random input SNR levels, uniformly chosen between -5 and 10 decibels. For test-time fine-tuning, we use 30 speakers from Librispeech's test-clean subset and noise from the WHAM! corpus.
From these sets, we create 30 unique test environments for fine-tuning. Here, we define an environment by assigning a unique noise location to each speaker, and we create mixtures for each environment by using the clean speech signals from one designated speaker and adding noises from one designated noise location.
For each test environment, we then split the clean speech and noise datasets into separate sets for fine-tuning, validation, and testing. These partitions are approximately five minutes of clean speech for fine-tuning, one minute for validation, and another minute for testing; the noise datasets are prepared similarly. We emphasize this to show that we use unseen samples to test the final performance of the fine-tuned personalization system.
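For reference, mixing a noise signal into speech at a target SNR can be sketched as follows; this is the standard recipe, assuming equal-length 1-D float arrays, not necessarily our exact preprocessing code:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Pre-training draws the input SNR uniformly from [-5, 10] dB:
snr_db = np.random.uniform(-5.0, 10.0)
```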
So here we present the results and discussion. The box plot shows the enhancement performance of various models in environments synthesized at the 0 dB input SNR level. The results are shown for pre-trained and fine-tuned student models, as well as the teacher models for reference. First, we'll introduce the notation.
The pre-trained student model is denoted with script S. The teacher models are script T, with subscripts denoting the network architecture: GRU for the 3×1024 GRU architecture and CTN for ConvTasNet. The fine-tuned student models are denoted with a tilde to differentiate them from their pre-trained versions.
And they have subscripts to indicate what they learned from: so S̃ with subscript GRU is a student model fine-tuned on the GRU teacher's outputs, and likewise for the ConvTasNet. From the box plot, we can see that our proposed personalization framework consistently improves all pre-trained student models across various model sizes.
There is distinct improvement from the fine-tuned models. And we can observe that the personalized models learned from the ConvTasNet teacher always outperform their corresponding ones fine-tuned using the GRU teacher. This showcases that the quality of the teacher model's performance is related to the performance of fine-tuning, and that the structural discrepancy between student and teacher, GRU versus ConvTasNet, is not an issue.
We'll also point out that the smaller fine-tuned student models can outperform a larger generalist. For instance, the fine-tuned 2×128 model learned from the ConvTasNet teacher outperforms the 2×1024 generalist. And from the table of model complexity that we've seen before, the 2×128 model is about 30 times lower in the number of MACs and parameters compared to the 2×1024 version.
And this shows that, instead of scaling up generalist architectures to produce better generalization, it is more advantageous to personalize models. This verifies that personalized speech enhancement is a model compression method in its own right, and it can be a preferred direction over applying generic compression methods.
To conclude, in this study we proposed a knowledge-distillation-based zero-shot learning framework for personalized speech enhancement.
And in this framework, we utilize the teacher's estimates as targets, which otherwise do not exist during test time. From our experiments, we showed that the student models' performance greatly improves on a specific test-time speaker and their acoustic environment. Our personalized student models give superior performance to large generalist models, demonstrating that our framework is another mode of model compression that does not sacrifice model performance. Our framework is simple and works end to end, so we expect that it can provide improvements with different data or loss functions and even be applicable to other domains.
Thank you for listening. For a demo and source code, please refer to our group's website. Thank you.