
# Decoding 'Efficient infusion of self-supervised representations in Automatic Speech Recognition'

By Darshan Prabhu at Sony Research India

13th December 2023
Darshan Prabhu from the Content Analysis-Audio team summarises the research paper ['Efficient infusion of self-supervised representations in Automatic Speech Recognition'](https://neurips2023-enlsp.github.io/papers/paper_78.pdf), which he co-authored with Sai Ganesh Mirishkar and Pankaj Wasnik from Sony Research India.

The team will present the paper at the poster session of the [NeurIPS 3rd Workshop on Efficient Natural Language and Speech Processing](https://neurips.cc/), hosted in New Orleans from 10th-16th December 2023.

## Summary

Self-supervised learning (SSL) models such as [wav2vec 2.0](https://arxiv.org/abs/2006.11477) and [HuBERT](https://arxiv.org/abs/2106.07447) have demonstrated state-of-the-art results on speech-related tasks. Given their effectiveness, it is natural to use them in conventional Automatic Speech Recognition (ASR) systems. While some approaches incorporate these models directly as an encoder or a frontend, training such systems is extremely slow and requires many computation cycles. Under a limited training budget, an alternative is to use only the representations produced by an SSL model rather than making the model itself part of the network. Since the focus is then solely on the representations, they can be extracted beforehand as a pre-processing step, which is easily parallelised across jobs. This one-time step eliminates the need for the SSL model during training, significantly reducing training time while sacrificing only a minimal amount of recognition accuracy. This idea is not new: it has been explored in NLP, where representations from models like BERT are employed for neural machine translation ([Incorporating BERT into Neural Machine Translation](https://arxiv.org/abs/2002.06823)).

*Figure 1*

In this work, we propose an end-to-end ASR architecture that efficiently integrates self-supervised model representations into the speech encoder. We accomplish this with a fusion layer that ranges from simple framewise addition to a more complex cross-attention mechanism. Our complete architecture is illustrated in Figure 1.
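The one-time extraction step described above can be sketched as a simple caching pass over the dataset. The `extract_ssl_features` stub below stands in for a real SSL model's forward pass (its 768-dimensional output and 320-sample hop mirror typical HuBERT-base settings at 16 kHz); the function names and `.npy` file layout are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from pathlib import Path


def extract_ssl_features(waveform: np.ndarray, frame_hop: int = 320) -> np.ndarray:
    """Stand-in for a real SSL model: maps a waveform of n samples to a
    (n // frame_hop, 768) representation. A real model would run a forward
    pass here; we fake the output with random values for illustration."""
    n_frames = max(1, len(waveform) // frame_hop)
    return np.random.default_rng(0).standard_normal((n_frames, 768)).astype(np.float32)


def cache_dataset_features(utterances: dict, out_dir: Path) -> None:
    """One-time pre-processing: dump each utterance's SSL representation to
    disk so the SSL model is never needed during ASR training. Each call is
    independent, so this loop is trivially parallelisable across jobs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for utt_id, wav in utterances.items():
        np.save(out_dir / f"{utt_id}.npy", extract_ssl_features(wav))
```

During ASR training, the data loader then simply memory-maps or loads these `.npy` files instead of invoking the SSL model.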
*Figure 2*

*Figure 3*
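The two fusion schemes shown in Figures 2 and 3 can be sketched in a few lines of NumPy. This is a simplified illustration under assumptions not taken from the paper: index-based subsampling, single-head attention without learned projections, and a residual add of the attended output.

```python
import numpy as np


def framewise_addition_fusion(enc: np.ndarray, ssl: np.ndarray) -> np.ndarray:
    """Figure 2 style: exploit the (roughly) linear relationship between the
    two sequence lengths by subsampling the SSL sequence to the encoder's
    length, then add frame by frame (assumes a shared feature dimension)."""
    idx = np.linspace(0, len(ssl) - 1, num=len(enc)).round().astype(int)
    return enc + ssl[idx]


def cross_attention_fusion(enc: np.ndarray, ssl: np.ndarray) -> np.ndarray:
    """Figure 3 style: each encoder frame attends over all SSL frames, so no
    length relationship is required. Single-head, projection-free for brevity."""
    d = enc.shape[-1]
    scores = enc @ ssl.T / np.sqrt(d)               # (T_enc, T_ssl)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)        # softmax over SSL frames
    return enc + attn @ ssl                         # residual add of attended info
```

Either function maps an encoder sequence of shape `(T_enc, d)` and an SSL sequence of shape `(T_ssl, d)` to a fused sequence of shape `(T_enc, d)`.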
For the fusion layer, we explored two straightforward approaches, depicted in Figures 2 and 3. Figure 2 gives an overview of framewise addition-based fusion, which capitalises on the linear relationship between the lengths of the two representations: subsampling brings them to equal length, after which they are added frame by frame. Figure 3, in contrast, merges the representations with cross-attention; this approach does not depend on the lengths of the representations and can accommodate representations of any size. Further details on both approaches can be found in our [paper](https://neurips2023-enlsp.github.io/papers/paper_78.pdf).

## Key Findings

*Figure 4*

*Figure 5*
We conducted experiments on the LibriSpeech-100 and TED-LIUM 2 datasets with different choices of SSL model. Overall, our approach achieves a significant reduction in word error rate (WER) despite only a minor increase in model size.

As depicted in Figure 4, using the SSL representations leads to rapid convergence: our model outperforms the baseline after only a few epochs of training.

Even after reducing the size of the encoder by 80% (from 12 encoder layers to only 2), our model continues to outperform the baseline by a significant margin.

Finally, Figure 5 illustrates the attention scores from our cross-attention-based fusion layer. Interestingly, the attention ultimately focuses solely on nearby information despite having access to the entire SSL representation, which suggests that local context matters most for this fusion. In addition, the cross-attention block can be used to obtain an alignment between the two representations.

To know more about Sony Research India's research publications, visit the 'Publications' section on our 'Open Innovation' page: [Open Innovation with Sony R&D – Sony Research India](https://www.sonyresearchindia.com/open-innovation/)
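As a reference for the metric quoted in the findings: WER is word-level Levenshtein distance (substitutions, insertions, and deletions) normalised by the reference length. A minimal self-contained implementation, not the evaluation toolkit used in the paper, looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (free if equal)
        prev = cur
    return prev[-1] / len(ref)
```

For example, hypothesising "a x c" against the reference "a b c d" costs one substitution and one deletion, giving a WER of 2/4 = 0.5.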
ollection":[{"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/comments?post=11496"}],"version-history":[{"count":30,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/posts\/11496\/revisions"}],"predecessor-version":[{"id":11535,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/posts\/11496\/revisions\/11535"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/media\/11530"}],"wp:attachment":[{"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/media?parent=11496"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/categories?post=11496"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/tags?post=11496"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}