
{"id":13928,"date":"2025-09-05T07:12:27","date_gmt":"2025-09-05T07:12:27","guid":{"rendered":"https:\/\/whiteriversmediasolutions.com\/Sony\/summarizing-rewind-speech-time-reversal-for-enhancing-speaker-representations-in-diffusion-based-voice-conversion-copy\/"},"modified":"2025-09-05T08:32:12","modified_gmt":"2025-09-05T08:32:12","slug":"attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion","status":"publish","type":"post","link":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/","title":{"rendered":"Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"13928\" class=\"elementor elementor-13928\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-cd44eb5 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"cd44eb5\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9f11b70\" data-id=\"9f11b70\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-215a70e elementor-widget elementor-widget-heading\" data-id=\"215a70e\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">BLOGS<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section 
elementor-top-section elementor-element elementor-element-28dc161 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"28dc161\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-63cf269\" data-id=\"63cf269\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6837436 elementor-widget elementor-widget-heading\" data-id=\"6837436\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9bd1630 elementor-widget elementor-widget-text-editor\" data-id=\"9bd1630\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Kumud Tripathi\u2217, Chowdam Venkata Kumar\u2217, Pankaj Wasnik<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7a034cb elementor-hidden-desktop elementor-hidden-tablet elementor-hidden-mobile elementor-widget elementor-widget-text-editor\" data-id=\"7a034cb\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>30<sup>th<\/sup> September 2024<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-454e546 elementor-widget elementor-widget-text-editor\" data-id=\"454e546\" data-element_type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Kumud Tripathi summarises the paper titled <a href=\"https:\/\/arxiv.org\/abs\/2506.01365\">Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion<\/a>, co-authored with Chowdam Venkata Kumar and Pankaj Wasnik and accepted in the Main Track at the <a href=\"https:\/\/www.interspeech2025.org\/home\"><strong>26th edition of the Interspeech 2025 Conference<\/strong> | <strong>August 17-21, 2025<\/strong><\/a>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f0a3e28 elementor-widget elementor-widget-text-editor\" data-id=\"f0a3e28\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<h4><strong>Introduction<\/strong><\/h4>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9202657 elementor-widget elementor-widget-text-editor\" data-id=\"9202657\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tVoice Activity Detection (VAD) identifies speech segments in audio and plays a crucial role in speech technologies such as automatic speech recognition (ASR), speaker recognition, and virtual assistants. Traditional VAD approaches, which rely on acoustic features such as MFCCs and energy thresholds, struggle in noisy, real-world settings. Modern deep learning methods using CNNs and RNNs improve robustness but depend heavily on large labeled datasets. Recently, pre-trained models (PTMs) like wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper have shown promise by leveraging vast unlabeled data to learn robust speech representations. 
Although PTMs have demonstrated success in VAD through fine-tuning, little is known about <em>why<\/em> they work well or how they compare with traditional features like MFCC. This work conducts a detailed comparison and proposes feature fusion strategies. Results on datasets such as AMI and VoxConverse show that combining MFCC and PTM features, especially through simple methods such as concatenation, yields better VAD performance than complex fusion techniques.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-682ac84 elementor-widget elementor-widget-text-editor\" data-id=\"682ac84\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>The key contributions of this work are as follows:<\/strong><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f6c8504 elementor-widget elementor-widget-text-editor\" data-id=\"f6c8504\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>Examines the role of the attention mechanism in feature fusion for VAD and shows that attention-based fusion is not always necessary for effective speech and non-speech classification.<\/li><li>Introduces a simple yet effective feature fusion method that combines MFCC and PTM representations.<\/li><li>Conducts a comprehensive analysis of state-of-the-art PTMs to evaluate their effectiveness for VAD.<\/li><li>Demonstrates that addition-based feature fusion enhances both accuracy and computational efficiency.<\/li><\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a7d1e72 elementor-widget elementor-widget-image\" data-id=\"a7d1e72\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" data-src=\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/elementor\/thumbs\/blog-kumud-rbarwsxjp5lf0jzs4oejxstzgb7r95rrp2l1km0cio.png\" title=\"blog-kumud\" alt=\"blog-kumud\" src=\"data:image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\" class=\"lazyload\" style=\"--smush-placeholder-width: 700px; --smush-placeholder-aspect-ratio: 700\/368;\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c793aaa elementor-widget elementor-widget-text-editor\" data-id=\"c793aaa\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Fig. 1: Overview of the FusionVAD Framework with Different Feature Fusion Strategies.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4364049 elementor-widget elementor-widget-text-editor\" data-id=\"4364049\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>Results: <\/strong><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5f41fc9 elementor-widget elementor-widget-text-editor\" data-id=\"5f41fc9\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Table 1 shows VAD performance with and without feature fusion. The results show that Whisper outperforms the other models. MFCCs have a higher false alarm rate (FAR) but a lower miss rate (MR) than most PTMs, suggesting that they capture noisy speech, while PTMs reduce FAR but miss some speech. 
This indicates MFCC and PTM features carry complementary information that can enhance performance when fused.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2383b85 elementor-widget elementor-widget-text-editor\" data-id=\"2383b85\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>Table 1: <\/strong>Performance (in %) of Voice activity detection with and without feature fusion. *Bold represents the best result.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7128f76 elementor-widget elementor-widget-image\" data-id=\"7128f76\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" data-src=\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/elementor\/thumbs\/blog-kumud-results-rbarxwep8hjq9crjkh3gokun3iabgpnf9vsrhch9aa.png\" title=\"blog-kumud-results\" alt=\"blog-kumud-results\" src=\"data:image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\" class=\"lazyload\" style=\"--smush-placeholder-width: 500px; --smush-placeholder-aspect-ratio: 500\/109;\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f457432 elementor-widget elementor-widget-text-editor\" data-id=\"f457432\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>Table 2:<\/strong> Comparison (in %) of best performing fusion model with baseline Pyannote.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-762718d elementor-widget elementor-widget-image\" data-id=\"762718d\" data-element_type=\"widget\" 
data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" data-src=\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/elementor\/thumbs\/blog-kumud-results-1-rbaryno0qol1m1ny5avn6vz0bojynxnn1mpuedcttq.png\" title=\"blog-kumud-results-1\" alt=\"blog-kumud-results\" src=\"data:image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\" class=\"lazyload\" style=\"--smush-placeholder-width: 500px; --smush-placeholder-aspect-ratio: 500\/51;\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-36de7e0 elementor-widget elementor-widget-text-editor\" data-id=\"36de7e0\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Figure 2 shows that simple fusion methods like concatenation and addition yield speech segment boundaries closer to the ground truth, while cross-attention performs inconsistently. 
This supports the idea that VAD benefits more from lightweight fusion strategies than from complex, parameter-heavy methods like cross-attention.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-57bbe2b elementor-widget elementor-widget-image\" data-id=\"57bbe2b\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"609\" height=\"914\" src=\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/blog-kumud-results-2.png\" class=\"attachment-medium_large size-medium_large wp-image-13934\" alt=\"blog-kumud-results\" srcset=\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/blog-kumud-results-2.png 609w, https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/blog-kumud-results-2-200x300.png 200w\" sizes=\"(max-width: 609px) 100vw, 609px\" style=\"width:100%;height:150.08%;max-width:609px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0639b64 elementor-widget elementor-widget-text-editor\" data-id=\"0639b64\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>Figure 2: <\/strong>Feature fusion outputs (Green: Addition, Red: Concatenation, and Purple: Cross-Attention) along with the original reference (Yellow) for all FusionVAD models on a single audio segment from the AMI file \u201cEN2004a\u201d.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cd9e70e elementor-widget elementor-widget-text-editor\" data-id=\"cd9e70e\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<h4><strong>Conclusion<\/strong><\/h4>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-487414c elementor-widget elementor-widget-text-editor\" data-id=\"487414c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>This study shows that combining MFCC and PTM features using simple fusion methods like addition and concatenation significantly improves VAD performance. Addition performs best in most cases, while cross-attention adds complexity without benefit. The best fusion model outperforms Pyannote by 2.04% in DER, showing that lightweight methods are both effective and efficient. These results suggest that simple strategies may also enhance performance in other speech tasks without increasing computational cost.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ef1ef1e elementor-widget elementor-widget-text-editor\" data-id=\"ef1ef1e\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<h4><strong>Citation<\/strong><\/h4>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-817d816 elementor-widget elementor-widget-text-editor\" data-id=\"817d816\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>@article{tripathi2025attention,<\/p><p>\u00a0 title={Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion},<\/p><p>\u00a0 author={Tripathi, Kumud and Kumar, Chowdam Venkata and Wasnik, Pankaj},<\/p><p>\u00a0 journal={arXiv preprint arXiv:2506.01365},<\/p><p>\u00a0 year={2025}<\/p><p>}<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div 
class=\"elementor-element elementor-element-61b5c1a elementor-widget elementor-widget-text-editor\" data-id=\"61b5c1a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>To know more about Sony Research India\u2019s Research Publications, visit the \u2018Publications\u2019 section on our \u2018Open Innovation\u2019s page:\u00a0<a href=\"https:\/\/www.sonyresearchindia.com\/open-innovation\/\">Open Innovation with Sony R&amp;D \u2013 Sony Research India<\/a><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Kumud Tripathi summarises paper titled 
Attention&#8230;<\/p>\n","protected":false},"author":1,"featured_media":13938,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"elementor_header_footer","format":"standard","meta":{"footnotes":""},"categories":[22,17],"tags":[],"class_list":["post-13928","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-all-blogs","category-technology","entry"],"yoast_head":"\n<title>Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion - Sony Research India<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion - Sony Research India\" \/>\n<meta property=\"og:description\" content=\"Kumud Tripathi summarises paper titled Attention...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/\" \/>\n<meta property=\"og:site_name\" content=\"Sony Research India\" \/>\n<meta property=\"article:published_time\" content=\"2025-09-05T07:12:27+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-09-05T08:32:12+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/Blog-Thumbnail_-Kumud-VAD_-Interspeech2025.png\" \/>\n\t<meta property=\"og:image:width\" content=\"380\" \/>\n\t<meta property=\"og:image:height\" content=\"190\" \/>\n\t<meta property=\"og:image:type\" 
content=\"image\/png\" \/>\n<meta name=\"author\" content=\"sri_user@2021\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"sri_user@2021\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/\"},\"author\":{\"name\":\"sri_user@2021\",\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/#\/schema\/person\/589cf1e285a7c37cf0cb9feba7ae4338\"},\"headline\":\"Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion\",\"datePublished\":\"2025-09-05T07:12:27+00:00\",\"dateModified\":\"2025-09-05T08:32:12+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/\"},\"wordCount\":761,\"publisher\":{\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/#organization\"},\"image\":{\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/Blog-Thumbnail_-Kumud-VAD_-Interspeech2025.png\",\"articleSection\":[\"All 
Blogs\",\"Technology\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/\",\"url\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/\",\"name\":\"Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion - Sony Research India\",\"isPartOf\":{\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/Blog-Thumbnail_-Kumud-VAD_-Interspeech2025.png\",\"datePublished\":\"2025-09-05T07:12:27+00:00\",\"dateModified\":\"2025-09-05T08:32:12+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#primaryimage\",\"url\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/Blog-Thumbnail_-Kumud-VAD_-Interspeech2025.png\",\"con
tentUrl\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/Blog-Thumbnail_-Kumud-VAD_-Interspeech2025.png\",\"width\":380,\"height\":190,\"caption\":\"Blog Thumbnail\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/#website\",\"url\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/\",\"name\":\"Sony Research India\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/#organization\",\"name\":\"sonyresearchindia\",\"url\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2023\/03\/Sony_Logo.png\",\"contentUrl\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2023\/03\/Sony_Logo.png\",\"width\":168,\"height\":31,\"caption\":\"sonyresearchindia\"},\"image\":{\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/#\/schema\/logo\/image\/\"}},{\"@type
\":\"Person\",\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/#\/schema\/person\/589cf1e285a7c37cf0cb9feba7ae4338\",\"name\":\"sri_user@2021\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/whiteriversmediasolutions.com\/Sony\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/e0c9edcfb42567c720cc449d4b1e0812298e8172a5a7e4296127a0adba7e705b?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/e0c9edcfb42567c720cc449d4b1e0812298e8172a5a7e4296127a0adba7e705b?s=96&d=mm&r=g\",\"caption\":\"sri_user@2021\"},\"sameAs\":[\"http:\/\/whiteriversmediasolutions.com\/staging\/SRI\"]}]}<\/script>\n","yoast_head_json":{"title":"Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion - Sony Research India","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/","og_locale":"en_US","og_type":"article","og_title":"Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion - Sony Research India","og_description":"Kumud Tripathi summarises paper titled Attention...","og_url":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/","og_site_name":"Sony Research India","article_published_time":"2025-09-05T07:12:27+00:00","article_modified_time":"2025-09-05T08:32:12+00:00","og_image":[{"width":380,"height":190,"url":"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/Blog-Thumbnail_-Kumud-VAD_-Interspeech2025.png","type":"image\/png"}],"author":"sri_user@2021","twitter_card":"summary_large_image","twitter_misc":{"Written by":"sri_user@2021","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#article","isPartOf":{"@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/"},"author":{"name":"sri_user@2021","@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/#\/schema\/person\/589cf1e285a7c37cf0cb9feba7ae4338"},"headline":"Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion","datePublished":"2025-09-05T07:12:27+00:00","dateModified":"2025-09-05T08:32:12+00:00","mainEntityOfPage":{"@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/"},"wordCount":761,"publisher":{"@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/#organization"},"image":{"@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#primaryimage"},"thumbnailUrl":"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/Blog-Thumbnail_-Kumud-VAD_-Interspeech2025.png","articleSection":["All Blogs","Technology"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/","url":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/","name":"Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion - Sony Research 
India","isPartOf":{"@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/#website"},"primaryImageOfPage":{"@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#primaryimage"},"image":{"@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#primaryimage"},"thumbnailUrl":"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/Blog-Thumbnail_-Kumud-VAD_-Interspeech2025.png","datePublished":"2025-09-05T07:12:27+00:00","dateModified":"2025-09-05T08:32:12+00:00","breadcrumb":{"@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#primaryimage","url":"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/Blog-Thumbnail_-Kumud-VAD_-Interspeech2025.png","contentUrl":"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2025\/09\/Blog-Thumbnail_-Kumud-VAD_-Interspeech2025.png","width":380,"height":190,"caption":"Blog Thumbnail"},{"@type":"BreadcrumbList","@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/attention-is-not-always-the-answer-optimizing-voice-activity-detection-with-simple-feature-fusion\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/whiteriversmediasolutions.com\/Sony\/"},{"@type":"ListItem","position":2,"name":"Attention Is Not Always the Answer: Optimizing 
Voice Activity Detection with Simple Feature Fusion"}]},{"@type":"WebSite","@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/#website","url":"https:\/\/whiteriversmediasolutions.com\/Sony\/","name":"Sony Research India","description":"","publisher":{"@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/whiteriversmediasolutions.com\/Sony\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/#organization","name":"sonyresearchindia","url":"https:\/\/whiteriversmediasolutions.com\/Sony\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/#\/schema\/logo\/image\/","url":"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2023\/03\/Sony_Logo.png","contentUrl":"https:\/\/whiteriversmediasolutions.com\/Sony\/uvaftoap\/2023\/03\/Sony_Logo.png","width":168,"height":31,"caption":"sonyresearchindia"},"image":{"@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/#\/schema\/person\/589cf1e285a7c37cf0cb9feba7ae4338","name":"sri_user@2021","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/whiteriversmediasolutions.com\/Sony\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/e0c9edcfb42567c720cc449d4b1e0812298e8172a5a7e4296127a0adba7e705b?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/e0c9edcfb42567c720cc449d4b1e0812298e8172a5a7e4296127a0adba7e705b?s=96&d=mm&r=g","caption":"sri_user@2021"},"sameAs":["http:\/\/whiteriversmediasolutions.com\/staging\/SRI"]}]}},"_links":{"self":[{"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/posts\/13928","t
argetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/comments?post=13928"}],"version-history":[{"count":9,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/posts\/13928\/revisions"}],"predecessor-version":[{"id":13949,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/posts\/13928\/revisions\/13949"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/media\/13938"}],"wp:attachment":[{"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/media?parent=13928"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/categories?post=13928"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/whiteriversmediasolutions.com\/Sony\/wp-json\/wp\/v2\/tags?post=13928"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}