Kumud Tripathi∗, Chowdam Venkata Kumar∗, Pankaj Wasnik
30th September 2024
Kumud Tripathi summarises the paper titled "Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion", co-authored with Chowdam Venkata Kumar and Pankaj Wasnik and accepted in the Main Track of Interspeech 2025 (the 26th edition of the conference), held August 17-21, 2025.
The key contributions of this work are as follows:
Fig. 1: Overview of the FusionVAD Framework with Different Feature Fusion Strategies.
Results:
The performance of voice activity detection (VAD) with and without feature fusion is shown in Table 1. The results show that Whisper outperforms the other models on VAD. MFCCs have a higher false alarm rate (FAR) but a lower miss rate (MR) than most pre-trained models (PTMs), suggesting that they capture noisy speech, while PTMs reduce the FAR but miss some speech. This indicates that MFCC and PTM features carry complementary information that can enhance performance when fused.
Table 1: Performance (in %) of Voice activity detection with and without feature fusion. *Bold represents the best result.
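To make the FAR/MR trade-off above concrete, here is a minimal sketch of how these two frame-level error rates can be computed from binary VAD predictions. The function name `vad_error_rates` and the toy labels are illustrative, not from the paper:

```python
import numpy as np

def vad_error_rates(pred, ref):
    """Frame-level false alarm rate (FAR) and miss rate (MR).

    pred, ref: binary sequences (1 = speech, 0 = non-speech), one entry per frame.
    FAR = non-speech frames wrongly flagged as speech / total non-speech frames.
    MR  = speech frames missed / total speech frames.
    """
    pred, ref = np.asarray(pred), np.asarray(ref)
    far = np.sum((pred == 1) & (ref == 0)) / max(np.sum(ref == 0), 1)
    mr = np.sum((pred == 0) & (ref == 1)) / max(np.sum(ref == 1), 1)
    return far, mr

# Toy example: 8 frames, one false alarm and one miss.
ref = [0, 0, 1, 1, 1, 1, 0, 0]
pred = [0, 1, 1, 1, 0, 1, 0, 0]
far, mr = vad_error_rates(pred, ref)  # far = 0.25, mr = 0.25
```

A high-FAR/low-MR system (like raw MFCCs here) over-triggers on noise, while a low-FAR/high-MR system (like some PTMs) is conservative and drops speech; fusion aims to get the best of both.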
Table 2: Comparison (in %) of best performing fusion model with baseline Pyannote.
Figure 2 shows that simple fusion methods like concatenation and addition yield speech segment boundaries closer to the ground truth, while cross-attention performs inconsistently. This supports the idea that VAD benefits more from lightweight fusion strategies than from complex, parameter-heavy methods such as cross-attention.
Figure 2: Feature fusion outputs (Green: Addition, Red: Concatenation, and Purple: Cross-Attention) along with the original reference (Yellow) for all FusionVAD models on a single audio segment from the AMI file "EN2004a".
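The three fusion strategies compared above can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's implementation: it assumes the MFCC stream has already been projected to the same feature dimension as the PTM stream, and the cross-attention is a single unparameterized head for clarity (the paper's version would use learned projections):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 100, 256                      # frames, shared feature dimension (assumed)
mfcc = rng.standard_normal((T, d))   # MFCC features, assumed projected to d dims
ptm = rng.standard_normal((T, d))    # pre-trained model (PTM) features

# Addition: element-wise sum; keeps dimensionality and adds no parameters.
fused_add = mfcc + ptm                              # shape (T, d)

# Concatenation: stack along the feature axis; downstream layer sees 2*d dims.
fused_cat = np.concatenate([mfcc, ptm], axis=-1)    # shape (T, 2*d)

# Cross-attention (single head, illustrative): MFCC frames act as queries
# attending over PTM frames as keys/values.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

attn = softmax(mfcc @ ptm.T / np.sqrt(d))           # (T, T) attention weights
fused_xattn = attn @ ptm                            # shape (T, d)
```

Note the asymmetry in cost: addition and concatenation are parameter-free tensor operations, whereas cross-attention introduces a T-by-T attention map (plus learned projections in practice), which is exactly the overhead the paper finds unnecessary for VAD.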
This study shows that combining MFCC and PTM features using simple fusion methods like addition and concatenation significantly improves VAD performance. Addition performs best in most cases, while cross-attention adds complexity without benefit. The best fusion model outperforms Pyannote by 2.04% DER, demonstrating that lightweight methods are both effective and efficient. These results suggest that simple fusion strategies could enhance performance in other speech tasks as well, without increasing computational cost.
@article{tripathi2025attention,
title={Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion},
author={Tripathi, Kumud and Kumar, Chowdam Venkata and Wasnik, Pankaj},
journal={arXiv preprint arXiv:2506.01365},
year={2025}
}
To know more about Sony Research India’s research publications, visit the ‘Publications’ section on our ‘Open Innovation’ page: Open Innovation with Sony R&D – Sony Research India