Narjis Asad, Nihar Ranjan Sahoo, et al.
ACL 2025
Dialog history enhances downstream classification performance in both speech- and text-based dialog systems. However, a gap remains in how dialog history is integrated into a fully end-to-end (E2E) spoken dialog system (SDS) versus a textual dialog system. Text-based dialog systems use large language models (LLMs) to encode long-range dependencies by attending to the entire conversation as a contiguous token sequence. This is not possible in an E2E SDS, as speech sequences can be intractably long. We propose a convolution subsampling approach that makes the speech sequence of a conversation tractable, and use a conformer to attend to the speech-based conversation in a fine-grained manner. The model is further enhanced via conversation-level knowledge transfer from an LLM using a token-level alignment strategy. Finetuning the E2E model pretrained this way yields significant gains of up to 8% over strong non-contextual baselines on the E2E dialog act classification task across two datasets.
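The core idea of convolution subsampling is that a strided convolution shortens a long frame sequence at each layer, so stacking a few layers reduces a conversation's speech sequence to a tractable length. A minimal sketch in plain Python, using a naive averaging filter in place of the paper's learned filters and conformer:

```python
def conv1d_subsample(seq, kernel_size=3, stride=2):
    """Naive strided 1-D convolution over a frame sequence.

    Uses a uniform averaging window as a stand-in for learned
    filter weights; only the length reduction is illustrated.
    """
    out = []
    for start in range(0, len(seq) - kernel_size + 1, stride):
        window = seq[start:start + kernel_size]
        out.append(sum(window) / kernel_size)
    return out

# Stand-in for a long sequence of speech frames (values are arbitrary).
frames = [float(i) for i in range(1000)]

once = conv1d_subsample(frames)    # stride 2 roughly halves the length
twice = conv1d_subsample(once)     # two layers give roughly 4x reduction
print(len(frames), len(once), len(twice))  # → 1000 499 249
```

Each stride-2 layer produces floor((L - kernel_size) / stride) + 1 outputs, so two layers shrink 1000 frames to 249; the actual model would apply learned convolution filters before attending over the shortened sequence.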
Amar Prakash Azad, Supriyo Ghosh, et al.
IAAI 2022
Pin-Yu Chen, Chao-han Huck Yang, et al.
INTERSPEECH 2023
Michelle Brachman, Christopher Bygrave, et al.
AAAI 2022