Fusing Audio and Text Features from Earnings Calls Enhances Market Sentiment Prediction

Authors

  • Yongbin Yang, University of Southern California, United States.
  • Mengdie Wang, Shanghai Lixin University of Accounting and Finance, China.
  • Jingyun Yang, Carnegie Mellon University, United States.

DOI:

https://doi.org/10.55220/2576-6759.929

Keywords:

Audio-text fusion, Cross-attention, Deep learning, Earnings calls, Financial NLP, Multimodal sentiment analysis, Speech representation learning, Stock market prediction.

Abstract

Earnings calls (ECs) represent a critical corporate disclosure channel that simultaneously conveys explicit textual content and implicit acoustic signals carrying distinct informational value for financial markets. This paper presents a comprehensive review of methodologies that fuse audio and text features from ECs to enhance market sentiment prediction. We survey the progression from unimodal approaches grounded in natural language processing (NLP) or acoustic modeling to state-of-the-art multimodal architectures that jointly leverage transcribed language and raw speech representations. The emergence of large language models (LLMs) such as FinBERT and GPT-based systems, combined with deep learning (DL)-driven automatic speech recognition (ASR) frameworks including wav2vec 2.0 and HuBERT, has substantially elevated the representational capacity available for this prediction task. Cross-attention fusion mechanisms, late fusion strategies, and gated multimodal units that align textual and prosodic representations are critically examined. Empirical evidence from reviewed studies demonstrates that multimodal fusion consistently outperforms unimodal baselines, yielding gains in directional stock-return accuracy and volatility forecasting of up to 15 percentage points. Open challenges including data scarcity, speaker diarization errors, and acoustic-transcript temporal misalignment are discussed alongside promising future research directions. This review offers a structured synthesis of the field and identifies the architectural and data-infrastructure prerequisites for production-grade multimodal financial sentiment systems.
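To make the cross-attention fusion pattern discussed in the abstract concrete, the sketch below shows one common formulation: text-token queries attend over audio-frame keys and values, and the attended audio context is concatenated with the text features before a downstream sentiment head. This is a minimal illustrative NumPy implementation, not the architecture of any specific reviewed paper; the random projection matrices stand in for weights that would be learned, and the feature dimensions (e.g., FinBERT token embeddings as text features, wav2vec 2.0 frame embeddings as audio features) are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(text_feats, audio_feats, d_k=16, seed=0):
    """Fuse text and audio sequences via cross-attention.

    text_feats:  (T_text, d_text)   e.g. transcript token embeddings
    audio_feats: (T_audio, d_audio) e.g. speech frame embeddings
    Returns the fused features (T_text, d_text + d_k) and the
    (T_text, T_audio) attention map. W_q/W_k/W_v are random
    stand-ins for learned projection weights.
    """
    rng = np.random.default_rng(seed)
    W_q = rng.normal(size=(text_feats.shape[1], d_k))
    W_k = rng.normal(size=(audio_feats.shape[1], d_k))
    W_v = rng.normal(size=(audio_feats.shape[1], d_k))

    Q = text_feats @ W_q   # queries come from the text modality
    K = audio_feats @ W_k  # keys come from the audio modality
    V = audio_feats @ W_v  # values come from the audio modality

    # Scaled dot-product attention: each text token weights audio frames.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    context = attn @ V  # per-token audio context

    # Concatenate text features with their attended audio context.
    fused = np.concatenate([text_feats, context], axis=1)
    return fused, attn
```

Because queries come from one modality and keys/values from the other, each transcript token can be modulated by prosodic evidence from temporally unaligned audio frames, which is the main appeal of cross-attention over simple late fusion of pooled unimodal scores.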

Published

2026-04-23

How to Cite

Yang, Y., Wang, M., & Yang, J. (2026). Fusing Audio and Text Features from Earnings Calls Enhances Market Sentiment Prediction. Asian Business Research Journal, 11(4), 41–48. https://doi.org/10.55220/2576-6759.929
