Deep Learning Approaches for Multimodal Emotion Recognition: Trends, Issues, and Prospects
Abstract
High-fidelity human-computer interaction now hinges on a machine's ability to decode affective states, a task where single-source data (speech or text alone) consistently falls short. This paper moves beyond traditional unimodal constraints and centers on Multimodal Emotion Recognition (MER) as the primary vehicle for context-aware intelligence. We provide a rigorous deconstruction of current deep learning frameworks, examining how different fusion topologies and network architectures handle cross-modal interference. Beyond a simple performance review, this survey exposes the key weaknesses of state-of-the-art systems: the persistent struggle with cultural bias, the scarcity of high-quality labels, and the inherent opacity of deep models. Our findings suggest that the next frontier lies not in larger models but in "low-rank" self-supervised learning and Explainable AI (XAI). By prioritizing these lean, transparent methodologies, the field can move toward emotion-aware technology that is both ethically robust and deployable at the edge.
Keywords
Deep Learning
Transformers
Multimodal Fusion
Affective Computing