arXiv:2312.04293v1 [cs.CV] 7 Dec 2023

GPT-4V with Emotion: A Zero-shot Benchmark for Multimodal Emotion Understanding

Zheng Lian1, Licai Sun1,2, Haiyang Sun1,2, Kang Chen3, Zhuofan Wen1,2, Hao Gu1,2, Shun Chen1,2, Bin Liu1, Jianhua Tao4
1 Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
3 Peking University, 4 Tsinghua University
lianzheng2016@ia.ac.cn

Preprint. Under review.

Abstract

Recently, GPT-4 with Vision (GPT-4V) has shown remarkable performance across various multimodal tasks. However, its efficacy in emotion recognition remains an open question. This paper quantitatively evaluates GPT-4V's capabilities in multimodal emotion understanding, encompassing facial emotion recognition, visual sentiment analysis, micro-expression recognition, dynamic facial emotion recognition, and multimodal emotion recognition. Our experiments show that GPT-4V exhibits impressive multimodal and temporal understanding capabilities, even surpassing supervised systems on some tasks. Despite these achievements, GPT-4V is currently tailored to general domains: it performs poorly on micro-expression recognition, which requires specialized expertise. The main purpose of this paper is to present quantitative results of GPT-4V on emotion understanding and to establish a zero-shot benchmark for future research. Code and evaluation results are available at: https://github.com/zeroQiaoba/gpt4v-emotion.

1 Introduction

Multimodal emotion understanding has attracted increasing attention from researchers due to its diverse applications. The task aims to integrate multimodal information (i.e., image, video, audio, and text) to understand emotions. In September 2023, GPT-4V was integrated into ChatGPT, and numerous user reports appeared investigating its visual capabilities [3–6]. However, these tests generally select a limited number of samples per task and offer only qualitative assessments of GPT-4V's performance. In November, OpenAI released the GPT-4V API, but it was initially limited to 100 requests per day, which made evaluation on benchmark datasets difficult. Recently, OpenAI increased this daily limit, which allows us to compare GPT-4V systematically with existing methods.

Emotions can be conveyed through multiple modalities, but the current GPT-4V only supports image and text inputs. For video, we split each clip into multiple frames, and GPT-4V is able to capture the temporal information across these frames. For audio, we attempt to convert the signal to mel-spectrograms to capture paralinguistic information; however, GPT-4V refuses to recognize mel-spectrograms. Consequently, our evaluation predominantly focuses on images, videos, and text.

This paper evaluates the performance of GPT-4V in multimodal emotion understanding. To the best of our knowledge, this is the first work to quantitatively evaluate the performance of GPT-4V on emotional tasks. We hope that our work can establish a zero-shot benchmark for subsequent research and inspire future directions in affective computing.
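As a concrete illustration of the audio preprocessing mentioned above, the sketch below converts a waveform into a mel-spectrogram image with librosa. The sampling rate, number of mel bands, and color map are our own assumptions for illustration; the paper does not report these parameters.

```python
import librosa
import matplotlib.pyplot as plt
import numpy as np

def audio_to_melspectrogram_image(wav_path: str, png_path: str,
                                  sr: int = 16000, n_mels: int = 128) -> None:
    """Convert an audio file into a mel-spectrogram image (assumed preprocessing)."""
    y, sr = librosa.load(wav_path, sr=sr)                       # load and resample the waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)               # log scale for visualization
    plt.imsave(png_path, mel_db, origin="lower", cmap="magma")  # save as an RGB image

# Example: audio_to_melspectrogram_image("sample.wav", "sample_mel.png")
```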
Table 1: Dataset statistics for different tasks.

Task (Modality)                                Dataset            # Test Samples
Facial Emotion Recognition (Image)             CK+ [8]            981
                                               SFEW 2.0 [9]       436
                                               FERPlus [10]       3,589
                                               RAF-DB [11]        3,068
                                               AffectNet [12]     4,000
Visual Sentiment Analysis (Image)              Twitter I [13]     1,269
                                               Twitter II [14]    603
                                               Abstract [15]      228
                                               ArtPhoto [15]      806
Micro-expression Recognition (Image)           CASME [16]         195
                                               CASME II [17]      247
                                               SAMM [18]          159
Dynamic Facial Emotion Recognition (Video)     DFEW (fd1) [19]    2,341
                                               FERV39k [20]       7,847
                                               RAVDESS [21]       1,440
                                               eNTERFACE05 [22]   1,287
Multimodal Emotion Recognition (Video + Text)  MER2023 [23]       411
                                               CH-SIMS [24]       457
                                               CMU-MOSI [25]      686

2 GPT-4V with Emotion

This paper evaluates the GPT-4 API (gpt-4-1106-preview). Currently, the API enforces three request limits: tokens per minute (TPM), requests per minute (RPM), and requests per day (RPD). To satisfy the RPM and RPD limits, we follow previous work [7] and adopt batch input. Take facial expression recognition as an example; prompts for the other tasks can be found in Appendix A.

Prompt: Please play the role of a facial expression classification expert. We provide 20 images. Please ignore the speaker's identity and focus on the facial expression. For each image, please sort the provided categories from high to low according to the similarity with the input. Here are the optional categories: [happy, sad, angry, fearful, disgusted, surprised, neutral]. The output format should be {'name':, 'result':} for each image.

However, large batch sizes introduce the risk of TPM errors and incorrect prediction counts. For example, inputting 30 samples in a batch might produce only 28 predictions. To alleviate this problem, we set the batch size to 20 for image-level inputs and 6 for video-level inputs.

Multimodal emotion understanding centers on the analysis of human feelings, but analyzing individuals can trigger security errors. To alleviate these errors, we require GPT-4V to ignore the speaker's identity. Nevertheless, security errors persist during evaluation, and their occurrence seems arbitrary. For example, although all images are human-centered, some pass the security checks while others fail. Alternatively, a sample may initially fail the security check but pass upon retry. We suggest that future versions of GPT-4V enhance the consistency of its security checks.
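To make the batched protocol of Section 2 concrete, the sketch below sends one batch of images together with the facial-expression prompt through the OpenAI chat API and retries when a request is rejected (e.g., by a rate limit or a security check). The model name, retry schedule, and token budget are illustrative assumptions and may differ from the released code.

```python
import base64
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4-vision-preview"  # assumed vision-capable model; Section 2 names gpt-4-1106-preview

PROMPT = ("Please play the role of a facial expression classification expert. "
          "We provide 20 images. Please ignore the speaker's identity and focus on the facial "
          "expression. For each image, please sort the provided categories from high to low "
          "according to the similarity with the input. Here are the optional categories: "
          "[happy, sad, angry, fearful, disgusted, surprised, neutral]. "
          "The output format should be {'name':, 'result':} for each image.")

def query_batch(image_paths, max_retries=3, wait_seconds=30):
    """Send one batch of images with the shared prompt; retry on transient errors or refusals."""
    content = [{"type": "text", "text": PROMPT}]
    for path in image_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": content}],
                max_tokens=2048,
            )
            return resp.choices[0].message.content
        except Exception as err:  # rate limits, security rejections, etc.
            print(f"attempt {attempt + 1} failed: {err}")
            time.sleep(wait_seconds)
    return None  # the caller may re-queue this batch later
```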
3 Task Description

In this paper, we evaluate the zero-shot performance of GPT-4V across five tasks. Dataset statistics and labeling methods are summarized in Tables 1∼2.

Table 2: Labeling methods for different datasets.

Dataset           Labels
CK+ [8]           happy, sad, angry, fearful, disgusted, surprised, contempt
SFEW 2.0 [9]      happy, sad, angry, fearful, disgusted, surprised, neutral
RAF-DB [11]       happy, sad, angry, fearful, disgusted, surprised, neutral
FERPlus [10]      happy, sad, angry, fearful, disgusted, surprised, neutral, contempt
AffectNet [12]    happy, sad, angry, fearful, disgusted, surprised, neutral, contempt
Twitter I [13]    positive, negative
Twitter II [14]   positive, negative
Abstract [15]     amused, sad, angry, fearful, disgusted, awed, content, excited
ArtPhoto [15]     amused, sad, angry, fearful, disgusted, awed, content, excited
CASME [16]        happy, sad, fearful, surprised, disgusted, repressed, contempt, tensed
CASME II [17]     happy, sad, fearful, surprised, disgusted, repressed, others
SAMM [18]         happy, sad, fearful, surprised, disgusted, contempt, angry, others
eNTERFACE05 [22]  happy, sad, angry, fearful, disgusted, surprised
DFEW [19]         happy, sad, angry, fearful, disgusted, surprised, neutral
FERV39k [20]      happy, sad, angry, fearful, disgusted, surprised, neutral
RAVDESS [21]      happy, sad, angry, fearful, disgusted, surprised, neutral, calm
CH-SIMS [24]      sentiment intensity
CMU-MOSI [25]     sentiment intensity
MER2023 [23]      happy, sad, angry, surprised, neutral, worried

Facial Emotion Recognition Recognizing facial emotions is a fundamental task in emotion understanding. We select five benchmark datasets for evaluation. CK+ [8] contains 593 video sequences from 123 subjects. We follow previous work [26] and extract the last three frames of each sequence for emotion recognition. SFEW 2.0 [9] extracts key frames from movie clips, encompassing various head poses, occlusions, and illuminations. RAF-DB [11] contains thousands of samples with basic and compound expressions; in this paper, we focus on basic emotions. FERPlus [10] is an extension of FER2013 [27], where each image is relabeled by 10 annotators. AffectNet [12] contains over 0.4 million labeled images covering eight emotions; we evaluate on 500 samples per emotion.

Visual Sentiment Analysis In contrast to facial emotion recognition, visual sentiment analysis aims to identify the emotions evoked by images, with no requirement for the images to be human-centric. Twitter I [13] and Twitter II [14] are collected from social websites, with each sample annotated by workers on Amazon Mechanical Turk. ArtPhoto [15] consists of artistic photographs from a photo-sharing site, while Abstract [15] contains peer-rated abstract paintings. Although both Abstract [15] and ArtPhoto [15] are annotated into eight categories, for a fair comparison with previous works, we remap these labels into positive and negative sentiments.

Micro-expression Recognition Unlike conventional facial emotion recognition, micro-expressions have short duration, low intensity, and occur with sparse facial action units [28]. Earlier works showed that micro-expressions can be recognized from a single apex frame [29]. Therefore, we evaluate GPT-4V on micro-expression recognition using the apex frame. For this task, we use the following datasets. CASME [16] contains 195 samples in 8 categories; our assessment focuses on four major labels: tense, disgust, repression, and surprise. CASME II [17] contains 247 samples collected from 26 subjects, and we focus on five major labels. SAMM [18] consists of 159 samples, and we only evaluate labels that contain more than 10 samples.

Dynamic Facial Emotion Recognition Facial emotion recognition typically takes static images as input, whereas dynamic facial emotion recognition extends the analysis to image sequences or videos. Therefore, this task requires further utilization of temporal information. Consistent with previous works, our evaluation metrics are unweighted average recall (UAR) and weighted average recall (WAR); a metric sketch is given at the end of this section. We report results on four benchmark datasets. Among them, DFEW [19] consists of 11,697 samples. Since some works perform experiments on fold 1 (fd1) [30], we only report results on fd1 to reduce the evaluation cost.

Multimodal Emotion Recognition Emotions can be expressed through various modalities, and a comprehensive understanding of emotions requires the integration of information from different sources. Among the evaluation datasets, CH-SIMS [24] and CMU-MOSI [25] provide sentiment intensity scores for each sample. This paper focuses on the negative/positive classification task: positive and negative classes are assigned to samples with scores > 0 and < 0, respectively. MER2023 [23] consists of three subsets: MER-MULTI, MER-NOISE, and MER-SEMI. Our evaluation specifically focuses on the discrete emotion recognition task within the MER-MULTI subset.
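As referenced in the dynamic facial emotion recognition paragraph above, the sketch below shows how the two reported metrics (WAR and UAR) and the sentiment-score binarization could be computed with scikit-learn. This is an illustrative implementation under our reading of the protocol, not the paper's released evaluation code.

```python
from sklearn.metrics import accuracy_score, recall_score

def war_uar(y_true, y_pred):
    """WAR = overall accuracy; UAR = recall averaged over classes (macro recall)."""
    war = accuracy_score(y_true, y_pred)
    uar = recall_score(y_true, y_pred, average="macro")
    return war, uar

def binarize_sentiment(score):
    """Map a CH-SIMS / CMU-MOSI sentiment intensity score to the binary task.

    Scores > 0 -> positive, scores < 0 -> negative; zero scores fall outside the
    negative/positive task and are returned as None here (an assumption).
    """
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None

# Example:
# war, uar = war_uar(["happy", "sad", "sad"], ["happy", "sad", "happy"])
```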
Table 3: Performance comparison on facial emotion recognition. We also report the supervised performance of previous works. The row with ∆ denotes the improvement or reduction of GPT-4V compared to the supervised systems.

CK+:       Cross-VAE [31] 94.96, IA-gen [34] 96.57, ADFL [36] 98.17, IDFERM [39] 98.35, TER-GAN [42] 98.47, IPD-FER [26] 98.65, GPT-4V 69.72, ∆ ↓28.93
SFEW 2.0:  IACNN [32] 50.98, IL-CNN [35] 52.52, RAN-ResNet18 [37] 54.19, FN2EN [40] 55.15, Covariance Pooling [43] 58.14, IPD-FER [26] 58.43, GPT-4V 57.24, ∆ ↓1.19
RAF-DB:    TDGAN [33] 81.91, Cross-VAE [31] 84.81, IPA2LT [38] 86.77, SCN [41] 88.14, IF-GAN [44] 88.33, IPD-FER [26] 88.89, GPT-4V 75.81, ∆ ↓13.08
FERPlus:   PLD [10] 85.10, ResNet+VGG [45] 87.40, SCN [41] 88.01, IPD-FER [26] 88.42, GPT-4V 64.25, ∆ ↓24.17
AffectNet: IPA2LT [38] 56.51, IPFR [46] 57.40, gACNN [47] 58.78, IPD-FER [26] 62.23, GPT-4V 42.77, ∆ ↓19.46

Table 4: Performance comparison on visual sentiment analysis.

Methods             Abstract  ArtPhoto  Twitter II  Twitter I_5  Twitter I_4  Twitter I_3
SentiBank [14]      64.95     67.74     71.32       68.28        66.63        65.93
PAEF [48]           70.05     67.85     72.90       69.61        67.92        77.51
DeepSentiBank [49]  71.19     68.73     76.35       70.15        71.25        70.23
PCNN [13]           70.84     70.96     82.54       76.52        76.36        77.68
VGGNet [50]         68.86     67.61     83.44       78.67        75.49        71.79
AR (Concat) [51]    76.03     74.80     88.65       85.10        81.06        80.48
GPT-4V              71.81     80.40     97.81       94.63        90.71        87.95
∆                   ↓4.22     ↑5.60     ↑9.16       ↑9.53        ↑9.65        ↑7.47

4 Results and Discussion

Main Results In Tables 3∼7, we report the supervised performance of existing methods and the zero-shot results obtained by GPT-4V. For facial emotion recognition (see Table 3), GPT-4V achieves performance comparable to the supervised methods on SFEW 2.0. Although performance gaps remain on the other datasets, it is worth highlighting that GPT-4V significantly outperforms random guessing. For visual sentiment analysis (see Table 4), GPT-4V outperforms supervised systems, indicating strong capabilities in understanding emotions evoked by visual content. However, GPT-4V performs poorly on micro-expression recognition (see Table 5), which indicates that GPT-4V is currently tailored to general domains and is not suitable for areas that require specialized knowledge. Tables 6∼7 show the gap between GPT-4V and supervised systems on video understanding. It is worth noting that since only three frames are sampled from each video, some key frames may be missed, limiting performance. Notably, GPT-4V shows good results on CMU-MOSI, but a gap remains on CH-SIMS and MER2023. The reason is that CMU-MOSI mainly emphasizes the lexical modality [77], which aligns well with GPT-4V's strengths in lexical understanding.

Robustness to Color Space In Table 3, GPT-4V performs worse on CK+ and FERPlus. As both datasets consist of grayscale images, a natural hypothesis emerges: does GPT-4V perform worse when confronted with grayscale images? To explore this possibility, we convert all RGB images in RAF-DB to grayscale and report the results in Table 8. Interestingly, GPT-4V exhibits very similar performance across the two color spaces, suggesting that it is inherently robust to changes of color space. The performance gap on CK+ and FERPlus may instead be attributed to inconsistencies between these datasets and the training data used by GPT-4V for facial emotion recognition.
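For the color-space experiment discussed above (Table 8), the grayscale versions of the RAF-DB images could be produced as in the sketch below. It uses Pillow and reflects our assumption about the preprocessing, not the released script; re-expanding to three channels is optional and only keeps the input format uniform.

```python
from pathlib import Path
from PIL import Image

def to_grayscale(src_dir: str, dst_dir: str) -> None:
    """Convert every RGB image in src_dir to grayscale (kept as 3 channels for the API)."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img_path in Path(src_dir).glob("*.jpg"):
        gray = Image.open(img_path).convert("L")        # drop color information
        gray.convert("RGB").save(out / img_path.name)   # re-expand to 3 channels

# Example: to_grayscale("raf_db/test", "raf_db/test_gray")
```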
Table 5: Performance comparison on micro-expression recognition.

Methods       CASME   CASME II  SAMM
LBP-SIP [52]  36.84   46.56     36.76
STRCN-A [53]  40.93   45.26     32.85
STRCN-G [53]  59.65   63.37     53.48
VGGMag [54]   60.23   63.21     36.00
LGCcon [29]   60.82   65.02     40.90
TSCNN [55]    73.88   80.97     71.76
GPT-4V        36.93   14.64     17.04
∆             ↓36.95  ↓66.33    ↓54.72

Table 6: Performance comparison on dynamic facial emotion recognition (WAR/UAR). A dash denotes an unreported UAR.

DFEW:        C3D [56] 56.37/45.92, R3D18 [58] 53.27/43.34, VGG11-LSTM [19] 56.53/45.86, ResNet18-LSTM [19] 57.38/46.11, ResNet50-LSTM [30] 63.32/51.08, DPCNet [30] 65.78/55.39, GPT-4V 43.80/36.96, ∆ ↓21.98/↓18.43
FERV39k:     Former-DFER [57] 46.85/37.20, STT [59] 48.11/37.76, NR-DFERNet [60] 45.97/33.99, IAL [61] 48.54/35.82, M3DFEL [62] 47.67/35.94, MAE-DFER [63] 52.07/43.12, GPT-4V 34.29/24.92, ∆ ↓17.78/↓18.20
RAVDESS:     VO-LSTM [64] 60.50/-, AV-LSTM [64] 65.80/-, MCBP [67] 71.32/-, MSAF [67] 74.86/-, CFN-SR [70] 75.76/-, MAE-DFER [63] 75.56/75.91, GPT-4V 34.31/36.98, ∆ ↓41.25/↓38.93
eNTERFACE05: 3DCNN [65] 41.05/-, STA-FER [66] 42.98/-, EC-LSTM [68] 49.26/-, FAN [69] 51.44/-, Graph-Tran [71] 54.62/-, MAE-DFER [63] 61.64/61.67, GPT-4V 33.28/33.11, ∆ ↓28.36/↓28.56

Table 7: Performance comparison on multimodal emotion recognition.

Methods    CMU-MOSI  CH-SIMS  MER2023
MFM [72]   78.34     87.14    70.48
MISA [73]  79.08     89.82    82.42
MFN [74]   77.78     87.00    77.40
MMIM [75]  79.38     88.68    81.01
TFN [25]   79.63     90.56    82.52
MulT [76]  81.41     91.07    82.94
GPT-4V     80.43     81.24    63.50
∆          ↓0.98     ↓9.83    ↓19.44

Temporal Understanding Ability To reduce evaluation costs, we uniformly sample three frames from each video (a frame-sampling sketch is given at the end of this section). Here we further investigate the influence of the number of sampled frames. As shown in Table 9, reducing the number of sampled frames from 3 to 2 causes a significant degradation in performance. This highlights the importance of increasing the number of sampled frames in future work. At the same time, the improved performance with more sampled frames demonstrates GPT-4V's ability to leverage temporal information within image sequences.

Multimodal Understanding Ability This section evaluates the multimodal understanding capabilities of GPT-4V. Table 10 reports the unimodal and multimodal results on three benchmark datasets. Notably, for CH-SIMS and MER2023, multimodal results outperform unimodal results, demonstrating GPT-4V's ability to integrate and leverage multimodal information. For CMU-MOSI, however, we observe a slight performance degradation with multimodal input. This dataset primarily relies on lexical information [77], and incorporating visual information may confuse GPT-4V's emotion understanding.

Table 8: Robustness to color space.

Color space  RAF-DB
Grey         74.28
RGB          75.81

Table 9: Temporal understanding ability (WAR/UAR).

# Sampled Frames  RAVDESS      eNTERFACE05
2                 22.29/25.33  23.78/24.06
3                 34.31/36.98  33.28/33.11

Table 10: Multimodal understanding ability of GPT-4V.

Modality  CMU-MOSI  CH-SIMS  MER2023
Text      82.32     70.07    34.06
Video     51.17     76.13    45.50
Multi     80.43     81.24    63.50
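As referenced in the temporal-understanding paragraph above, the sketch below uniformly samples N frames from a video with OpenCV. It reflects our reading of the sampling protocol (uniform spacing, 3 frames by default) rather than the exact released implementation.

```python
import cv2

def sample_frames(video_path: str, num_frames: int = 3):
    """Uniformly sample num_frames frames from a video (assumed sampling scheme)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        return []
    # evenly spaced indices from the first to the last frame
    step = (total - 1) / max(num_frames - 1, 1)
    indices = [round(i * step) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)  # BGR numpy array, ready to be encoded as JPEG
    cap.release()
    return frames

# Example: frames = sample_frames("clip.mp4", num_frames=3)
```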
5 Conclusion

This paper provides an evaluation of GPT-4V's performance in multimodal emotion understanding across five distinct tasks. We observe that GPT-4V has strong capabilities in understanding emotions from visual content, even surpassing supervised systems on some tasks. However, it performs poorly on micro-expression recognition, which requires specialized domain knowledge. Additionally, this paper demonstrates GPT-4V's temporal and multimodal understanding capabilities and its robustness to different color spaces. Our work can also serve as a zero-shot benchmark for subsequent research.

Due to the high cost of the GPT-4V API, this paper uniformly samples 3 frames per video. Future work will explore performance at higher sampling rates. Meanwhile, we will incorporate more emotion-related tasks and datasets to offer a more comprehensive evaluation of GPT-4V.

References

[1] Yunlong Liang, Fandong Meng, Ying Zhang, Yufeng Chen, Jinan Xu, and Jie Zhou. Emotional conversation generation with heterogeneous graph neural network. Artificial Intelligence, 308:103714, 2022.
[2] Liqiang Nie, Wenjie Wang, Richang Hong, Meng Wang, and Qi Tian. Multimodal dialog system: Generating responses via adaptive decoders. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1098–1106, 2019.
[3] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 9:1, 2023.
[4] Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, et al. Mm-vid: Advancing video understanding with gpt-4v (ision). arXiv preprint arXiv:2310.19773, 2023.
[5] Peilin Zhou, Meng Cao, You-Liang Huang, Qichen Ye, Peiyan Zhang, Junling Liu, Yueqi Xie, Yining Hua, and Jaeboum Kim. Exploring recommendation capabilities of gpt-4v (ision): A preliminary case study. arXiv preprint arXiv:2311.04199, 2023.
[6] Hanjia Lyu, Jinfa Huang, Daoan Zhang, Yongsheng Yu, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, and Jiebo Luo. Gpt-4v (ision) as a social media analysis engine. arXiv preprint arXiv:2311.07547, 2023.
[7] Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, and Jingdong Wang. Gpt4vis: What can gpt-4 do for zero-shot visual recognition? arXiv preprint arXiv:2311.15732, 2023.
[8] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 94–101. IEEE, 2010.
[9] Abhinav Dhall, OV Ramana Murthy, Roland Goecke, Jyoti Joshi, and Tom Gedeon. Video and image based emotion recognition challenges in the wild: Emotiw 2015. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 423–426, 2015.
[10] Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang. Training deep networks for facial expression recognition with crowd-sourced label distribution. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 279–283, 2016.
[11] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2852–2861, 2017.
[12] Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18–31, 2017.
[13] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang.
Robust image sentiment analysis using progressively trained and domain transferred deep networks. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 381–388, 2015. [14] Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM international conference on Multimedia, pages 223–232, 2013. [15] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Building a large scale dataset for image emotion recognition: the fine print and the benchmark. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 308–314, 2016. 7 [16] Wen-Jing Yan, Qi Wu, Yong-Jin Liu, Su-Jing Wang, and Xiaolan Fu. Casme database: A dataset of spontaneous micro-expressions collected from neutralized faces. In 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG), pages 1–7. IEEE, 2013. [17] Wen-Jing Yan, Xiaobai Li, Su-Jing Wang, Guoying Zhao, Yong-Jin Liu, Yu-Hsin Chen, and Xiaolan Fu. Casme ii: An improved spontaneous micro-expression database and the baseline evaluation. PLoS ONE, 9(1):e86041–e86041, 2014. [18] Adrian K Davison, Cliff Lansley, Nicholas Costen, Kevin Tan, and Moi Hoon Yap. Samm: A spontaneous micro-facial movement dataset. IEEE transactions on affective computing, 9(1):116–129, 2016. [19] Xingxun Jiang, Yuan Zong, Wenming Zheng, Chuangao Tang, Wanchuang Xia, Cheng Lu, and Jiateng Liu. Dfew: A large-scale database for recognizing dynamic facial expressions in the wild. In Proceedings of the 28th ACM international conference on multimedia, pages 2881–2889, 2020. [20] Yan Wang, Yixuan Sun, Yiwen Huang, Zhongying Liu, Shuyong Gao, Wei Zhang, Weifeng Ge, and Wenqiang Zhang. Ferv39k: A large-scale multi-scene dataset for facial expression recognition in videos. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20922–20931, 2022. [21] Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english. PloS one, 13(5):e0196391, 2018. [22] Olivier Martin, Irene Kotsia, Benoit Macq, and Ioannis Pitas. The enterface’05 audio-visual emotion database. In Proceedings of the 22nd International Conference on Data Engineering Workshops, pages 8–8. IEEE, 2006. [23] Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mngyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, et al. Mer 2023: Multi-label learning, modality robustness, and semisupervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9610–9614, 2023. [24] Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 3718–3727, 2020. [25] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1103–1114, 2017. [26] Jing Jiang and Weihong Deng. Disentangling identity and pose for facial expression recognition. IEEE Transactions on Affective Computing, 13(4):1868–1878, 2022. 
[27] Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. Challenges in representation learning: A report on three machine learning contests. In Proceedings of the 20th International Conference on Neural Information Processing, pages 117–124, 2013. [28] Paul Ekman. Lie catching and microexpressions. The philosophy of deception, 1(2):5, 2009. [29] Yante Li, Xiaohua Huang, and Guoying Zhao. Joint local and global information learning with single apex frame detection for micro-expression recognition. IEEE Transactions on Image Processing, 30:249–263, 2020. [30] Yan Wang, Yixuan Sun, Wei Song, Shuyong Gao, Yiwen Huang, Zhaoyu Chen, Weifeng Ge, and Wenqiang Zhang. Dpcnet: Dual path multi-excitation collaborative network for facial expression representation learning in videos. In Proceedings of the 30th ACM International Conference on Multimedia, pages 101–110, 2022. 8 [31] Haozhe Wu, Jia Jia, Lingxi Xie, Guojun Qi, Yuanchun Shi, and Qi Tian. Cross-vae: Towards disentangling expression from identity for human faces. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4087–4091. IEEE, 2020. [32] Zibo Meng, Ping Liu, Jie Cai, Shizhong Han, and Yan Tong. Identity-aware convolutional neural network for facial expression recognition. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 558–565. IEEE, 2017. [33] Siyue Xie, Haifeng Hu, and Yizhen Chen. Facial expression recognition with two-branch disentangled generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology, 31(6):2359–2371, 2020. [34] Huiyuan Yang, Zheng Zhang, and Lijun Yin. Identity-adaptive facial expression recognition through expression regeneration using conditional generative adversarial networks. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 294–301. IEEE, 2018. [35] Jie Cai, Zibo Meng, Ahmed Shehab Khan, Zhiyuan Li, James O’Reilly, and Yan Tong. Island loss for learning discriminative features in facial expression recognition. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 302– 309. IEEE, 2018. [36] Mengchao Bai, Weicheng Xie, and Linlin Shen. Disentangled feature based adversarial learning for facial expression recognition. In 2019 IEEE International Conference on Image Processing (ICIP), pages 31–35. IEEE, 2019. [37] Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng, and Yu Qiao. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing, 29:4057–4069, 2020. [38] Jiabei Zeng, Shiguang Shan, and Xilin Chen. Facial expression recognition with inconsistently annotated datasets. In Proceedings of the European conference on computer vision (ECCV), pages 222–237, 2018. [39] Xiaofeng Liu, BVK Vijaya Kumar, Ping Jia, and Jane You. Hard negative generation for identity-disentangled facial expression recognition. Pattern Recognition, 88:1–12, 2019. [40] Hui Ding, Shaohua Kevin Zhou, and Rama Chellappa. Facenet2expnet: Regularizing a deep face recognition net for expression recognition. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 118–126. IEEE, 2017. [41] Kai Wang, Xiaojiang Peng, Jianfei Yang, Shijian Lu, and Yu Qiao. 
Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6897–6906, 2020. [42] Kamran Ali and Charles E Hughes. Facial expression recognition by using a disentangled identity-invariant expression representation. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 9460–9467. IEEE, 2021. [43] Dinesh Acharya, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool. Covariance pooling for facial expression recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 367–374, 2018. [44] Jie Cai, Zibo Meng, Ahmed Shehab Khan, James O’Reilly, Zhiyuan Li, Shizhong Han, and Yan Tong. Identity-free facial expression recognition using conditional generative adversarial network. In 2021 IEEE International Conference on Image Processing (ICIP), pages 1344– 1348. IEEE, 2021. [45] Christina Huang. Combining convolutional neural networks for emotion recognition. In 2017 IEEE MIT undergraduate research technology conference (URTC), pages 1–4. IEEE, 2017. 9 [46] Can Wang, Shangfei Wang, and Guang Liang. Identity-and pose-robust facial expression recognition through adversarial feature learning. In Proceedings of the 27th ACM international conference on multimedia, pages 238–246, 2019. [47] Yong Li, Jiabei Zeng, Shiguang Shan, and Xilin Chen. Occlusion aware facial expression recognition using cnn with attention mechanism. IEEE Transactions on Image Processing, 28(5):2439–2450, 2018. [48] Sicheng Zhao, Yue Gao, Xiaolei Jiang, Hongxun Yao, Tat-Seng Chua, and Xiaoshuai Sun. Exploring principles-of-art features for image emotion recognition. In Proceedings of the 22nd ACM international conference on Multimedia, pages 47–56, 2014. [49] Tao Chen, Damian Borth, Trevor Darrell, and Shih-Fu Chang. Deepsentibank: Visual sentiment concept classification with deep convolutional neural networks. arXiv preprint arXiv:1410.8586, 2014. [50] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [51] Jufeng Yang, Dongyu She, Ming Sun, Ming-Ming Cheng, Paul L Rosin, and Liang Wang. Visual sentiment prediction based on automatic discovery of affective regions. IEEE Transactions on Multimedia, 20(9):2513–2525, 2018. [52] Yandan Wang, John See, Raphael C-W Phan, and Yee-Hui Oh. Lbp with six intersection points: Reducing redundant information in lbp-top for micro-expression recognition. In Computer Vision–ACCV 2014: 12th Asian Conference on Computer Vision, Singapore, Singapore, November 1-5, 2014, Revised Selected Papers, Part I 12, pages 525–537. Springer, 2015. [53] Zhaoqiang Xia, Xiaopeng Hong, Xingyu Gao, Xiaoyi Feng, and Guoying Zhao. Spatiotemporal recurrent convolutional networks for recognizing spontaneous micro-expressions. IEEE Transactions on Multimedia, 22(3):626–640, 2019. [54] Yante Li, Xiaohua Huang, and Guoying Zhao. Can micro-expression be recognized based on single apex frame? In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 3094–3098. IEEE, 2018. [55] Baolin Song, Ke Li, Yuan Zong, Jie Zhu, Wenming Zheng, Jingang Shi, and Li Zhao. Recognizing spontaneous micro-expression using a three-stream convolutional neural network. IEEE Access, 7:184537–184551, 2019. [56] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. 
In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015. [57] Zengqun Zhao and Qingshan Liu. Former-dfer: Dynamic facial expression recognition transformer. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1553– 1561, 2021. [58] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018. [59] Fuyan Ma, Bin Sun, and Shutao Li. Spatio-temporal transformer for dynamic facial expression recognition in the wild. arXiv preprint arXiv:2205.04749, 2022. [60] Hanting Li, Mingzhe Sui, Zhaoqing Zhu, et al. Nr-dfernet: Noise-robust network for dynamic facial expression recognition. arXiv preprint arXiv:2206.04975, 2022. [61] Hanting Li, Hongjing Niu, Zhaoqing Zhu, and Feng Zhao. Intensity-aware loss for dynamic facial expression recognition in the wild. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 67–75, 2023. 10 [62] Hanyang Wang, Bo Li, Shuang Wu, Siyuan Shen, Feng Liu, Shouhong Ding, and Aimin Zhou. Rethinking the learning paradigm for dynamic facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17958–17968, 2023. [63] Licai Sun, Zheng Lian, Bin Liu, and Jianhua Tao. Mae-dfer: Efficient masked autoencoder for self-supervised dynamic facial expression recognition. In Proceedings of the 31st ACM International Conference on Multimedia, pages 6110–6121, 2023. [64] Esam Ghaleb, Mirela Popa, and Stylianos Asteriadis. Multimodal and temporal perception of audio-visual cues for emotion recognition. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 552–558. IEEE, 2019. [65] Young-Hyen Byeon and Keun-Chang Kwak. Facial expression recognition using 3d convolutional neural network. International journal of advanced computer science and applications, 5(12), 2014. [66] Xianzhang Pan, Guoliang Ying, Guodong Chen, Hongming Li, and Wenshu Li. A deep spatial and temporal aggregation framework for video-based facial expression recognition. IEEE Access, 7:48807–48815, 2019. [67] Lang Su, Chuqing Hu, Guofa Li, and Dongpu Cao. Msaf: Multimodal split attention fusion. arXiv preprint arXiv:2012.07175, 2020. [68] Ryo Miyoshi, Noriko Nagata, and Manabu Hashimoto. Enhanced convolutional lstm with spatial and temporal skip connections and temporal gates for facial expression recognition from video. Neural Computing and Applications, 33:7381–7392, 2021. [69] Debin Meng, Xiaojiang Peng, Kai Wang, and Yu Qiao. Frame attention networks for facial expression recognition in videos. In 2019 IEEE international conference on image processing (ICIP), pages 3866–3870. IEEE, 2019. [70] Ziwang Fu, Feng Liu, Hanyang Wang, Jiayin Qi, Xiangling Fu, Aimin Zhou, and Zhibin Li. A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition. arXiv preprint arXiv:2111.02172, 2021. [71] Rui Zhao, Tianshan Liu, Zixun Huang, Daniel PK Lun, and Kin-Man Lam. Spatial-temporal graphs plus transformers for geometry-guided facial expression recognition. IEEE Transactions on Affective Computing, 2022. [72] Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. Learning factorized multimodal representations. 
In Proceedings of the 7th International Conference on Learning Representations, pages 1–20, 2019.
[73] Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. Misa: Modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1122–1131, 2020.
[74] Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Memory fusion network for multi-view sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5634–5641, 2018.
[75] Wei Han, Hui Chen, and Soujanya Poria. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9180–9192, 2021.
[76] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 6558–6569, 2019.
[77] Zheng Lian, Bin Liu, and Jianhua Tao. Smin: Semi-supervised multi-modal interaction network for conversational emotion recognition. IEEE Transactions on Affective Computing, 2022.

A Prompt Design

In this section, we design prompts for the different tasks. Please replace "{Candidate labels}" in the following prompts with the labels of each dataset.

Facial Emotion Recognition
Please play the role of a facial expression classification expert. We provide 20 images. Please ignore the speaker's identity and focus on the facial expression. For each image, please sort the provided categories from high to low according to the similarity with the input. Here are the optional categories: {Candidate labels}. The output format should be {'name':, 'result':} for each image.

Visual Sentiment Analysis
Please play the role of an emotion recognition expert. We provide 20 images. Please recognize the sentiments evoked by these images (i.e., guess how a viewer might emotionally feel after seeing these images). For each image, please sort the provided categories from high to low according to the similarity with the input. Here are the optional categories: {Candidate labels}. If there is a person in the image, ignore that person's identity. The output format should be {'name':, 'result':} for each image.

Micro-expression Recognition
Please play the role of a micro-expression recognition expert. We provide 20 images. For each image, please sort the provided categories from high to low according to the similarity with the input. The expression may not be obvious, please pay attention to the details of the face. Here are the optional categories: {Candidate labels}. Please ignore the speaker's identity and focus on the facial expression. The output format should be {'name':, 'result':} for each image.

Dynamic Facial Emotion Recognition
Please play the role of a video expression classification expert. We provide 6 videos, each with 3 temporally uniformly sampled frames. For each video, please sort the provided categories from high to low according to the similarity with the input. Here are the optional categories: {Candidate labels}. Please ignore the speaker's identity and focus on the facial expression. The output format should be {'name':, 'result':} for each video.

Multimodal Emotion Recognition
Please play the role of a video expression classification expert.
We provide 6 videos, each with the speaker's content and 3 temporally uniformly sampled frames. For each video, please sort the provided categories from high to low according to the similarity with the input. Here are the optional categories: {Candidate labels}. Please ignore the speaker's identity and focus on their emotions. The output format should be {'name':, 'result':} for each video.
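Since every prompt above asks for one {'name':, 'result':} entry per sample, a small parser is needed to turn the reply into per-sample top-1 predictions and to catch batches with missing predictions (cf. Section 2). The sketch below is one possible implementation; the released repository may parse the output differently.

```python
import ast
import re

def parse_batch_reply(reply_text: str, expected_count: int):
    """Extract {'name': ..., 'result': ...} entries from a GPT-4V reply.

    Returns a dict mapping sample name -> top-1 label, or None when the number of
    parsed entries does not match the batch size (such batches can be re-queried).
    """
    predictions = {}
    for match in re.finditer(r"\{[^{}]*\}", reply_text):
        try:
            entry = ast.literal_eval(match.group())
        except (ValueError, SyntaxError):
            continue  # skip fragments that are not valid Python literals
        if isinstance(entry, dict) and "name" in entry and "result" in entry:
            ranked = entry["result"]
            predictions[entry["name"]] = ranked[0] if isinstance(ranked, list) else ranked
    if len(predictions) != expected_count:
        return None
    return predictions

# Example:
# preds = parse_batch_reply("{'name': 'img_001.jpg', 'result': ['happy', 'neutral']}", 1)
```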