MSEVA: A System for Multimodal Short Videos Emotion Visual Analysis

Qinglan Wei, Member, IEEE, Yaqi Zhou, Yuan Zhang*

The authors are with the Communication University of China, Beijing 100000, China (e-mail: qlwei@cuc.edu.cn; yqzhou@cuc.edu.cn; yzhang@cuc.edu.cn). *: Corresponding author.

Abstract—YouTube Shorts, a section launched by YouTube in 2021, competes directly with short video platforms such as TikTok and reflects the rising demand for short video content among online users. Social media platforms are flooded with short videos that capture different perspectives and emotions on hot events; such videos can go viral and significantly affect the public's mood and views. However, affective computing for short videos has been a neglected area of research, and monitoring public emotions through these videos manually requires time and effort that may not suffice to prevent undesirable outcomes. In this paper, we create the first multimodal dataset of short video news covering hot events and propose an automatic technique for audio segmentation and transcription. In addition, we improve the accuracy of a multimodal affective computing model by about 4.17% by optimizing its text modality. Moreover, we propose MSEVA, a novel system for emotion analysis of short videos. Achieving good results on the bili-news dataset, the MSEVA system applies multimodal emotion analysis in the real world, which helps conduct timely public opinion guidance and stop the spread of negative emotions. Data and code from our investigations can be accessed at: http://xxx.github.com.

Index Terms—Multimodal data, emotion analysis, short videos, social media.

I. INTRODUCTION

Short video has become a widely produced and disseminated multimodal media format owing to its convenience and accessibility. With the rise of mobile internet technology, a variety of short video platforms has opened a short video era for the audience. The length of a short video is usually measured in seconds. It is a new video form played on network platforms for people to watch, browse, and share at any time. It reaches the audience through mobile internet technology, with entertainment, fashion, and opinions about current events as its main content, so as to gain the audience's attention [1]. Nowadays, short video is one of the most important media formats for the dissemination of hot events and topics.

The main characteristics of short videos are as follows. First, short videos contain rich modal content such as video, audio, and text, with each modality being crucial for emotion analysis. Therefore, this paper employs multimodal emotion analysis, combining the video, audio, and textual data of short videos. Second, short videos spread quickly. Due to the rich social interaction functions of short video platforms, users can comment on short videos, forward them, and even create short videos inspired by their favorite content. Consequently, analyzing the inherent emotion of popular short videos related to hot events can help us comprehend public attitudes and anticipate the direction of public opinion. Third, short videos typically have simple but strong emotions.
Because short videos are transmitted in a fragmented way, they abandon the form and logic of traditional videos. Instead, short videos are created to deliver a strong emotional impact on the audience within a short duration, in order to gain likes and comments. Thus, compared with traditional videos, multimodal emotion analysis on short videos usually yields more accurate results, which effectively harnesses the potential of short videos as a burgeoning multimedia resource.

In our analysis of hot events on short video platforms, we observed that the emotions of short videos posted by state media exert significant influence on we media and the public. The analysis for different platforms and hot events is shown in Figure 1. Taking the Chinese short video platform Bilibili as an example, in response to Japan's decision to discharge nuclear wastewater from the Fukushima nuclear power plant into the Pacific Ocean, CCTV News posted a short video titled "Associate with Evil Elements", which condemned Japan's release of nuclear wastewater in an angry tone. Subsequently, a Chinese we media account posted another short video titled "The Discharge of Nuclear Wastewater Has Not Been Discussed Yet", urging the Chinese public to recognize the dangers of Japan's nuclear wastewater discharge. Both short videos have high view counts. On short video platforms, such emotional interpretations are popular with the audience and keep up with the trend of the times [1]. Short videos of this kind often trigger widespread emotional resonance due to the strong interactivity of internet platforms. Therefore, emotion analysis on short videos is a significant research focus. The MSEVA system that we designed can monitor the latent emotions of short videos on these platforms and support timely guidance of public opinion.

Fig. 1. Examples of emotional influence from state media on we media.

The key contributions of this work are as follows:
1) A multimodal short video dataset named bili-news is constructed, with overall annotation of short video emotions. An automatic audio segmentation and transcription method is proposed to improve efficiency throughout the dataset construction process, and the overall annotation enhances emotion recognition. The dataset is openly accessible.
2) We improve accuracy by approximately 4.17% by optimizing the text modality of the V2EM [2] multimodal emotion analysis model. Additionally, we conduct experiments comparing state-of-the-art small and large language models.
3) We propose MSEVA, a novel emotion analysis system designed for short videos that addresses the research gap discussed in Sections II-B and II-C. The system achieves end-to-end emotion analysis for short videos and provides visualized results, including the emotion of the comprehensive analysis, the emotions of individual modalities, and a temporal analysis. The system is open-source.

II. RELATED WORK

A. Datasets of Multimodal Emotion Analysis

Among current datasets, the dialogs of the IEMOCAP [3] dataset were manually segmented at the dialog turn level, and professional transcriptions were obtained from Ubiqus. All videos in CMU-MOSEAS [4] have manual and punctuated transcriptions.
Punctuation markers are used to separate sentences, similar to CMU-MOSEI [5]. In the process of dataset construction, we found that current methods mainly rely on manual segmentation and transcription, as shown in Table I. Although this approach ensures dataset accuracy and richness, it has limitations: the manual segmentation and transcription processes require substantial human effort and time, hindering updates of the dataset. Besides, utterance-level segmentation and annotation may overlook the overall emotion of a short video, which is a focal point of our study.

TABLE I
SEGMENTATION AND TRANSCRIPTION METHODS OF CURRENT MULTIMODAL DATASETS

Dataset           Manual Segmentation   Automatic Segmentation   Manual Transcription   Automatic Transcription
IEMOCAP [3]       Yes                   No                       Yes                    No
CMU-MOSI [6]      Yes                   No                       Yes                    No
CMU-MOSEI [5]     Yes                   No                       Yes                    No
UR-FUNNY [7]      Yes                   No                       Yes                    No
CH-SIMS [8]       Yes                   No                       Yes                    No
CMU-MOSEAS [4]    Yes                   No                       Yes                    No

The current datasets for multimodal emotion analysis are composed of segments of long videos. However, the short video is an independent and complete media form that differs from segments of long videos, so we need to focus on the short videos on social media platforms. Current dataset construction depends on manual segmentation and annotation, which requires a lot of human effort, and most datasets consist of long videos with emotional annotations for each utterance. Hence, we need to build a dataset of short videos with emotional annotations for the whole video.

B. General Multimodal Analysis of Short Videos

Early research mainly focused on short videos on the Vine platform. In 2014, Redi et al. [9] proposed a set of computational features, such as audio and visual features, that they mapped to the components of creativity, together with a supervised approach to automatically detect creative videos. In 2016, Zhang et al. [10] proposed a tree-guided multi-task multimodal learning model to estimate the venue category of each unseen micro-video. In the same year, Chen et al. [11] proposed the TMALL model for popularity prediction, which was the earliest prediction analysis of short video popularity. In 2020, for TikTok and MovieLens, two micro-video recommendation datasets, Tao et al. [12] developed a new method, MGAT, which incorporates an attention mechanism into a graph neural network framework to disentangle user preferences over different modalities. In 2023, Qi et al. [13] constructed FakeSV, the largest short video dataset on fake news based on Douyin and Kuaishou, and provided a new multimodal detection model, SV-FEND, which exploits cross-modal correlations to select the most informative features and utilizes social context information for detection.

The analysis of multimodal data of short videos is a hot research topic, involving areas such as popularity prediction, location classification, video recommendation, and fake video detection. However, current studies focus only on objective features and their relation to user behavior, neglecting the intrinsic emotion.

C. General Multimodal Emotion Analysis of Videos

Early research on video emotion analysis was not conducted in the wild; it primarily focused on movie segments and movie review data. In 2003, Kang et al. [14] discussed a technique for detecting affective events using Hidden Markov Models (HMM) based on low-level features, including color, motion, and shot cut rate. In 2006, Wang et al. [15] combined visual and audio features with support vector machines and achieved good results. In 2013, with the rapid development of multimedia social platforms, Wollmer et al.
[16] focused on automatically analyzing a speaker's sentiment in online videos containing movie reviews. In addition to textual information, this approach considered audio features typically used in speech-based emotion recognition, as well as video features encoding the valuable valence information conveyed by the speaker. In 2018, in order to process a large number of online videos and improve the processing power of real-time emotion analysis, Tran et al. [17] proposed a real-time multimodal emotion analysis model, which leveraged the processing speed of the extreme learning machine and the graphics processing unit to overcome the limitations of standard learning algorithms and the central processing unit (CPU).

Some research takes GIFs as objects of emotion analysis, which is similar to our study; however, these GIFs consist of only a few frames, which is quite different from short videos. In 2014, Jou et al. [18] proposed the first model to predict the emotions perceived by viewers after they are shown animated GIF images. In 2019, Yang et al. [19] proposed the KAVAN network, which consists of a facial attention module and a hierarchical segment temporal module, to conduct human-centered GIF emotion recognition.

As for end-to-end video emotion analysis methods, there are still few relevant studies. Most existing works on multimodal emotion analysis adopt a two-phase pipeline, first extracting feature representations for each single modality and then performing end-to-end learning with the extracted features. In 2020, Zhao et al. [20] proposed to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs) and developed a deep Visual-Audio Attention Network (VAANet). In 2021, Dai et al. [21] developed a fully end-to-end model, FE2E, that connects the two phases and optimizes them jointly. In 2022, Wei et al. [2] designed a fully multimodal video-to-emotion system, FV2ES, for fast yet effective recognition inference. For the visual modality, FV2ES uses RepVGG to improve the efficiency of multimodal emotion analysis; a Hierarchical-Attention Spectrum Computing Module improves accuracy for the audio modality; and a pre-trained ALBERT model is used for feature extraction and prediction for the textual modality.

Earlier research on video emotion analysis mostly concentrated on movie segments and review data, and end-to-end emotion analysis remains limited. Even though some multimodal affective computing methods exist, they are not suitable for short videos.

III. OUR WORK

A. Bili-News Dataset Construction

As discussed in Section II-A, automatic utterance-level segmentation and transcription methods have not been adopted in current multimodal emotion analysis datasets. Most existing datasets annotate emotion for utterance-segmented videos, lacking overall annotations of the emotions of entire short videos. In this section, we present the construction of the bili-news dataset, which involves two steps: (a) employing automatic segmentation and transcription methods and (b) selecting short videos and assigning overall emotion annotations. The following subsections describe this process in more detail.

1) Automatic Segmentation and Transcription Method

In this section, we propose the first automatic segmentation and transcription method and use it in the process of bili-news construction.
According to the speaker's speech rhythm, we segment the audio part of short videos and obtain the start time and end time of each sentence. We then feed the audio segments to the Whisper model [22], which transcribes the speech into English text in a consistent way. The process is shown in Figure 2. This method greatly reduces the cost of manual segmentation and transcription and enhances the efficiency of dataset construction.

Fig. 2. The process of the automatic segmentation and transcription method.

The detect-silence function of the pydub library is used to detect silence intervals in speech. According to our experiments, a threshold of 0.8 seconds was selected as the cutoff for segmenting the original audio into short segments corresponding to each sentence. Subsequently, for each short segment, the Whisper model is used for speech recognition and translation, generating the subtitle text of each sentence. The segmentation timestamps and subtitle texts are then written to files. Since the Whisper model provides pretrained weights that can be used directly, and given the universality of speech recognition and translation tasks, there is no need to fine-tune it on additional data in practical applications; this paper therefore does not train the Whisper model further. Moreover, the Whisper model supports multilingual speech recognition and translation tasks, such as Chinese→Chinese, English→English, Chinese→English, and Korean→English, which enables automatic utterance-level segmentation and transcription of audio in multiple languages.
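As an illustration of the segmentation and transcription procedure above, the following minimal sketch chains pydub's silence detection with Whisper. The 0.8-second pause threshold follows our setting; the detect_nonsilent helper (the complement of the detect-silence function mentioned above), the -40 dBFS silence level, and the file names are illustrative assumptions rather than the exact released implementation.

import whisper                              # openai-whisper package
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def segment_and_transcribe(wav_path):
    audio = AudioSegment.from_wav(wav_path)
    # Sentence boundaries: pauses of at least 0.8 s (our threshold);
    # the -40 dBFS silence level is an illustrative assumption.
    spans = detect_nonsilent(audio, min_silence_len=800, silence_thresh=-40)
    model = whisper.load_model("base")      # pretrained weights, no fine-tuning
    results = []
    for start_ms, end_ms in spans:
        audio[start_ms:end_ms].export("tmp_segment.wav", format="wav")
        # task="translate" maps any supported source language to English text
        text = model.transcribe("tmp_segment.wav", task="translate")["text"]
        results.append((start_ms / 1000.0, end_ms / 1000.0, text.strip()))
    return results   # (start s, end s, subtitle) triples, written to file downstream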
2) Selecting and Assigning Emotion Annotations

Firstly, we crawled 1820 short videos related to recent hot events from the Bilibili platform. Secondly, we designed criteria suited to our research and manually selected short videos, resulting in a set of 165 videos. Thirdly, we invited 12 crowdsourced judges to annotate the emotion of each entire short video in the bili-news dataset. We then dropped short videos with unclear emotion annotations, which are of little significance to our research, ultimately retaining 147 short videos and validating the consistency of the labels. The details of the selection process and the subjective annotation experiment are as follows.

To ensure that the short videos in the bili-news dataset fit our research, we designed the following selection criteria for the crawled videos: (a) featuring one or two main characters; (b) clear speech in a single language; (c) a duration of less than three minutes; and (d) a simple and strong emotion. Additionally, we dropped policy-related short videos to ensure the objectivity of the dataset.

For the selected short videos, we organized a subjective evaluation experiment to label their emotions. To control annotation quality, a qualification test was designed for the judges: only judges who habitually browse short videos and can clearly judge their emotions were selected. For the English short videos in the dataset, judges with good CET-4 and CET-6 scores were specially selected. Our experiment included 12 crowdsourced judges (6 men and 6 women). Each short video was randomly assigned to a group of 3 judges to annotate with negative, positive, or uncertain labels.

In order to ensure the effectiveness of the annotation, we provided training before the experiment to help judges better distinguish positive and negative emotions. The training introduced the Positive and Negative Affect Schedule (PANAS) from psychology; after learning 20 specific descriptions of positive and negative emotions, the judges annotated the positive and negative intensity of emotions in short videos. We label each short video with the majority choice among the three annotations: only when at least two annotators agree on the same emotion is the annotation considered valid. Finally, 147 short videos are retained in the dataset. To measure annotation consistency among judges, we calculated Fleiss' kappa over the labels of the 3 judges in the constructed bili-news dataset and obtained K > 0.65, indicating a considerable degree of consistency. In addition, to verify annotation quality, we selected potentially confusing short videos with disagreeing annotations and invited a new judge to annotate them; 96% of the new annotations matched the original labels. Based on the new annotations, we also calculated Cohen's kappa against the original annotations and obtained K > 0.85. This good consistency shows that the bili-news dataset is reliable.
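As a sketch of how the consistency figures above can be computed, assuming the three judges' labels are collected in a videos-by-judges matrix (the toy data and variable names below are illustrative):

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from sklearn.metrics import cohen_kappa_score

# ratings: one row per short video, one column per judge,
# values in {"negative", "positive", "uncertain"} (toy data)
ratings = np.array([
    ["positive", "positive", "uncertain"],
    ["negative", "negative", "negative"],
    ["positive", "positive", "positive"],
])

# Fleiss' kappa over the 3 judges (Section III-A-2 reports K > 0.65)
counts, _ = aggregate_raters(ratings)   # videos x categories count table
print("Fleiss' kappa:", fleiss_kappa(counts, method="fleiss"))

# Cohen's kappa between the original labels and the extra judge's re-annotation
# (the paper reports K > 0.85)
original = ["positive", "negative", "positive"]
recheck = ["positive", "negative", "positive"]
print("Cohen's kappa:", cohen_kappa_score(original, recheck))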
B. Optimizing the Multimodal Emotion Analysis Model

In this section, we propose a more effective multimodal emotion analysis model, V2EM-RoBERTa, based on the V2EM model [2], by optimizing its text modality. We investigated recent multimodal affective computing models, performed experiments with commonly used small language models, and additionally employed state-of-the-art large language models for text-modality inference, contrasting the results obtained with small and large language models. The details are as follows.

The reason for selecting the textual modality is shown in Table II, which summarizes multimodal emotion analysis models developed over the past three years. The "Modality" column shows that almost all recent models integrate the visual (V), textual (T), and acoustic (A) modalities for comprehensive analysis. The "Effect" column shows that the textual modality usually has the greatest impact, so we optimize the textual modality to maximize the performance of the model.

TABLE II
SUMMARY OF CURRENT MULTIMODAL EMOTION ANALYSIS MODELS

Year   Method          Modality   Effect    Textual Model
2021   CMCN [24]       V+T        T>V       BERT
2021   FE2E [21]       V+T+A      T>V>A     Transformer
2021   HFU-BERT [25]   V+T+A      T>A>V     BERT
2022   AMOA [26]       V+A+T      T>A>V     BERT
2022   CERS [27]       V+A+T      T>A>V     BART
2022   FV2ES [2]       V+A+T      T>V>A     ALBERT
2023   QAP [28]        V+A+T      T>V>A     ALBERT
2023   TETFN [29]      V+A+T      -         BERT

From Table II, we can see that most methods use pre-trained BERT-based models for textual feature extraction. Therefore, we explored various BERT-based models and other small language models for textual features in our experiments, as detailed in Section IV. Among them, the RoBERTa model [23] has a larger number of parameters, uses a larger batch size during training, and is trained on more data, including CC-News, so it shows superior performance in our experiments.

Recently, large language models have become very popular. Considering the partial similarity between text emotion analysis and multimodal emotion analysis, we attempted to employ large language models for textual-modality emotion analysis and then combined the results from the three modalities through linear fusion to obtain the final prediction. Considering parameter counts, we selected large language models with 200M to 500M parameters for comparative experiments; the results are shown in Section IV. They indicate that large language models do not perform as well as small language models trained on the dataset, which verifies the conclusion drawn by Zhang et al. [30] that LLMs lag behind in more complex tasks requiring deeper understanding or structured sentiment information.

Therefore, based on the open-source end-to-end V2EM model [2], we propose a more effective multimodal emotion analysis model named V2EM-RoBERTa. For the visual modality, V2EM-RoBERTa takes image frames captured at fixed intervals as input. Since short videos contain explicit subjects, facial expression is the most important cue for the emotion of a video frame; the MTCNN face detection model is therefore used to crop the face region of each frame, and the RepVGG network extracts visual features, which are encoded by a Transformer with a position-embedding layer that carries temporal information. For the acoustic modality, V2EM-RoBERTa extracts log-mel frequency features from the original audio, expands them into two-dimensional frequency feature maps, divides each map into 16 sub-graph sequences, and feeds them into a NesT structure to extract acoustic features, which are then encoded by a Transformer that models temporal information. For the text modality, we extract textual features with the pre-trained small language model RoBERTa and then use a Transformer to extract the temporal features of the text. Finally, the features of all modalities are fed into feed-forward networks to obtain per-modality predictions, and linear fusion produces the final prediction. The architecture of the multimodal emotion analysis model is shown in Figure 3.

Fig. 3. The architecture of the V2EM-RoBERTa multimodal emotion analysis model.
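To make the late-fusion design above concrete, the following sketch shows one way the RoBERTa text branch and the decision-level linear fusion could be wired up in PyTorch. The layer sizes, fusion weights, and module names are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class TextBranch(nn.Module):
    """RoBERTa features, a temporal Transformer encoder, and a classifier head."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = RobertaModel.from_pretrained("roberta-base")
        layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(768, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        hidden = self.temporal(hidden)
        return self.head(hidden.mean(dim=1))      # utterance-level logits

def linear_fusion(logits_v, logits_a, logits_t, weights=(1.0, 1.0, 1.0)):
    """Decision-level fusion: weighted sum of per-modality logits (weights assumed)."""
    wv, wa, wt = weights
    return wv * logits_v + wa * logits_a + wt * logits_t

# usage sketch for the text branch
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
batch = tokenizer(["the anchor condemns the decision"], return_tensors="pt", padding=True)
text_logits = TextBranch()(batch["input_ids"], batch["attention_mask"])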
C. The Construction of the MSEVA System

The main flow and components of the MSEVA system are shown in Figure 4. The system has three main modules: (a) the Data Format Preprocessing Module, which transforms the short video file provided by users to enable adaptable handling of short videos with various resolutions; (b) the Automatic Segmentation and Transcription Module, designed according to the method proposed in Section III-A for utterance-level segmentation and transcription; and (c) the pre-trained multimodal emotion analysis model (V2EM-RoBERTa), into which the aligned modalities after segmentation are fed to obtain the final result.

Fig. 4. The architecture of the multimodal short videos emotion visual analysis (MSEVA) system.

In our experiments, we found that some long short videos caused substantial memory occupation without any increase in accuracy. To address this problem and to enable finer emotion analysis of short videos, we developed the automatic segmentation and transcription module based on the method in Section III-A. This module generates text files containing the start and end timestamps of each sentence along with the corresponding subtitle text. The segmented audio and video, together with the subtitle text, are then input into the V2EM-RoBERTa model for multimodal emotion analysis.

The visual modality of the V2EM-RoBERTa model uses the RepVGG network, whose input face images are of size 48*48. Because of the various resolutions and diverse face image sizes of short videos on the Bilibili platform, we need preprocessing operations to standardize the data format; the data format preprocessing module is essential for the system's adaptability to short videos with different resolutions. Its workflow is as follows. We use the FFmpeg tool to convert mp4 to avi and mp3 to wav. Then, according to our statistics of the bili-news dataset, there are four types of short video resolution, covering both landscape and portrait orientations. After several experiments, we devised a compression strategy for the different types of videos, as shown in Table III.

TABLE III
THE COMPRESSION STRATEGY OF SHORT VIDEOS WITH DIFFERENT RESOLUTIONS

Original video resolution     Target video resolution
(470∼490)*(550∼570)           180*224
(845∼865)*(470∼490)           214*120
(470∼490)*(840∼860)           120*214
(1070∼1090)*(1910∼1930)       144*216

The facial detection input to the V2EM-RoBERTa model is shown in Figure 5: the left image is a face detected in the dataset used for model training, while the right image is a face detected in a short video during model inference. Our compression strategy ensures comparable face image sizes during both training and inference.

Fig. 5. Example of the similar resolution of the face area after our data format preprocessing module (the left image is the input during training on the IEMOCAP dataset, and the right image is the input from a short video during inference on the bili-news dataset).
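A minimal sketch of the preprocessing step described above, assuming FFmpeg is installed; the file names and the 180*224 target are illustrative (in practice the target is chosen per Table III):

import subprocess

def preprocess(video_in="input.mp4", audio_in="input.mp3"):
    # Rescale the video to a target resolution from Table III and convert to AVI.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_in, "-vf", "scale=180:224", "video.avi"],
        check=True,
    )
    # Convert the audio track to WAV for the acoustic branch.
    subprocess.run(
        ["ffmpeg", "-y", "-i", audio_in, "audio.wav"],
        check=True,
    )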
IV. EXPERIMENTS

A. Statistical Analysis of the Bili-news Dataset

The bili-news dataset has four distinctive characteristics: (a) Definite emotion: short videos have a distinct and strong emotion, with a balanced ratio of positive to negative emotions. (b) Diverse durations: the dataset covers a variety of short video durations. (c) Bilingual content: the short videos are in English and Chinese. (d) Various posters: the short videos are posted by different media institutions. The short videos in this dataset show remarkable diversity, aligning well with the rich variety of short video content typically found on social media platforms, which facilitates objective evaluation. A partial screenshot of the bili-news dataset is shown in Figure 6.

Fig. 6. Example snapshots of short videos from our bili-news dataset.

TABLE IV
STATISTICAL ANALYSIS OF THE BILI-NEWS DATASET (columns: Label, Duration, Language, Poster)

In this section, we summarize the emotion annotations of the short videos in the bili-news dataset; the statistics are shown in column one of Table IV. There are 236 positive annotations, 185 negative annotations, and only 20 uncertain annotations, which verifies that the short videos in this dataset carry clear emotions.

Regarding duration, if a short video is too short, it may not effectively convey emotion and might lack the ability to guide or propagate emotions. Conversely, if the duration is excessively long, the video's emotion might change halfway through, potentially conveying positive emotion in the first half and negative emotion in the latter half. Considering the research aims of this paper, we select short videos with durations below three minutes, ensuring that the emotion remains consistent throughout. Among the short videos in the bili-news dataset, 40 last less than one minute, 58 last one to one and a half minutes, 37 last one and a half to two minutes, seven last two to two and a half minutes, and five last longer than two and a half minutes. The duration distribution of the bili-news dataset is shown in column two of Table IV.

The languages of this dataset include both Chinese and English. During dataset construction, we did not restrict the language, except that each short video is required to contain only one language. The bili-news dataset thus contains 115 short videos in Chinese and 32 in English, a ratio of about 4:1, as shown in column three of Table IV.

The short videos in this dataset are posted by both state media and we media. The dataset encompasses content from 28 prominent Bilibili accounts, including six we media accounts and 22 state media accounts. Specifically, 94 short videos are posted by CCTV News, seven by Phoenix Satellite TV, and six by CGTN. The detailed distribution of short video posters is shown in column four of Table IV.

B. Ablation Study of the Automatic Segmentation and Transcription Module

To validate the necessity of the automatic segmentation and transcription module introduced in Section III-A, we compared the effect of non-transcribed and transcribed textual inputs on the trained multimodal emotion analysis model. For the non-transcribed case, we used the crawled title of the short video as the textual input. We transferred the multimodal emotion analysis model trained on the CMU-MOSEI dataset (labels from -3 to -1 marked as negative and labels from 1 to 3 marked as positive) to the bili-news dataset for these experiments, using the crawled titles and the subtitles transcribed by our method as the two textual inputs. Using the text generated by this module instead of the title improves the accuracy of the multimodal emotion analysis model by up to about 10.01%, as evidenced by the experimental results in Table V.

TABLE V
THE RESULTS OF DIFFERENT TEXTUAL INPUTS (TITLE AND TRANSCRIPTION)

Textual input                  ACC-2 (%)   Precision (%)   Recall (%)   F1 (%)
Non-transcribed (Title)        74.82       51.82           86.36        64.77
Transcribed (Transcription)    82.31       55.37           89.33        68.37
To further validate the effectiveness of the automatic segmentation and transcription method for end-to-end short video multimodal emotion analysis, we compared manually transcribed subtitles with subtitles generated by our method as textual inputs to the optimal multimodal emotion analysis model trained on the CMU-MOSEI dataset, testing on the CMU-MOSEI test set. The experimental results are shown in Table VI. In terms of accuracy, there is little difference between manual and automatic transcription. In terms of precision, automatic transcription outperforms manual transcription by 6.63%. However, automatic transcription performs worse than manual transcription in recall and F1 score; since the quality of manual transcription should be higher than that of automatic transcription, it is expected that manual transcription yields better results overall. Our experiments confirm that the automatic transcription method achieves an effect similar to manual transcription in terms of accuracy and precision while saving considerable labor cost.

TABLE VI
THE RESULTS OF DIFFERENT TEXTUAL INPUTS (MANUAL TRANSCRIPTION AND AUTOMATIC TRANSCRIPTION)

Transcription   ACC-2 (%)   Precision (%)   Recall (%)   F1 (%)
Manual          82.08       78.17           88.81        83.15
Automatic       82.91       83.35           82.06        82.70

C. Performance and Computational Efficiency Analysis of the V2EM-RoBERTa Model

Our experiments use an Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz and an Nvidia RTX 3090 GPU; only one graphics card and the CPU are used for training the model, and only the CPU is used for inference.

1) Multimodal Emotion Analysis Experiment

We conducted optimization experiments for the text modality of the V2EM model on the IEMOCAP dataset [3] and the CMU-MOSEI dataset [5]. On the IEMOCAP dataset, we extract video frames at a rate of 800 frames per second. The number of epochs is set to 30, the batch size to 1, and the gradient accumulation to 4. We plug several commonly used small language models into the V2EM model for textual feature extraction. As mentioned in Section III-B, we try the pretrained small language models ALBERT [31], GPT2 [32], BART [33], DistilBERT [34], and RoBERTa [23]. The results are shown in Table VII.

TABLE VII
RESULTS OF MULTIMODAL EMOTION ANALYSIS WITH DIFFERENT SMALL LANGUAGE MODELS FOR TEXTUAL FEATURE EXTRACTION BASED ON THE V2EM MODEL ON THE IEMOCAP DATASET

Text Model                 ACC-2    Recall   Precision   F1       AUC      Parameters   Training Time
Albert-base-v2             0.8023   0.6696   0.4515      0.5335   0.8412   11M          8.89h
GPT2                       0.7508   0.5527   0.3447      0.4234   0.7416   137M         8.56h
BART                       0.7833   0.5958   0.4076      0.4767   0.7951   139M         9.2h
distilbert-base-uncased    0.8023   0.5953   0.4504      0.5043   0.8070   67M          8.30h
RoBERTa-base               0.8372   0.6585   0.5208      0.5755   0.8587   125M         9.08h

On the CMU-MOSEI dataset, due to the long duration of some videos and the memory limitation of the graphics card, we extract a fixed set of 10 video frames per video as the visual-modality input. The other experimental parameters remain consistent with those used for IEMOCAP, and the results are shown in Table VIII. We find that using the pre-trained RoBERTa text model [23] improves accuracy by approximately 4.17% compared with the base model; however, training takes longer because RoBERTa has more parameters than ALBERT.

TABLE VIII
RESULTS OF MULTIMODAL EMOTION ANALYSIS EXPERIMENTS WITH DIFFERENT SMALL LANGUAGE MODELS FOR TEXTUAL FEATURE EXTRACTION BASED ON THE V2EM MODEL ON THE CMU-MOSEI DATASET

Text Model                 ACC-2    Recall   Precision   F1       AUC      Parameters   Training Time
Albert-base-v2             0.7141   0.6137   0.3651      0.4553   0.7254   11M          37.28h
GPT2                       0.6659   0.5617   0.3046      0.3935   0.6538   137M         36.89h
BART                       0.6995   0.6088   0.3596      0.4417   0.7951   139M         38.45h
distilbert-base-uncased    0.7270   0.5686   0.3716      0.4431   0.7187   67M          38.30h
RoBERTa-base               0.7328   0.6142   0.3933      0.4722   0.7437   125M         38.40h

2) Textual Modality Emotion Analysis Experiment

To validate that the RoBERTa model is more effective than ALBERT and the other small language models, we conducted a text-only experiment on the IEMOCAP dataset. The experimental parameters are the same as in the previous experiments, and the results are shown in Table IX. The results show that the RoBERTa model performs best on the text modality.

TABLE IX
RESULTS OF TEXT-ONLY EMOTION ANALYSIS EXPERIMENTS WITH DIFFERENT SMALL LANGUAGE MODELS FOR TEXTUAL FEATURE EXTRACTION ON THE IEMOCAP DATASET

Text Model                 ACC-2    Recall   Precision   F1       AUC      Parameters   Training Time
Albert-base-v2             0.8087   0.5681   0.4483      0.4906   0.8027   11M          3.0h
GPT2                       0.6707   0.5906   0.2961      0.3793   0.6908   137M         2.6h
BART                       0.8083   0.5758   0.4472      0.4945   0.8168   139M         3.76h
distilbert-base-uncased    0.8148   0.6129   0.4877      0.5300   0.8382   67M          1.71h
RoBERTa-base               0.8462   0.5903   0.5354      0.5595   0.8442   125M         2.78h
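For reference, the batch-size-1, accumulate-4 update schedule used in the experiments above can be sketched as follows; the model, data loader, loss, and optimizer are placeholders rather than our exact training code.

import torch

def train_one_epoch(model, loader, optimizer, criterion, accumulation_steps=4):
    """Batch size 1 with gradient accumulation of 4, as in our training setup."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader):
        loss = criterion(model(inputs), labels) / accumulation_steps  # scale for averaging
        loss.backward()                                               # accumulate gradients
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()                                          # one update per 4 samples
            optimizer.zero_grad()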
3) Multimodal Experiments Integrated with Large Language Models

As mentioned in Section III-B, we also conducted experiments with several state-of-the-art large language models, namely bloomz [35], mt0 [35], and flan-t5 [36], on the IEMOCAP dataset. In these experiments, we used a unified prompt to let the pre-trained large language models perform inference for the text modality, while the other modalities were trained and tested in the same way as in the V2EM model. The results are shown in Table X. Using large language models not only requires longer training time but also yields worse results than small language models; therefore, for current multimodal emotion analysis tasks, employing small language models proves to be the better choice.

TABLE X
RESULTS OF MULTIMODAL EMOTION ANALYSIS EXPERIMENTS INTEGRATED WITH LARGE LANGUAGE MODELS AND SMALL LANGUAGE MODELS ON THE IEMOCAP DATASET

Text Model              ACC-2    Recall   Precision   F1       AUC      Parameters   Training Time
Albert-base-v2 (SLM)    0.8023   0.6696   0.4515      0.5335   0.8412   11M          8.89h
RoBERTa-base (SLM)      0.8372   0.6585   0.5208      0.5755   0.8587   125M         2.78h
bloomz-560m (LLM)       0.7530   0.5850   0.3766      0.4497   0.7601   560M         9.53h
mt0-base (LLM)          0.7411   0.6487   0.3668      0.4612   0.7690   580M         13.07h
flan-t5-base (LLM)      0.7483   0.5848   0.3531      0.4349   0.7492   248M         11.10h
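The unified prompt itself is not reproduced in this paper; the sketch below only illustrates the general pattern of LLM text-modality inference with flan-t5-base via Hugging Face Transformers, using an assumed prompt wording and label mapping.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def llm_text_sentiment(utterance):
    # Assumed prompt wording; the model answers in free text, which we map to a label.
    prompt = ("Is the sentiment of the following utterance positive or negative?\n"
              f"{utterance}\nAnswer:")
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=5)
    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True).lower()
    return "positive" if "positive" in answer else "negative"

print(llm_text_sentiment("I can't believe they actually did this, it's outrageous."))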
D. The Test of the MSEVA System

1) Comprehensive Emotion Analysis Interface for Short Videos

Clicking the emotion analysis button performs real-time emotion analysis for a short video and gives a complete appraisal of its emotions. In this test, we inputted the short video titled "The female anchor tearfully recalls the fear of the interview" from the bili-news dataset. As shown in Figure 7, the interface first shows the overall emotion of the video, indicating whether it is positive or negative, and then shows specific scores for different emotions, with the highest score being the final result. For this example, the result is sad, which aligns with our subjective judgment and the label.

Fig. 7. The interface of the multimodal emotion analysis for short videos.

To provide finer analysis, our system also performs temporal analysis, as the emotion of a short video may vary over its duration. Leveraging the end-to-end multimodal emotion analysis module based on the pre-trained V2EM-RoBERTa model, we automatically segment the short video into sentences and feed them into the module for sentence-level inference. We use an emotion fluctuation graph for visual representation, enhancing the comprehension of the short video's emotional trajectory; the interface is depicted in Figure 8.

Fig. 8. The interface of the temporal emotion analysis for short videos.

In addition, the system offers emotion analysis results for each modality, allowing short video emotions to be examined from various perspectives. We use a decision-level fusion approach for the final result, which means the results of the individual modalities are linearly combined. The per-modality emotion analysis results are shown in Figure 9.

Fig. 9. The interface of each modality emotion analysis for short videos.

2) The Performance of the MSEVA System

We tested the emotion analysis of the MSEVA system on the bili-news dataset, which consists of 147 short videos, including 62 negative videos and 85 positive videos. The result is shown in Table XI. Comparing the system's emotion analysis results with the labels in the bili-news dataset, the accuracy and F1 score were 76.2% and 81.5%, respectively.

TABLE XI
THE PERFORMANCE OF THE EMOTION ANALYSIS OF THE MSEVA SYSTEM

                          System: Positive   System: Negative
Ground Truth: Positive    77                 8
Ground Truth: Negative    27                 35

The analysis errors of the MSEVA system appear in cases where the subject criticizes and warns against negative behaviors in a humorous way or with a lighthearted broadcasting style. In such cases, even though the subject's speech contains negative vocabulary such as harm, prohibition, or punishment, the video is still recognized as positive by the MSEVA system. When the subject's speaking style, tone, and content are consistent, the model attains more accurate recognition.

The wrong case: "There is no star when it comes to legal issues; 'Deng Lun' needs to speak and act with caution in life." Case study: the main content of this short video is the news that the star Deng Lun was fined for tax evasion. Although the anchor's broadcast style is very serious and the speaking content is a relatively heavy topic, the audience will feel that he deserves his punishment, which is positive, so the analysis result of the model is wrong.

The correct case: "Ouyang Xiadan: The suspect who beat a 9-year-old boy to death was detained, and mental problems are not 'immunity'." Case study: this short video is mainly about the suspect of a violent incident suffering from mental illness. In view of this social problem, the anchor calls for strengthening the treatment and supervision of patients with mental illness to prevent them from causing serious social harm. The emotion of the short video is negative, and the result of the MSEVA model is correct.

TABLE XII
VIDEO SCREENSHOTS OF THESE CASES (WRONG CASE ON THE LEFT, CORRECT CASE ON THE RIGHT)

Table XII shows the video screenshots of these cases.
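As a quick check, the reported accuracy and F1 score can be reproduced from the confusion matrix in Table XI; a small sketch using scikit-learn:

from sklearn.metrics import accuracy_score, f1_score

# Expand the Table XI confusion matrix into label lists
# (ground-truth rows: 85 positive and 62 negative short videos).
y_true = ["positive"] * 77 + ["positive"] * 8 + ["negative"] * 27 + ["negative"] * 35
y_pred = ["positive"] * 77 + ["negative"] * 8 + ["positive"] * 27 + ["negative"] * 35

print(accuracy_score(y_true, y_pred))                   # ~0.762
print(f1_score(y_true, y_pred, pos_label="positive"))   # ~0.815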
V. CONCLUSION AND FUTURE WORK

In this paper, we propose an automatic segmentation and transcription method that supports multilingual videos and improves the efficiency of multimodal dataset construction, making short video emotion analysis more usable in practice. We construct the first multimodal emotion analysis dataset of short videos, bili-news, which includes annotations of the overall emotion of each short video; the dataset is openly accessible. In addition, we achieved an improvement of approximately 4.17% in the accuracy of the multimodal emotion analysis model based on V2EM [2]. We conducted several experiments on pretrained small language models and current large language models, validating the importance of small language models for multimodal emotion analysis and showing that large language models cannot completely replace them at present. Finally, we propose the MSEVA system, designed for end-to-end visual emotion analysis of short videos. This system uses a multimodal emotion analysis model trained on the CMU-MOSEI dataset and was tested on the bili-news dataset. The experimental results show that the system is effective and significant for real-life applications.

There are some limitations in this work. The relatively limited number of short videos in the bili-news dataset could be expanded while ensuring data standardization; in the future, we can fine-tune the multimodal emotion analysis model on the expanded bili-news dataset to enhance its performance. The performance of the MSEVA system needs to be further improved and made more applicable to real short videos on the platforms. Moreover, the computational time of the current system is high and requires further optimization.

ACKNOWLEDGMENTS

This study was supported by the National Social Science Foundation of China (No. 62301510) and the Fundamental Research Funds for the Central Universities (No. CUC23GZ005 and No. CUC23ZDTJ004).

REFERENCES

[1] Shuai Yang, Yuzhen Zhao, and Yifang Ma. "Analysis of the reasons and development of short video application - Taking Tik Tok as an example". In: Proceedings of the 2019 9th International Conference on Information and Social Science (ICISS 2019), Manila, Philippines. 2019, pp. 12-14.
[2] Qinglan Wei, Xuling Huang, and Yuan Zhang. "FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference". In: IEEE Transactions on Broadcasting 69.1 (2022), pp. 10-20.
[3] Carlos Busso et al. "IEMOCAP: Interactive emotional dyadic motion capture database". In: Language Resources and Evaluation 42 (2008), pp. 335-359.
[4] Amir Zadeh et al. "CMU-MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French". In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Vol. 2020. NIH Public Access. 2020, p. 1801.
[5] AmirAli Bagher Zadeh et al. "Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph".
In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2236-2246.
[6] Amir Zadeh. "Micro-opinion sentiment intensity analysis and summarization in online videos". In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. 2015, pp. 587-591.
[7] Md Kamrul Hasan et al. "UR-FUNNY: A multimodal language dataset for understanding humor". In: arXiv preprint arXiv:1904.06618 (2019).
[8] Wenmeng Yu et al. "CH-SIMS: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality". In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, pp. 3718-3727.
[9] Miriam Redi et al. "6 seconds of sound and vision: Creativity in micro-videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 4272-4279.
[10] Jianglong Zhang et al. "Shorter-is-better: Venue category estimation from micro-video". In: Proceedings of the 24th ACM International Conference on Multimedia. 2016, pp. 1415-1424.
[11] Jingyuan Chen et al. "Micro tells macro: Predicting the popularity of micro-videos via a transductive model". In: Proceedings of the 24th ACM International Conference on Multimedia. 2016, pp. 898-907.
[12] Zhulin Tao et al. "MGAT: Multimodal graph attention network for recommendation". In: Information Processing & Management 57.5 (2020), p. 102277.
[13] Peng Qi et al. "FakeSV: A multimodal benchmark with rich social context for fake news detection on short video platforms". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. 12. 2023, pp. 14444-14452.
[14] Hang-Bong Kang. "Affective content detection using HMMs". In: Proceedings of the Eleventh ACM International Conference on Multimedia. 2003, pp. 259-262.
[15] Hee Lin Wang and Loong-Fah Cheong. "Affective understanding in film". In: IEEE Transactions on Circuits and Systems for Video Technology 16.6 (2006), pp. 689-704.
[16] Martin Wöllmer et al. "YouTube movie reviews: Sentiment analysis in an audio-visual context". In: IEEE Intelligent Systems 28.3 (2013), pp. 46-53.
[17] Ha-Nguyen Tran and Erik Cambria. "Ensemble application of ELM and GPU for real-time multimodal sentiment analysis". In: Memetic Computing 10 (2018), pp. 3-13.
[18] Brendan Jou, Subhabrata Bhattacharya, and Shih-Fu Chang. "Predicting viewer perceived emotions in animated GIFs". In: Proceedings of the 22nd ACM International Conference on Multimedia. 2014, pp. 213-216.
[19] Zhengyuan Yang, Yixuan Zhang, and Jiebo Luo. "Human-centered emotion recognition in animated GIFs". In: 2019 IEEE International Conference on Multimedia and Expo (ICME). IEEE. 2019, pp. 1090-1095.
[20] Sicheng Zhao et al. "An end-to-end visual-audio attention network for emotion recognition in user-generated videos". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 01. 2020, pp. 303-311.
[21] Wenliang Dai et al. "Multimodal end-to-end sparse model for emotion recognition". In: arXiv preprint arXiv:2103.09666 (2021).
[22] Alec Radford et al. "Robust speech recognition via large-scale weak supervision". In: International Conference on Machine Learning. PMLR. 2023, pp. 28492-28518.
[23] Yinhan Liu et al. "RoBERTa: A robustly optimized BERT pretraining approach". In: arXiv preprint arXiv:1907.11692 (2019).
[24] Cheng Peng et al.
"Cross-modal complementary network with hierarchical fusion for multimodal sentiment classification". In: Tsinghua Science and Technology 27.4 (2021), pp. 664-679.
[25] Sanghyun Lee, David K Han, and Hanseok Ko. "Multimodal emotion recognition fusion analysis adapting BERT with heterogeneous feature unification". In: IEEE Access 9 (2021), pp. 94557-94572.
[26] Ziming Li et al. "AMOA: Global acoustic feature enhanced modal-order-aware network for multimodal sentiment analysis". In: Proceedings of the 29th International Conference on Computational Linguistics. 2022, pp. 7136-7146.
[27] Anupama Ray et al. "A multimodal corpus for emotion recognition in sarcasm". In: arXiv preprint arXiv:2206.02119 (2022).
[28] Ziming Li et al. "QAP: A quantum-inspired adaptive-priority-learning model for multimodal emotion recognition". In: Findings of the Association for Computational Linguistics: ACL 2023. 2023, pp. 12191-12204.
[29] Di Wang et al. "TETFN: A text enhanced transformer fusion network for multimodal sentiment analysis". In: Pattern Recognition 136 (2023), p. 109259.
[30] Wenxuan Zhang et al. "Sentiment analysis in the era of large language models: A reality check". In: arXiv preprint arXiv:2305.15005 (2023).
[31] Zhenzhong Lan et al. "ALBERT: A lite BERT for self-supervised learning of language representations". In: arXiv preprint arXiv:1909.11942 (2019).
[32] Alec Radford et al. "Language models are unsupervised multitask learners". In: OpenAI Blog 1.8 (2019), p. 9.
[33] Mike Lewis et al. "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension". In: arXiv preprint arXiv:1910.13461 (2019).
[34] Victor Sanh et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". In: arXiv preprint arXiv:1910.01108 (2019).
[35] Niklas Muennighoff et al. "Crosslingual generalization through multitask finetuning". In: arXiv preprint arXiv:2211.01786 (2022).
[36] Hyung Won Chung et al. "Scaling instruction-finetuned language models". In: arXiv preprint arXiv:2210.11416 (2022).