TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes

Xuying Zhang  Bo-Wen Yin  Yuming Chen  Zheng Lin  Yunheng Li  Qibin Hou*  Ming-Ming Cheng
VCIP, School of Computer Science, Nankai University
* Qibin Hou is the corresponding author.
arXiv:2312.04248v1 [cs.CV] 7 Dec 2023

Abstract

Recent progress in the text-driven 3D stylization of a single object has been considerably promoted by CLIP-based methods. However, the stylization of multi-object 3D scenes is still impeded in that the image-text pairs used for pre-training CLIP mostly consist of a single object. Meanwhile, the local details of multiple objects may be susceptible to omission because the existing supervision manner relies primarily on coarse-grained contrast of image-text pairs. To overcome these challenges, we present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles under contrast supervision at multiple levels. We first propose a Decoupled Graph Attention (DGA) module to distinguishably reinforce the features of 3D surface points. In particular, a cross-modal graph is constructed to accurately align the object points and the noun phrases decoupled from the 3D mesh and the textual description, respectively. Then, we develop a Cross-Grained Contrast (CGC) supervision system, in which a fine-grained loss between the words in the textual description and the randomly rendered images is constructed to complement the coarse-grained loss. Extensive experiments show that our method can synthesize high-quality stylized content and outperforms the existing methods over a wide range of multi-object 3D meshes. Our code and results will be made publicly available.

Figure 1 (columns: Bare Mesh, TANGO, Our TeMO; rows: two dragons, "a fire dragon and an ice dragon"; person & dragon, "an iron man and an ice dragon"). Visual comparisons between the existing 3D stylization methods (e.g., TANGO [14]) and our TeMO in multi-object scenes. For a scene with multiple objects of the same/different categories, existing methods are prone to interference between different properties of the objects, while our TeMO is able to accurately synthesize the desired stylized content for each object.

1. Introduction

3D asset creation through stylization aims to synthesize stylized content on bare meshes to conform to given text descriptions [14, 22], reference images [34, 45], or 3D shapes [42]. This research plays an important role in a wide spectrum of applications, e.g., virtual/augmented reality [4, 8], the gaming industry [43], and robotics [12], and it has attracted increasing attention in the computer vision and graphics communities. Considering the ready availability and expressiveness of text prompts as well as the popularity of the large-scale Contrastive Language-Image Pre-training (CLIP) [25] model, we choose to work with text-driven 3D stylization.

Recent years have witnessed the emergence of a series of impressive works [7, 14, 20, 22] that drive the advancement of text-driven 3D stylization. Existing methods usually adopt multi-layer perceptrons (MLPs) to predict the location attribute displacements of the bare mesh under the supervision of the contrastive loss in CLIP. We observe that these works focus on the stylization of a single 3D object and perform poorly on multiple objects, as shown in the second column of Fig. 1. We argue that two inherent characteristics of CLIP cause this issue: i) CLIP is mainly pre-trained with image-text pairs that mostly contain a single object; ii) the CLIP loss employs global representation vectors of images and text to coarsely match the two modalities, which inevitably causes the loss of local details. Consequently, the key to synthesizing the desired styles for multiple 3D objects lies in parsing such 3D scenes and in multi-grained supervision for detail refinement.

To simultaneously generate stylized content for multiple 3D objects, the primary step is to achieve accurate alignment between the objects in the 3D mesh and those in the target text. However, existing methods employ the global semantics of the text to stylize a single object, which inevitably produces noise when stylizing the objects in multi-object scenes. To overcome this challenge, we propose to parse the 3D scene by introducing a Decoupled Graph Attention (DGA) module. Specifically, all noun phrases are decoupled from the text prompt, and the mesh surface points of the current view are likewise divided into several clusters. Then, a cross-modal graph is constructed to connect the noun phrases to their corresponding object points while distancing them from irrelevant ones. This graph enables accurate interaction between the two interrelated modalities. Finally, the surface point features of the 3D objects are reinforced by independent cross-attention fusions with their neighboring word nodes in the graph.

Furthermore, we design a Cross-Grained Contrast (CGC) loss to perform comprehensive cross-modal supervision for the stylization of multiple objects. The goal is to guide the network to generate more stylization details for multiple 3D objects to match the target text. Our loss consists of two parts, i.e., a coarse-grained contrast and a fine-grained contrast. In the former, the text prompt is regarded as sentence-level supervision, and the similarity between the 2D views rendered from the stylized 3D mesh and the text prompt is calculated with the global feature vectors from the CLIP model. In the latter, we view the text prompt at the word level and consider the similarities between each word of the sentence and the rendered images of the view set. Specifically, we produce the word representations of the text prompt by taking the hidden states from the text encoder of CLIP. Motivated by recent progress in video-text retrieval [19], we calculate the fine-grained loss via a weighted summation of the elements in the similarity vectors based on the importance of each word or image.

Based on the well-designed DGA module and CGC loss, we propose a novel framework towards text-driven 3D stylization for multi-object meshes, called TeMO. To validate the effectiveness of our TeMO, extensive experiments are conducted on various multi-object 3D scenes, as shown in the 3rd column of Fig. 1. The experimental results demonstrate that our TeMO is less susceptible to interference from multiple objects and can generate superior stylized assets compared with the existing 3D stylization methods. Our contributions can be summarized as follows:
• We present a new 3D stylization framework, called TeMO. To the best of our knowledge, it is the first attempt to parse the objects in the text and 3D meshes and generate stylizations for multi-object scenes.
• We propose a Decoupled Graph Attention (DGA) module, which constructs a graph structure to align the surface points in the multi-object mesh with the noun phrases in the text prompt.
• We design a Cross-Grained Contrast (CGC) loss, in which the text is contrasted with the rendered images at both the sentence and word levels.

2. Related Work

2.1. Text-Driven 3D Manipulation

Generating or editing 3D content according to a given prompt is a long-standing objective in computer vision and graphics. Among all forms of prompts, text has garnered the most conspicuous attention for three reasons: i) text descriptions are readily accessible from existing corpora; ii) text descriptions are particularly user-friendly, since they are easily modifiable and can effectively express complex concepts related to stylization; iii) the popularity of the large-scale CLIP [25] model has made visual-language supervision practical. Text2Mesh [22] proposes a neural style field network to predict the color and displacement of mesh vertices. TANGO [14] proposes to disentangle the appearance style into the spatially varying bidirectional reflectance distribution, the local geometric variation, and the lighting condition. Then, X-Mesh [20] integrates the target text guidance by utilizing text-relevant spatial and channel-wise attention during vertex feature extraction. Motivated by the remarkable progress in text-driven 2D generation [26, 28], TEXTure [27] and Text2Tex [6] incorporate a pre-trained depth-aware image diffusion model to progressively synthesize high-resolution partial textures from multiple viewpoints. To make full use of the priors in the pre-trained 2D text-to-image diffusion model, DreamFusion [24] introduces a Score Distillation Sampling (SDS) loss to perform text-to-3D synthesis. With the help of the SDS loss, Latent-NeRF [21] and Fantasia3D [7] can generate 3D shapes and appearances for 3D objects. Despite achieving impressive results, these methods focus on the stylization of a single 3D object and rarely explore multi-object scenes. CLIP-Mesh [23] attempts to generate multiple 3D objects for a target text; nevertheless, the resulting content is not satisfactory. In this paper, we parse the objects described in the rendered images and the text prompts and align them with two well-designed strategies.

2.2. Attention Mechanism

The concept of the attention mechanism was initially introduced in neural machine translation [1], where a weighted summation of candidate vectors is calculated according to their importance scores. This technique has been extended to a myriad of tasks, e.g., natural language processing [9, 18, 33], computer vision [13, 15, 40, 48], and multi-modal learning [17, 36, 41, 47]. For instance, Transformer [33] employs self-attention to establish connections between words within a sentence and utilizes cross-attention to align source and target sentences. The non-local network [35] takes the lead in introducing self-attention to computer vision and achieves great success in video understanding and object detection. ViT [10] treats an image as a sequence of patches and employs a Transformer encoder based on self-attention to perform image classification. Swin Transformer [15] introduces shifted windows to enhance the local perception ability of self-attention. More recently, X-Mesh [20] designs a text-guided dynamic attention mechanism for the vertex feature extraction of a 3D object. However, this guidance relies only on a single text feature vector and does not consider the parsing of the text and the 3D scene. In this paper, the multiple objects decoupled from the target text and the 3D mesh are aligned via a cross-modal graph to achieve precise guidance.
2.3. Multi-modal Contrastive Learning

Contrastive learning has become an increasingly popular research topic in the multi-modal community due to its ability to align representations of different modalities. Based on this strategy, CLIP [25] is pre-trained on an abundance of image-text pairs and achieves great success in cross-modal supervision. TACo [38] presents a token-aware cascade contrastive learning scheme based on the syntactic classes of words to achieve fine-grained semantic alignment in text-video retrieval. Concurrently, FILIP [39] proposes comparing image patches with the words in the sentence. For text-driven 3D stylization, the CLIP loss, which computes the similarity between the image and text vectors in the embedding space of CLIP, is adopted by the vast majority of methods. Although achieving impressive results when stylizing a single object, these methods cannot be well adapted to scenes with multiple 3D objects. We argue that an important reason for this issue is the loss of local details caused by such coarse-grained supervision. In this paper, we propose a cross-grained supervision strategy, which combines fine-grained and coarse-grained contrasts to achieve a more precise semantic alignment between the rendered images and the text.

Figure 2 (prompt: "a superman, a fire dragon, and an ice whale"; pipeline: mesh → ray casting → DGA → diffuse/roughness/specular/normal prediction → rendering with spherical-Gaussian lighting; text prompt → text encoder → DGA; rendered views and text compared by the CGC loss). The overall architecture of the proposed TeMO framework. We first specify several cameras to cast rays toward the objects in the 3D mesh scene. Then, a surface point x_p and a normal n_p are attained from each ray that intersects the objects. These points and normals are fed to the attribute prediction network, where the features of the 3D objects are parsed and interact with the decoupled text features via our proposed DGA module. Meanwhile, we employ a series of spherical Gaussians to represent the lighting. Finally, a differentiable SG renderer is adopted to render images, which are contrasted with the text prompt by our designed CGC loss.

3. Methodology

3.1. Overall Architecture

Fig. 2 shows the end-to-end architecture of our TeMO framework. Given a bare mesh and a text prompt containing multiple objects, TeMO aims to synthesize stylization on the mesh to match the text description. We employ a set of vertices V ∈ R^{e×3} and faces F ∈ {1, ..., e}^{u×3} to explicitly define the input triangle mesh, which is fixed throughout training. Following TANGO [14], we disentangle the appearance style into the spatially varying bidirectional reflectance distribution function [3, 44, 46] (including diffuse, roughness, and specular terms), the local geometric variation (normal map), and the lighting condition. We start by normalizing the vertex coordinates to lie inside a unit sphere. Then, we randomly sample points around the mesh from a Gaussian distribution as camera positions to render images. Next, we obtain a camera ray R_p = {c + t\nu_p} from the sampled camera position c and a pixel p in the rendered image, where \nu_p is the direction of the ray. Further, ray casting [29] is used to find the first intersection point and intersection face between the ray and the mesh, and the normal n_p ∈ R^3 of the intersection face is employed as the surface normal at the point x_p ∈ R^3.

To achieve multiview-consistent features, our TeMO is restricted to predicting the normal displacement as a function of the location only, while allowing the color materials to be predicted as a function of both location and viewing direction. Therefore, our TeMO, represented as MLPs, includes two branches, i.e., a normal branch f_n(·) and a reflectance branch f_r(·). Specifically, the former predicts the normal offset at the point x_p, and the latter predicts the surface reflectance coefficients of the material at the location x_p, i.e., diffuse, roughness, and specular. To synthesize high-frequency details, we also apply Fourier positional encoding [31] to every input. In addition, a spherical Gaussian is employed to represent each light intensity L_i(·) due to its closed-form nature and analytical solution. Based on the attained geometric and appearance components, each pixel color in the rendered image can be calculated by a hemisphere renderer [14]:

L_p(\nu_p, x_p, n_p) = \int_{\Omega} L_i(w_i) \, f_r(\nu_p, w_i, x_p) \, (w_i \cdot \hat{n}_p) \, dw_i,   (1)

\hat{n}_p = n_p + f_n(x_p, n_p),   (2)

where \Omega = \{w_i : w_i \cdot \hat{n}_p \ge 0\} denotes the hemisphere, w_i is the incident light direction, and \hat{n}_p is the estimated normal at the surface point x_p.
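To make the two-branch design above concrete, the following is a minimal PyTorch-style sketch of such a neural style field with 256-dimensional hidden layers (as later specified in the implementation details). The module structure, the bounded normal offset, and the channel split of the reflectance head are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

def fourier_encode(x, num_freqs=6):
    # Fourier positional encoding [31]: append sin/cos of the input at several frequencies.
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32, device=x.device)) * torch.pi
    angles = x[..., None] * freqs                        # (..., 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return torch.cat([x, enc.flatten(-2)], dim=-1)       # (..., 3 + 6 * num_freqs)

def mlp(in_dim, out_dim, hidden=256, depth=3):
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class StyleField(nn.Module):
    """Sketch of the two-branch attribute prediction network:
    f_n predicts the normal offset of Eq. (2); f_r predicts reflectance coefficients."""
    def __init__(self, num_freqs=6):
        super().__init__()
        in_dim = 3 + 6 * num_freqs
        self.normal_branch = mlp(in_dim, 3)               # f_n: location -> normal offset
        self.reflectance_branch = mlp(in_dim, 3 + 1 + 3)  # f_r: assumed split diffuse(3)/roughness(1)/specular(3)

    def forward(self, x_p):
        h = fourier_encode(x_p)
        normal_offset = 0.1 * torch.tanh(self.normal_branch(h))   # small bounded displacement (assumption)
        refl = torch.sigmoid(self.reflectance_branch(h))
        diffuse, roughness, specular = refl.split([3, 1, 3], dim=-1)
        return normal_offset, diffuse, roughness, specular
```

In the full model, the reflectance branch would additionally be conditioned on the viewing direction, and the point features would first pass through the DGA module of Sec. 3.2; both are omitted here for brevity.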
3.2. Decoupled Graph Attention

To achieve text-driven stylization for multiple 3D objects, the key issue to be solved is the accurate alignment between the objects described in the text and those in the mesh. X-Mesh [20] has incorporated text-guided dynamic linear layers, in which the global representation vector of the target object in the text is utilized as guidance to acquire text-aware vertex features. Nevertheless, this global vector contains information about multiple objects and is thus prone to mutual interference, producing semantic noise when guiding multi-object scenes. To address this challenge, we propose to parse the objects in both the text and the mesh. We first extract the noun phrases modified by adjectives or prepositional phrases from the text using the NLTK toolkit [2]. Then, we employ a Gaussian Mixture Model (GMM) [50] to cluster the intersection point set {x_1, ..., x_p, ...} between the rays of the current view and the mesh. Meanwhile, we obtain a binary map of the objects in the current view based on whether each ray intersects the mesh. Further, we decouple the objects in the binary map according to the clustered points and acquire several binary maps of individual objects. Based on the disentangled noun phrases and the per-object binary maps, we match the correct pairs by their semantic similarities. As a result, the objects described in the text are aligned with their corresponding objects in the mesh, and the aligned pairs are utilized to construct a cross-modal graph G = (V, E), as shown in Fig. 3. To be specific, all surface point features and word features are treated as independent nodes to form the node set V. For the edge set E, a link between a surface point node and a word node is built only if the semantic objects they belong to are the same. A rough sketch of this parsing and graph-construction step is given below Fig. 3.

Figure 3 (text prompt: "a blue whale, an ice dragon with red eyes, and an iron man"; the surface points {x_1, ..., x_p, ..., x_l} form location nodes, and the words w_ij of the decoupled noun phrases form word nodes). Construction pipeline of the cross-modal graph architecture in our DGA module. Note that x_p, a surface point of the 3D objects, and w_ij, the j-th word in the i-th noun phrase, are connected only if they correspond to the same object.
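As a rough illustration of this parsing step, the snippet below extracts adjective-modified noun phrases with NLTK, clusters the ray-mesh intersection points with a GMM, and builds the point-word adjacency used by the cross-modal graph. The chunk grammar, the fixed number of mixture components, and the word-to-object mapping are assumptions made for illustration; the paper does not specify these details.

```python
import nltk                      # requires the punkt and averaged_perceptron_tagger data
import numpy as np
from sklearn.mixture import GaussianMixture

def extract_noun_phrases(prompt):
    # Chunk determiner/adjective/noun sequences, e.g. "an ice dragon" -> "ice dragon".
    tagged = nltk.pos_tag(nltk.word_tokenize(prompt))
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"                   # assumed chunk grammar
    tree = nltk.RegexpParser(grammar).parse(tagged)
    phrases = []
    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        words = [w for w, _ in subtree.leaves() if w.lower() not in {"a", "an", "the"}]
        phrases.append(" ".join(words))
    return phrases

def cluster_surface_points(points_xyz, num_objects):
    # GMM clustering [50] of the intersection points into per-object groups.
    gmm = GaussianMixture(n_components=num_objects, covariance_type="full", random_state=0)
    return gmm.fit_predict(points_xyz)                    # (P,) object id per surface point

def build_adjacency(point_labels, word_to_object):
    # Edge (p, w) exists only when point p and word w belong to the same object (Fig. 3).
    point_labels = np.asarray(point_labels)[:, None]      # (P, 1)
    word_to_object = np.asarray(word_to_object)[None, :]  # (1, W)
    return point_labels == word_to_object                 # (P, W) boolean adjacency

phrases = extract_noun_phrases("a blue whale, an ice dragon with red eyes, and an iron man")
labels = cluster_surface_points(np.random.randn(2048, 3), num_objects=3)   # placeholder points
# word_to_object stands in for the result of matching noun phrases to point clusters
# (e.g., via per-object binary maps and semantic similarity); the values here are hypothetical.
adj = build_adjacency(labels, word_to_object=[0, 0, 1, 1, 1, 1, 2, 2])
```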
Under the setting of this cross-modal graph, we can individually perform cross-attention between the surface point nodes and their neighboring word nodes, where the parsed surface point features are used as queries and the parsed text features serve as keys and values. The enhancement of a surface point node v_i ∈ R^{dim} can be formulated as:

\hat{v}_i = \sum_{v_j \in Adj(v_i)} \alpha_{ij} \, \mathrm{Linear}(v_j),   (3)

\alpha_{ij} = \frac{e^{W_{ij}}}{\sum_{v_k \in Adj(v_i)} e^{W_{ik}}},   (4)

W_{ij} = \frac{\mathrm{Linear}(v_i) \, \mathrm{Linear}(v_j)^{T}}{\sqrt{d_l}},   (5)

where Adj(v_i) denotes the adjacent nodes of v_i and Linear(·) represents a linear transformation. With this attention mechanism, the surface point features of different objects in the mesh can be distinguishably reinforced under the guidance of the word features in the parsed text.
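A compact way to realize Eqs. (3)-(5) is masked single-head cross-attention, where the adjacency matrix of the cross-modal graph masks out words that belong to other objects. The sketch below is an assumed minimal implementation (single head, no residual), not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledGraphAttention(nn.Module):
    """Sketch of Eqs. (3)-(5): cross-attention from surface-point nodes to their
    neighboring word nodes, restricted by the graph adjacency."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # Linear(v_i) in Eq. (5)
        self.k = nn.Linear(dim, dim)   # Linear(v_j) in Eq. (5)
        self.v = nn.Linear(dim, dim)   # Linear(v_j) in Eq. (3)
        self.scale = dim ** -0.5       # 1 / sqrt(d_l)

    def forward(self, point_feats, word_feats, adj):
        # point_feats: (P, dim), word_feats: (W, dim), adj: (P, W) boolean edge matrix.
        # Assumes every point has at least one word neighbor; otherwise its softmax row is NaN.
        logits = self.q(point_feats) @ self.k(word_feats).t() * self.scale   # W_ij, Eq. (5)
        logits = logits.masked_fill(~adj, float("-inf"))                     # keep only Adj(v_i)
        alpha = F.softmax(logits, dim=-1)                                    # Eq. (4)
        return alpha @ self.v(word_feats)                                    # Eq. (3): enhanced point features
```

How the enhanced features are fused back into the original point features (e.g., through a residual connection) before the attribute heads is not spelled out in the text, so any such fusion should be treated as a design choice.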
3.3. Cross-Grained Contrast Supervision

To guide the optimization of the neural network for 3D stylization, the first step is to render the stylized 3D mesh from multiple 2D views. Most existing methods employ the visual encoder and text encoder of CLIP [25] to extract global feature vectors for the rendered images and the target text, respectively, which are contrasted to perform cross-modal supervision via cosine similarity:

L_{coarse} = -\frac{F_I \cdot F_T}{\|F_I\|_2 \, \|F_T\|_2},   (6)

where F_I ∈ R^{512} is the averaged feature vector of the images rendered from different views, F_T ∈ R^{512} denotes the global feature vector of the target text, and ∥·∥_2 represents the Euclidean norm.

Although achieving impressive results for stylizing a single 3D object, these methods still have limitations in multi-object scenes. Since a single feature vector must represent a sentence describing multiple objects, a large amount of object detail may be lost. Therefore, such coarse-grained contrast supervision is insufficient to guide the neural network in synthesizing photorealistic stylized content for multiple 3D objects. To solve this issue, we construct a fine-grained contrast supervision to complement the coarse-grained one. Specifically, we first calculate the correlation map S ∈ R^{n×m} between the word features of the text and the visual features of the rendered images, which are also extracted by the text encoder and visual encoder of CLIP:

S = \frac{I \cdot T^{T}}{\|I\|_2 \, \|T\|_2},   (7)

where I ∈ R^{n×512} represents the features of the images rendered from n views and T ∈ R^{m×512} denotes the features of the m words in the text. Then, we normalize the correlation matrix along the image axis and the text axis, respectively, to retrieve the text of interest and the visual components:

S_I(i) = \frac{1}{m} \sum_{k=1}^{m} S(i, k),   (8)

S_T(j) = \frac{1}{n} \sum_{k=1}^{n} S(k, j).   (9)

Inspired by [19], we further calculate an image-centered fine-grained contrast score and a text-centered fine-grained contrast score by a weighted summation of the similarity vectors:

L_I = \sum_{i=1}^{n} \frac{e^{S_I(i)}}{\sum_{k=1}^{n} e^{S_I(k)}} \, S_I(i),   (10)

L_T = \sum_{j=1}^{m} \frac{e^{S_T(j)}}{\sum_{k=1}^{m} e^{S_T(k)}} \, S_T(j),   (11)

where the weights are defined by the degree of correlation between the central modality and the other modality. Finally, we adopt the average of these two scores as the fine-grained contrast loss:

L_{fine} = -\frac{L_I + L_T}{2}.   (12)

The coarse-grained and fine-grained contrast supervisions complement each other to build a cross-grained contrast supervision system: the former aligns the global semantic information of the target text with the 3D objects, and the latter achieves local semantic alignment. The overall loss is defined as:

L_{cgcs} = \lambda_c L_{coarse} + \lambda_f L_{fine},   (13)

where \lambda_c and \lambda_f are two hyper-parameters that balance the coarse-grained and fine-grained losses, set to 1.0 and 0.33, respectively.
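For reference, the following is a minimal PyTorch sketch of Eqs. (6)-(13), assuming the per-view image features, the sentence feature, and the per-word features have already been extracted with the frozen CLIP encoders; the function names and feature shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def coarse_loss(img_feats, txt_feat):
    # Eq. (6): cosine distance between the view-averaged image feature and the sentence feature.
    f_i = F.normalize(img_feats.mean(dim=0), dim=-1)          # (512,)
    f_t = F.normalize(txt_feat, dim=-1)                       # (512,)
    return -(f_i * f_t).sum()

def fine_loss(img_feats, word_feats):
    # Eqs. (7)-(12): word/view correlation map and importance-weighted scores.
    I = F.normalize(img_feats, dim=-1)                        # (n, 512), one row per rendered view
    T = F.normalize(word_feats, dim=-1)                       # (m, 512), one row per word token
    S = I @ T.t()                                             # Eq. (7): (n, m) correlation map
    S_I = S.mean(dim=1)                                       # Eq. (8): per-view score
    S_T = S.mean(dim=0)                                       # Eq. (9): per-word score
    L_I = (F.softmax(S_I, dim=0) * S_I).sum()                 # Eq. (10)
    L_T = (F.softmax(S_T, dim=0) * S_T).sum()                 # Eq. (11)
    return -(L_I + L_T) / 2                                   # Eq. (12)

def cgc_loss(img_feats, txt_feat, word_feats, lambda_c=1.0, lambda_f=0.33):
    # Eq. (13): cross-grained contrast supervision.
    return lambda_c * coarse_loss(img_feats, txt_feat) + lambda_f * fine_loss(img_feats, word_feats)
```

Both terms are negated similarities, so minimizing L_cgcs pushes the rendered views toward the prompt at both the sentence and word levels.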
4. Experiments

4.1. Experiment Setup

Datasets. To examine our method across a diverse set of 3D scenes, we first collect 3D object meshes from a variety of sources, i.e., COSEG [30], Thingi10K [49], ShapeNet [5], TurboSquid [32], and ModelNet [37]. Then, we randomly place several objects from the collected set into one mesh using Blender. Note that we down-sample the number of vertices and faces of the meshes to ensure the robustness of our TeMO to low-quality meshes and to reduce the GPU burden during stylization. The meshes used in this paper contain an average of 79,303 faces, 16% non-manifold edges, 0.2% non-manifold vertices, and 12% boundaries.

Implementation Details. Following the TANGO [14] network, we adopt 3 linear layers with 256 dimensions to build the normal estimation branch. In the reflectance branch, the point features are extracted by 2 shared layers with 256 dimensions, followed by 3 exclusive heads to predict diffuse, specular, and roughness. The dimension of our DGA module is also set to 256. The word features in our DGA module are extracted from the text encoder of CLIP, and so are the ones used in our CGC loss. We choose ViT-B/32 as the backbone of the pre-trained CLIP model, which is consistent with previous works [14, 20, 22]. We also process the rendered images with 2D augmentation strategies [11, 14] before feeding them into the pre-trained CLIP model. Our TeMO model is optimized with the AdamW [16] optimizer for 1,500 iterations, where the learning rate is initialized to 5 × 10^{-4} and decayed by a factor of 0.7 every 500 iterations. The entire training process takes about 10 minutes on a single NVIDIA RTX 3090 GPU.
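The optimization schedule described above corresponds to a few lines of standard PyTorch; the sketch below wires up the assumed optimizer and step decay with a stand-in network and loss, since the full TeMO pipeline (renderer, CLIP encoders, CGC loss) is omitted here.

```python
import torch
import torch.nn as nn

# Stand-in for the TeMO networks, used only to illustrate the assumed training schedule.
model = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, 7))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.7)  # decay 0.7 every 500 iters

for iteration in range(1500):
    x = torch.randn(1024, 3)              # placeholder surface points
    loss = model(x).pow(2).mean()         # placeholder; in TeMO this would be the CGC loss of Eq. (13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```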
4.2. Qualitative Evaluation

We conduct visualization experiments on a wide spectrum of multi-object scenes to verify the effectiveness of our TeMO. However, we observe that the 3D symmetry prior widely used in previous works [14, 22] can cause interference between different parts during the stylization of multiple objects. We attribute this to the fact that the multiple objects in our meshes are randomly placed to simulate a real 3D scene rather than aligned along the z-axis. To avoid this issue, we remove this prior in our TeMO and in the previous methods involved in the comparison.

Neural Stylization and Controls. We present the stylization results of our TeMO driven by different text prompts for the same multi-object mesh in Fig. 4. As shown in the 1st row, where the 3D scene is composed of a person and a dragon, our TeMO can accurately distinguish between the person object and the dragon object and appropriately stylize their different body parts according to the semantic roles described in each text prompt. Meanwhile, our TeMO also synthesizes the desired stylizations for the 3D objects in the cat-horse mesh and the vase-candle mesh, as shown in the 2nd and 3rd rows. Moreover, Fig. 7, Fig. 8, and Fig. 9 of the supplementary material show that our TeMO stylizes the entire 3D scene and generates renderings with accurate details. These experimental results demonstrate that our TeMO method can generate photorealistic details with fine granularity while maintaining a global semantic understanding of the given multi-object 3D scene.

Figure 4 (rows: person & dragon, "an iron man and an ice dragon" / "a superman and a fire dragon"; cat & horse, "a Garfield cat and a brown horse" / "a ginger cat and an astronaut horse"; vase & candle, "a cactus vase and a silver candle" / "a wicker vase and a candle in jeans"). Given the same bare mesh, our TeMO produces stylized contents of various types for multi-object scenes to conform to the different text prompts. Please refer to the supplementary materials for a detailed version.

Qualitative Comparisons. We first provide visual comparisons between our TeMO and previous pioneering works in text-driven 3D stylization, including Text2Mesh [22], TANGO [14], and X-Mesh [20]. To ensure a fair comparison, we adopt the official implementations of these methods and train them with their default settings without the symmetry prior. The experimental results show that Text2Mesh [22] and TANGO [14] struggle to understand the detailed semantics of a text prompt with multiple objects. As shown in the 1st row of Fig. 5, where the 3D scene contains two objects of the same category, given the text prompt "a fire dragon and an ice dragon", they tend to capture the "ice" property while missing the "fire" property. For a 3D scene containing two objects of different categories, they are prone to mixing the properties of these objects, as shown in the 2nd row, where the text prompt is "a wood vase and a brick candle". Therefore, the stylized assets they generate for these multi-object scenes are unsatisfactory. X-Mesh generates more accurate results that align with the text prompts, as shown in the 1st and 2nd rows, which can be attributed to incorporating the text vector while extracting vertex features. However, it can produce semantic noise because it uses a text vector containing attributes of multiple objects to process all vertex features. With an increasing number of objects, it also encounters challenges in comprehending text details and in aligning the text with the 3D objects, as shown in the 3rd row. Besides, we compare our TeMO with recent representative 3D stylization methods based on diffusion strategies [7, 21, 27], as shown in Fig. 10 of the supplementary materials. These methods still fail to generate stylized assets without mixed properties, which can also be attributed to the global semantic guidance of the target text during the stylization of multiple 3D objects. In contrast, our TeMO, equipped with 3D scene parsing and multi-grained supervision, can generate photorealistic stylized content for each object in these 3D scenes to conform to the descriptions in the text prompts.

Figure 5 (columns: Bare Mesh, Text2Mesh [22], TANGO [14], X-Mesh [20], Our TeMO; (a) Mesh: two dragons, Text Prompt: "a fire dragon and an ice dragon"; (b) Mesh: vase & candle, Text Prompt: "a wood vase and a brick candle"; (c) Mesh: person & dragon & whale, Text Prompt: "a superman, a fire dragon, and an ice whale"). Visual comparisons of our TeMO with previous text-driven 3D stylization methods on several multi-object scenes, including two objects of the same or different categories, and three different objects. See the supplementary materials for more comparisons.

4.3. Quantitative Evaluation

Objective Metric. We adopt the CLIP score to objectively evaluate the semantic alignment achieved by our TeMO and recent 3D stylization methods. Specifically, 8 views spaced 45° apart around the stylized meshes are chosen to obtain the rendered 2D images. Then, the visual objects are compared with the textual objects in CLIP's embedding space via the cosine function. As shown in the 2nd column of Tab. 1, our TeMO surpasses previous methods by a large margin. These results demonstrate the superiority of our TeMO over existing methods on multi-object stylization. A sketch of this evaluation protocol is given below.
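A minimal sketch of this CLIP-score protocol with the official OpenAI CLIP package could look as follows; the rendered views are assumed to be provided as PIL images, and averaging the per-view cosine similarities is our assumption about how the single score is aggregated.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def clip_score(rendered_views, prompt):
    # rendered_views: 8 PIL images rendered at 45-degree intervals around the stylized mesh.
    images = torch.stack([preprocess(v) for v in rendered_views]).to(device)
    img_feats = model.encode_image(images).float()
    txt_feat = model.encode_text(clip.tokenize([prompt]).to(device)).float()
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feats @ txt_feat.t()).mean().item()   # assumed: mean cosine similarity over the views
```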
User Study. We further conduct a user study to subjectively evaluate these 3D stylization methods. We randomly select 10 mesh-text pairs and recruit 60 users to evaluate the quality of the stylization assets generated by our TeMO and the previous methods. The participants include experts in the field and individuals without specific background knowledge. Each of them is asked three questions [22]: (Q1) "How natural are the output results?" (Q2) "How well does the output match the original content?" (Q3) "How well does the output match the target style?", and then assigns a score (1-5). We report the mean opinion score for each factor averaged across all style outputs. As shown in Tab. 1, our TeMO outperforms the other methods across all questions. Therefore, the 3D assets generated by our method are more in line with people's understanding of the text prompts.

Table 1. Quantitative comparisons of our TeMO and previous text-driven 3D stylization methods in multi-object scenes, including an objective alignment score (0-1) and three subjective opinion scores (1-5). Higher is better for all metrics.

Method           Alignment  User-Q1  User-Q2  User-Q3
Text2Mesh [22]   0.262      1.750    1.506    1.472
TANGO [14]       0.274      2.406    2.450    2.539
X-Mesh [20]      0.265      1.839    1.722    1.761
Our TeMO         0.285      3.344    3.311    3.261

4.4. Ablation Studies

To verify the effectiveness of the proposed designs in our TeMO, we conduct ablation studies by gradually adding them to our baseline model, i.e., TANGO [14]. We choose the two-dragon mesh with the text prompt "a fire dragon and an ice dragon", and the experimental results are shown in Fig. 6. Compared to the baseline model, introducing our DGA module enables the model to distinguish the two dragons, yet it falls short in endowing them with precise texture details. Meanwhile, incorporating our CGC loss helps the model capture more semantic details, e.g., "fire" and "ice", but it fails to distinguish the two objects. Notably, the model equipped with both designs not only accurately distinguishes between the two objects but also synthesizes high-quality texture details for them. These experiments indicate that our DGA module and CGC loss can effectively assist the model in generating the desired stylized content for multiple 3D objects to conform to the target text.

Figure 6 (columns: Input Mesh, Baseline, Baseline + DGA module, Baseline + CGC loss, Our TeMO). Ablation experiments on the proposed designs of our TeMO. Mesh: two dragons; Text Prompt: "a fire dragon and an ice dragon".

5. Limitation and Future Work

Despite achieving excellent results on text-driven multi-object stylization, our TeMO framework still has a few limitations, which can also facilitate future research: 1) 3D Symmetry Prior. As stated in Sec. 4.2, our TeMO does not incorporate the 3D symmetry prior, whose important role in promoting the style consistency of a single object has been demonstrated by Text2Mesh [22]. To generate more photorealistic stylization assets for multi-object scenes, it would be valuable to calculate a symmetry plane for each object and apply the symmetry prior per object. 2) Diffusion Model. We observe that current diffusion technologies struggle to generate multi-object images according to the text prompt, which hinders the application of diffusion-based stylization methods in multi-object 3D scenes. We argue it would be interesting to extend the concept of scene parsing to diffusion models to release their potential in multi-object editing or generation.

6. Conclusion

In this paper, we present TeMO, an innovative framework that introduces scene parsing and multi-grained cross-modal supervision to achieve text-driven multi-object 3D stylization for the first time. Specifically, we first develop a DGA module to precisely align the objects in the 3D mesh and the text prompt and to enhance the 3D point features with the word features belonging to the same objects. Then, we design a CGC loss, in which a fine-grained loss at the local level and a coarse-grained contrast loss at the global level are constructed to complement each other. Further, extensive experiments demonstrate the effectiveness and superiority of our method over existing methods across a wide range of multi-object 3D scenes. We believe it is promising to achieve simultaneous content editing of multiple objects in 3D scenes, and we hope the scene-parsing perspective provided by the proposed TeMO framework will inspire future works.

References
position. NeurIPS, 35:30923–30936, 2022. 1, 2, 3, 4, 5, 6, 7, 8
[15] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021. 2, 3
[16] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 5
[17] Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, Chia-Wen Lin, and Rongrong Ji. Dual-level collaborative transformer for image captioning. In AAAI, pages 2286–2293, 2021. 2
[18] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015. 2
[19] Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, and Rongrong Ji. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In ACM MM, pages 638–647, 2022. 2, 5
[20] Yiwei Ma, Xiaioqing Zhang, Xiaoshuai Sun, Jiayi Ji, Haowei Wang, Guannan Jiang, Weilin Zhuang, and Rongrong Ji. X-mesh: Towards fast and accurate text-driven 3d stylization via dynamic textual guidance. arXiv preprint arXiv:2303.15764, 2023. 1, 2, 3, 4, 5, 6, 7, 8
[21] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. In CVPR, pages 12663–12673, 2023. 2, 7, 1, 3
[22] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven neural stylization for meshes. In CVPR, pages 13492–13502, 2022. 1, 2, 5, 6, 7, 8
[23] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. Clip-mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia, pages 1–8, 2022. 2
[24] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
2 [25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021. 1, 2, 3, 5 [26] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 2 [27] Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. Texture: Text-guided texturing of 3d shapes. arXiv preprint arXiv:2302.01721, 2023. 2, 7, 1, 3 [28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684– 10695, 2022. 2, 3 [29] Scott D Roth. Ray casting for modeling solids. Computer graphics and image processing, 18(2):109–144, 1982. 3 [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. 2 [2] Edward Loper Bird, Steven and Ewan Klein. Natural language processing with python. O’Reilly Media Inc, 2009. 4 [3] Mark Boss, Raphael Braun, Varun Jampani, Jonathan T Barron, Ce Liu, and Hendrik Lensch. Nerd: Neural reflectance decomposition from image collections. In ICCV, pages 12684–12694, 2021. 3 [4] Arthur Caetano and Misha Sra. Arfy: A pipeline for adapting 3d scenes to augmented reality. In Adjunct Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1–3, 2022. 1 [5] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015. 5 [6] Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. Text2tex: Text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396, 2023. 2 [7] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873, 2023. 1, 2, 7, 3 [8] Shaoyu Chen, Budmonde Duinkharjav, Xin Sun, Li-Yi Wei, Stefano Petrangeli, Jose Echevarria, Claudio Silva, and Qi Sun. Instant reality: Gaze-contingent perceptual optimization for 3d virtual reality streaming. IEEE TVCG, 28(5): 2157–2167, 2022. 1 [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 2 [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3 [11] Kevin Frans, Lisa Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35:5207–5218, 2022. 5 [12] Mengdi Han, Xiaogang Guo, Xuexian Chen, Cunman Liang, Hangbo Zhao, Qihui Zhang, Wubin Bai, Fan Zhang, Heming Wei, Changsheng Wu, et al. Submillimeter-scale multimaterial terrestrial robots. Science Robotics, 7(66):eabn0602, 2022. 1 [13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. 
In CVPR, pages 7132–7141, 2018. 2 [14] Jiabao Lei, Yabin Zhang, Kui Jia, et al. Tango: Text-driven photorealistic and robust 3d stylization via lighting decom- 9 [45] Kai Zhang, Nick Kolkin, Sai Bi, Fujun Luan, Zexiang Xu, Eli Shechtman, and Noah Snavely. Arf: Artistic radiance fields. In ECCV, pages 717–733. Springer, 2022. 1 [46] Xiuming Zhang, Pratul P Srinivasan, Boyang Deng, Paul Debevec, William T Freeman, and Jonathan T Barron. Nerfactor: Neural factorization of shape and reflectance under an unknown illumination. ACM TOG, 40(6):1–18, 2021. 3 [47] Xuying Zhang, Xiaoshuai Sun, Yunpeng Luo, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Feiyue Huang, and Rongrong Ji. Rstnet: Captioning with adaptive attention on visual and non-visual words. In CVPR, pages 15465–15474, 2021. 2 [48] Xuying Zhang, Bowen Yin, Zheng Lin, Qibin Hou, DengPing Fan, and Ming-Ming Cheng. Referring camouflaged object detection. arXiv preprint arXiv:2306.07532, 2023. 2 [49] Qingnan Zhou and Alec Jacobson. Thingi10k: A dataset of 10,000 3d-printing models. arXiv preprint arXiv:1605.04797, 2016. 5 [50] Zoran Zivkovic. Improved adaptive gaussian mixture model for background subtraction. In ICPR, pages 28–31. IEEE, 2004. 4 [30] Oana Sidi, Oliver Van Kaick, Yanir Kleiman, Hao Zhang, and Daniel Cohen-Or. Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. In SIGGRAPH Asia, pages 1–10, 2011. 5 [31] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 33:7537–7547, 2020. 4 [32] TurboSquid. Turbosquid 3d model repository. In https://www.turbosquid.com/, 2021. 5 [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017. 2 [34] Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In CVPR, pages 3835–3844, 2022. 1 [35] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794– 7803, 2018. 3 [36] Mingrui Wu, Xuying Zhang, Xiaoshuai Sun, Yiyi Zhou, Chao Chen, Jiaxin Gu, Xing Sun, and Rongrong Ji. Difnet: Boosting visual information flow for image captioning. In CVPR, pages 18020–18029, 2022. 2 [37] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–1920, 2015. 5 [38] Jianwei Yang, Yonatan Bisk, and Jianfeng Gao. Taco: Token-aware cascade contrastive learning for video-text alignment. In ICCV, pages 11562–11572, 2021. 3 [39] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021. 3 [40] Bowen Yin, Xuying Zhang, Qibin Hou, Bo-Yuan Sun, DengPing Fan, and Luc Van Gool. Camoformer: Masked separable attention for camouflaged object detection. arXiv preprint arXiv:2212.06570, 2022. 2 [41] Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, and Qibin Hou. Dformer: Rethinking rgbd representation learning for semantic segmentation. arXiv preprint arXiv:2309.09668, 2023. 2 [42] Kangxue Yin, Jun Gao, Maria Shugrina, Sameh Khamis, and Sanja Fidler. 
3dstylenet: Creating 3d shapes with geometric and texture style variations. In ICCV, pages 12456–12465, 2021. 1 [43] Bo Zhang, Lizbeth Goodman, and Xiaoqing Gu. Novel 3d contextual interactive games on a gamified virtual environment support cultural learning through collaboration among intercultural students. SAGE Open, 12(2): 21582440221096141, 2022. 1 [44] Kai Zhang, Fujun Luan, Qianqian Wang, Kavita Bala, and Noah Snavely. Physg: Inverse rendering with spherical gaussians for physics-based material editing and relighting. In CVPR, pages 5453–5462, 2021. 3 10 TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes Supplementary Material 7. Neural Stylization and Controls In this section, we provide more details on the neural stylization and controls of our TeMO method. We first render several stylized multi-object 3D assets of our TeMO from 4 views around them. As shown in Fig. 7, the rendered images exhibit natural variation in texture and a high degree of consistency across different viewpoints. Take the cathorse mesh coupled with a text prompt “a Garfield cat and a brown horse” as an example, our TeMO not only synthesize “Garfield” and “brown” property for the cat and horse separately but also generate visually plausible 3D content in different angles of view. Then, we report more stylized results generated by our TeMO, in which each multiobject mesh is stylized according to several text prompts. As shown in Fig. 8, our TeMO is able to synthesize stylized content faithful to different text prompts for the given mesh, which proves the robustness of our method for a variety of multi-object 3D scenes. Furthermore, we also give details of the bare mesh and stylized 3D assets by zooming in their local regions. As shown in Fig. 9, our stylization results can capture both global semantics and part-aware details, conforming to the text prompts. These experimental results indicate our TeMO is able to stylize the entire mesh in an object-consistent manner and flexibly generate results with accurate details as well as high fidelity. (a) a fire dragon and an ice dragon (b) a Garfield cat and a brown horse 8. Qualitative Evaluation Comparison with Diffusion-Based Methods. In this part, we compare the proposed TeMO with recent representative 3D stylization methods based on diffusion strategies, including Latent-NeRF [21] (CVPR 2023), TEXTure [27] (SIGGRAPH 2023), and Fantasia3D [7] (ICCV 2023). As shown in Fig. 10, the existing diffusion-based methods are prone to interference between different properties of the objects for a scene with multiple objects of the same/different categories. For the two-dragon mesh with a text prompt “a fire dragon and an ice dragon”, these methods pay more attention to the “fire” or “ice” property in each object. As far as the 3D scenes containing two or more different objects, these methods tend to focus on a certain property or mix multiple different properties together. Differently, our TeMO equipped with 3D scene parsing and multi-grained supervision is able to accurately synthesize the desired stylized content for each object in all multi-object scenes. (c) a superman, an ice whale, and a fire dragon Figure 7. Given several multi-object meshes, our TeMO stylizes entire 3D content on them to adhere to the text prompts. based methods fail to stylize multi-object scenes without misunderstanding various properties. 
Note that the existing diffusion-based methods utilize the priors in the pre-trained 2D text-to-image diffusion model by performing inference, on which basis the optimization of the 3D representation is achieved via differentiable rendering. This raises a straightforward question: can the pre-trained diffusion model generate accurate representations and images for textual descriptions containing multiple objects? Normally, the diffusion model employs the text encoder of the CLIP model [25] to extract global semantic features of the text prompt to guide image generation. As discussed in Sec. 1 of the main text, it is difficult for the CLIP model to encode a text description containing multiple objects into a single global semantic representation. We argue that this issue inevitably creates obstacles for diffusion methods in generating multi-object 2D scenes. The poor multi-object results produced by current cutting-edge text-to-image diffusion methods such as Stable Diffusion [28], shown in Fig. 11, support this hypothesis (a minimal probe of this hypothesis is sketched at the end of this supplement). To address this issue, a foreseeable solution is to extend the concept of scene parsing proposed in this paper to the diffusion model. We hope this perspective can inspire future work on the content editing of 2D/3D multi-object scenes.

Figure 8 (person & dragon: "an iron man and an ice dragon", "a superman and a fire dragon", "an astronaut and a gold dragon", "a Yeti and a green dragon"; cat & horse: "a Garfield cat and a brown horse", "a ginger cat and an astronaut horse", "an embroidered cat and a horse with spotted fur", "a silver cat and a gold dragon"). Given the same bare mesh, our TeMO method is able to produce stylized contents of high fidelity and various types for multi-object scenes to conform to the different text prompts.

Figure 9 (vase & candle: "a wicker vase and a candle in jeans", "a cactus vase and a silver candle", "a wood vase and a brick candle", "a chainmail vase and a gold candle"). TeMO produces accurate and photorealistic details over a variety of multi-object scenes, driven by a series of text prompts. The local stylization results in the red rectangle regions are zoomed in for better viewing.

Figure 10 (columns: Bare Mesh, Latent-NeRF [21], TEXTure [27], Fantasia3D [7], Our TeMO; rows: two dragons, "a fire dragon and an ice dragon"; vase & candle, "a wood vase and a brick candle"; person & dragon & whale, "a superman, a fire dragon, and an ice whale"). Visual comparisons of our TeMO with recent representative 3D stylization methods based on diffusion strategies on several multi-object scenes, including two objects of the same or different categories, and three different objects.

Figure 11 ((a) "a black cat and a white cat"; (b) "a white cat and a brown dog"). Examples of multi-object 2D scenes generated by the cutting-edge text-to-image method, i.e., Stable Diffusion [28]. For a scene with multiple objects of the same category or different categories, the diffusion model is prone to interference between different properties of the objects.
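As a rough, hedged way to probe this hypothesis, one can compare the CLIP text embedding of a composite multi-object prompt with the embeddings of its single-object parts; the snippet below (using the official OpenAI CLIP package) only shows the measurement and makes no claim about the resulting numbers.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

prompts = ["a black cat and a white cat", "a black cat", "a white cat"]
with torch.no_grad():
    feats = model.encode_text(clip.tokenize(prompts).to(device)).float()
feats = feats / feats.norm(dim=-1, keepdim=True)

# Cosine similarities between the composite prompt and each single-object prompt.
# If the global embedding blends the two objects, the two values are hard to separate,
# which would be consistent with the mixed-attribute generations in Fig. 11.
print((feats[0] @ feats[1]).item(), (feats[0] @ feats[2]).item())
```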