TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes

Xuying Zhang  Bo-Wen Yin  Yuming Chen  Zheng Lin  Yunheng Li  Qibin Hou*  Ming-Ming Cheng
VCIP, School of Computer Science, Nankai University
* Qibin Hou is the corresponding author.
arXiv:2312.04248v1 [cs.CV] 7 Dec 2023

Abstract

Recent progress in the text-driven 3D stylization of a single object has been considerably promoted by CLIP-based methods. However, the stylization of multi-object 3D scenes is still impeded in that the image-text pairs used for pre-training CLIP mostly consist of a single object. Meanwhile, the local details of multiple objects may be susceptible to omission because the existing supervision manner relies primarily on coarse-grained contrast of image-text pairs. To overcome these challenges, we present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles under contrast supervision at multiple levels. We first propose a Decoupled Graph Attention (DGA) module to distinguishably reinforce the features of 3D surface points. In particular, a cross-modal graph is constructed to accurately align the object points and the noun phrases decoupled from the 3D mesh and the textual description, respectively. Then, we develop a Cross-Grained Contrast (CGC) supervision system, in which a fine-grained loss between the words in the textual description and the randomly rendered images is constructed to complement the coarse-grained loss. Extensive experiments show that our method can synthesize high-quality stylized content and outperforms the existing methods over a wide range of multi-object 3D meshes. Our code and results will be made publicly available.

Figure 1 (columns: Bare Mesh, TANGO, Our TeMO; rows: two dragons, "a fire dragon and an ice dragon"; person & dragon, "an iron man and an ice dragon"). Visual comparisons between the existing 3D stylization methods (e.g., TANGO [14]) and our TeMO in multi-object scenes. For a scene with multiple objects of the same/different categories, existing methods are prone to interference between different properties of the objects, while our TeMO is able to accurately synthesize the desired stylized content for each object.

1. Introduction

3D asset creation through stylization aims to synthesize stylized content on bare meshes to conform to given text descriptions [14, 22], reference images [34, 45], or 3D shapes [42]. This research plays an important role in a wide spectrum of applications, e.g., virtual/augmented reality [4, 8], the gaming industry [43], and robotics [12], and it has attracted increasing attention in the computer vision and graphics communities. Considering the ready availability and expressiveness of text prompts as well as the popularity of the large-scale Contrastive Language-Image Pre-training (CLIP) [25] model, we choose to work with text-driven 3D stylization.

Recent years have witnessed the emergence of a series of impressive works [7, 14, 20, 22] that drive the advancement of text-driven 3D stylization. Existing methods usually adopt multi-layer perceptrons (MLPs) to predict the location attribute displacements of the bare mesh under the supervision of the contrastive loss in CLIP. We observe that these works focus on the stylization of a single 3D object and perform poorly on multiple objects, as shown in the second column of Fig. 1. We argue that two inherent characteristics of CLIP cause this issue: i) CLIP is mainly pre-trained with image-text pairs that mostly contain a single object; ii) the CLIP loss employs global representation vectors of images and text to coarsely match the two modalities, which inevitably causes the loss of local details. Consequently, the key to synthesizing the desired styles for multiple 3D objects lies in parsing such 3D scenes and in multi-grained supervision for detail refinement.

To simultaneously generate stylized content for multiple 3D objects, the primary step is to achieve accurate alignment between the objects in the 3D mesh and those in the target text. However, existing methods employ the global semantics of the text to stylize a single object, which inevitably produces noise when stylizing the objects in multi-object scenes. To overcome this challenge, we propose to parse the 3D scene by introducing a Decoupled Graph Attention (DGA) module. Specifically, all noun phrases are decoupled from the text prompt, and the mesh surface points of the current view are likewise divided into several clusters. Then, a cross-modal graph is constructed to connect the noun phrases to their corresponding object points while distancing them from irrelevant ones. This graph enables accurate interaction between the two interrelated modalities. Finally, the surface point features of the 3D objects are reinforced by independent cross-attention fusions with their neighboring word nodes in the graph.

Furthermore, we design a Cross-Grained Contrast (CGC) loss to perform comprehensive cross-modal supervision for the stylization of multiple objects. The goal is to guide the network to generate more stylization details for multiple 3D objects to match the target text. Our loss consists of two parts, i.e., a coarse-grained contrast and a fine-grained contrast. In the former, the text prompt is regarded as sentence-level supervision, and the similarity between the 2D views rendered from the stylized 3D mesh and the text prompt is calculated with the global feature vectors from the CLIP model. In the latter, we view the text prompt at the word level and consider the similarities between each word of the sentence and the rendered images of the view set. Specifically, we produce the word representations of the text prompt by taking the hidden states from the text encoder of CLIP. Motivated by recent progress in video-text retrieval [19], we calculate the fine-grained loss via a weighted summation of the elements in the similarity vectors based on the importance of each word or image.

Based on the well-designed DGA module and CGC loss, we propose a novel framework towards text-driven 3D stylization for multi-object meshes, called TeMO. To validate the effectiveness of our TeMO, extensive experiments are conducted on various multi-object 3D scenes, as shown in the 3rd column of Fig. 1. The experimental results demonstrate that our TeMO is less susceptible to interference from multiple objects and can generate superior stylized assets compared with the existing 3D stylization methods. Our contributions can be summarized as follows:
• We present a new 3D stylization framework, called TeMO. To the best of our knowledge, it is the first attempt to parse the objects in the text and 3D meshes and generate stylizations for multi-object scenes.
• We propose a Decoupled Graph Attention (DGA) module, which constructs a graph structure to align the surface points in the multi-object mesh with the noun phrases in the text prompt.
• We design a Cross-Grained Contrast (CGC) loss, in which the text is contrasted with the rendered images at both the sentence and word levels.

2. Related Work

2.1. Text-Driven 3D Manipulation

Generating or editing 3D content according to a given prompt is a long-standing objective in computer vision and graphics. Among all forms of prompts, text has garnered the most conspicuous attention for three reasons: i) text descriptions are readily accessible from existing corpora; ii) text descriptions are particularly user-friendly, since they are easily modifiable and can effectively express complex concepts related to stylization; iii) the popularity of the large-scale CLIP [25] model has made visual-language supervision practical. Text2Mesh [22] proposes a neural style field network to predict the color and displacement of mesh vertices. TANGO [14] proposes to disentangle the appearance style into the spatially varying bidirectional reflectance distribution, the local geometric variation, and the lighting condition. Then, X-Mesh [20] integrates the target text guidance by utilizing text-relevant spatial and channel-wise attention during vertex feature extraction. Motivated by the remarkable progress in text-driven 2D generation [26, 28], TEXTure [27] and Text2Tex [6] incorporate a pre-trained depth-aware image diffusion model to progressively synthesize high-resolution partial textures from multiple viewpoints. To make full use of the priors in the pre-trained 2D text-to-image diffusion model, DreamFusion [24] introduces a Score Distillation Sampling (SDS) loss to perform text-to-3D synthesis. With the help of the SDS loss, Latent-NeRF [21] and Fantasia3D [7] can generate 3D shapes and appearances for 3D objects. Despite achieving impressive results, these methods focus on the stylization of a single 3D object and rarely explore multi-object scenes. CLIP-Mesh [23] attempts to generate multiple 3D objects for a target text; nevertheless, the resulting content is not satisfactory. In this paper, we parse the objects described in the rendered images and the text prompts and align them with two well-designed strategies.

2.2. Attention Mechanism

The concept of the attention mechanism was initially introduced in neural machine translation [1], where a weighted summation of candidate vectors is calculated according to their importance scores. This technique has been extended to a myriad of tasks, e.g., natural language processing [9, 18, 33], computer vision [13, 15, 40, 48], and multi-modal learning [17, 36, 41, 47]. For instance, Transformer [33] employs self-attention to establish connections between words within a sentence and utilizes cross-attention to align source and target sentences. The non-local network [35] takes the lead in introducing self-attention to computer vision and achieves great success in video understanding and object detection. ViT [10] treats an image as a sequence of patches and employs a Transformer encoder based on self-attention to perform image classification. Swin Transformer [15] introduces shifted windows to enhance the local perception ability of self-attention. More recently, X-Mesh [20] designs a text-guided dynamic attention mechanism for the vertex feature extraction of a 3D object. However, this guidance relies only on a single text feature vector and does not consider the parsing of the text and the 3D scene. In this paper, the multiple objects decoupled from the target text and the 3D mesh are aligned via a cross-modal graph to achieve precise guidance.
2.3. Multi-modal Contrastive Learning

Contrastive learning has become an increasingly popular research topic in the multi-modal community due to its ability to align representations of different modalities. Based on this strategy, CLIP [25] is pre-trained on an abundance of image-text pairs and achieves great success in cross-modal supervision. TACo [38] presents a token-aware cascade contrastive learning scheme based on the syntactic classes of words to achieve fine-grained semantic alignment in text-video retrieval. Concurrently, FILIP [39] proposes comparing image patches with the words in the sentence. For text-driven 3D stylization, the CLIP loss, which computes the similarity between the image and text vectors in the embedding space of CLIP, is adopted by the vast majority of methods. Although achieving impressive results when stylizing a single object, these methods cannot be well adapted to scenes with multiple 3D objects. We argue that an important reason for this issue is the loss of local details caused by such coarse-grained supervision. In this paper, we propose a cross-grained supervision strategy, which combines fine-grained and coarse-grained contrasts to achieve a more precise semantic alignment between the rendered images and the text.

Figure 2 (prompt: "a superman, a fire dragon, and an ice whale"; pipeline: mesh → ray casting → DGA → diffuse/roughness/specular/normal prediction → rendering with spherical-Gaussian lighting; text prompt → text encoder → DGA; rendered views and text compared by the CGC loss). The overall architecture of the proposed TeMO framework. We first specify several cameras to cast rays toward the objects in the 3D mesh scene. Then, a surface point x_p and a normal n_p are attained from each ray that intersects the objects. These points and normals are fed to the attribute prediction network, where the features of the 3D objects are parsed and interact with the decoupled text features via our proposed DGA module. Meanwhile, we employ a series of spherical Gaussians to represent the lighting. Finally, a differentiable SG renderer is adopted to render images, which are contrasted with the text prompt by our designed CGC loss.

3. Methodology

3.1. Overall Architecture

Fig. 2 shows the end-to-end architecture of our TeMO framework. Given a bare mesh and a text prompt containing multiple objects, TeMO aims to synthesize stylization on the mesh to match the text description. We employ a set of vertices V ∈ R^{e×3} and faces F ∈ {1, ..., e}^{u×3} to explicitly define the input triangle mesh, which is fixed throughout training. Following TANGO [14], we disentangle the appearance style into the spatially varying bidirectional reflectance distribution function [3, 44, 46] (including diffuse, roughness, and specular terms), the local geometric variation (normal map), and the lighting condition. We start by normalizing the vertex coordinates to lie inside a unit sphere. Then, we randomly sample points around the mesh from a Gaussian distribution as camera positions to render images. Next, we obtain a camera ray R_p = {c + t\nu_p} from the sampled camera position c and a pixel p in the rendered image, where \nu_p is the direction of the ray. Further, ray casting [29] is used to find the first intersection point and intersection face between the ray and the mesh, and the normal n_p ∈ R^3 of the intersection face is employed as the surface normal at the point x_p ∈ R^3.

To achieve multiview-consistent features, our TeMO is restricted to predicting the normal displacement as a function of the location only, while allowing the color materials to be predicted as a function of both location and viewing direction. Therefore, our TeMO, represented as MLPs, includes two branches, i.e., a normal branch f_n(·) and a reflectance branch f_r(·). Specifically, the former predicts the normal offset at the point x_p, and the latter predicts the surface reflectance coefficients of the material at the location x_p, i.e., diffuse, roughness, and specular. To synthesize high-frequency details, we also apply Fourier positional encoding [31] to every input. In addition, a spherical Gaussian is employed to represent each light intensity L_i(·) due to its closed-form nature and analytical solution. Based on the attained geometric and appearance components, each pixel color in the rendered image can be calculated by a hemisphere renderer [14]:

L_p(\nu_p, x_p, n_p) = \int_{\Omega} L_i(w_i) \, f_r(\nu_p, w_i, x_p) \, (w_i \cdot \hat{n}_p) \, dw_i,   (1)

\hat{n}_p = n_p + f_n(x_p, n_p),   (2)

where \Omega = \{w_i : w_i \cdot \hat{n}_p \ge 0\} denotes the hemisphere, w_i is the incident light direction, and \hat{n}_p is the estimated normal at the surface point x_p.
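To make the two-branch design above concrete, the following is a minimal PyTorch-style sketch of such a neural style field with 256-dimensional hidden layers (as later specified in the implementation details). The module structure, the bounded normal offset, and the channel split of the reflectance head are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

def fourier_encode(x, num_freqs=6):
    # Fourier positional encoding [31]: append sin/cos of the input at several frequencies.
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32, device=x.device)) * torch.pi
    angles = x[..., None] * freqs                        # (..., 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return torch.cat([x, enc.flatten(-2)], dim=-1)       # (..., 3 + 6 * num_freqs)

def mlp(in_dim, out_dim, hidden=256, depth=3):
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class StyleField(nn.Module):
    """Sketch of the two-branch attribute prediction network:
    f_n predicts the normal offset of Eq. (2); f_r predicts reflectance coefficients."""
    def __init__(self, num_freqs=6):
        super().__init__()
        in_dim = 3 + 6 * num_freqs
        self.normal_branch = mlp(in_dim, 3)               # f_n: location -> normal offset
        self.reflectance_branch = mlp(in_dim, 3 + 1 + 3)  # f_r: assumed split diffuse(3)/roughness(1)/specular(3)

    def forward(self, x_p):
        h = fourier_encode(x_p)
        normal_offset = 0.1 * torch.tanh(self.normal_branch(h))   # small bounded displacement (assumption)
        refl = torch.sigmoid(self.reflectance_branch(h))
        diffuse, roughness, specular = refl.split([3, 1, 3], dim=-1)
        return normal_offset, diffuse, roughness, specular
```

In the full model, the reflectance branch would additionally be conditioned on the viewing direction, and the point features would first pass through the DGA module of Sec. 3.2; both are omitted here for brevity.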
3.2. Decoupled Graph Attention

To achieve text-driven stylization for multiple 3D objects, the key issue to be solved is the accurate alignment between the objects described in the text and those in the mesh. X-Mesh [20] has incorporated text-guided dynamic linear layers, in which the global representation vector of the target object in the text is utilized as guidance to acquire text-aware vertex features. Nevertheless, this global vector contains information about multiple objects and is thus prone to mutual interference, producing semantic noise when guiding multi-object scenes. To address this challenge, we propose to parse the objects in both the text and the mesh. We first extract the noun phrases modified by adjectives or prepositional phrases from the text using the NLTK toolkit [2]. Then, we employ a Gaussian Mixture Model (GMM) [50] to cluster the intersection point set {x_1, ..., x_p, ...} between the rays of the current view and the mesh. Meanwhile, we obtain a binary map of the objects in the current view based on whether each ray intersects the mesh. Further, we decouple the objects in the binary map according to the clustered points and acquire several binary maps of individual objects. Based on the disentangled noun phrases and the per-object binary maps, we match the correct pairs by their semantic similarities. As a result, the objects described in the text are aligned with their corresponding objects in the mesh, and the aligned pairs are utilized to construct a cross-modal graph G = (V, E), as shown in Fig. 3. To be specific, all surface point features and word features are treated as independent nodes to form the node set V. For the edge set E, a link between a surface point node and a word node is built only if the semantic objects they belong to are the same. A rough sketch of this parsing and graph-construction step is given below Fig. 3.

Figure 3 (text prompt: "a blue whale, an ice dragon with red eyes, and an iron man"; the surface points {x_1, ..., x_p, ..., x_l} form location nodes, and the words w_ij of the decoupled noun phrases form word nodes). Construction pipeline of the cross-modal graph architecture in our DGA module. Note that x_p, a surface point of the 3D objects, and w_ij, the j-th word in the i-th noun phrase, are connected only if they correspond to the same object.
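As a rough illustration of this parsing step, the snippet below extracts adjective-modified noun phrases with NLTK, clusters the ray-mesh intersection points with a GMM, and builds the point-word adjacency used by the cross-modal graph. The chunk grammar, the fixed number of mixture components, and the word-to-object mapping are assumptions made for illustration; the paper does not specify these details.

```python
import nltk                      # requires the punkt and averaged_perceptron_tagger data
import numpy as np
from sklearn.mixture import GaussianMixture

def extract_noun_phrases(prompt):
    # Chunk determiner/adjective/noun sequences, e.g. "an ice dragon" -> "ice dragon".
    tagged = nltk.pos_tag(nltk.word_tokenize(prompt))
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"                   # assumed chunk grammar
    tree = nltk.RegexpParser(grammar).parse(tagged)
    phrases = []
    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        words = [w for w, _ in subtree.leaves() if w.lower() not in {"a", "an", "the"}]
        phrases.append(" ".join(words))
    return phrases

def cluster_surface_points(points_xyz, num_objects):
    # GMM clustering [50] of the intersection points into per-object groups.
    gmm = GaussianMixture(n_components=num_objects, covariance_type="full", random_state=0)
    return gmm.fit_predict(points_xyz)                    # (P,) object id per surface point

def build_adjacency(point_labels, word_to_object):
    # Edge (p, w) exists only when point p and word w belong to the same object (Fig. 3).
    point_labels = np.asarray(point_labels)[:, None]      # (P, 1)
    word_to_object = np.asarray(word_to_object)[None, :]  # (1, W)
    return point_labels == word_to_object                 # (P, W) boolean adjacency

phrases = extract_noun_phrases("a blue whale, an ice dragon with red eyes, and an iron man")
labels = cluster_surface_points(np.random.randn(2048, 3), num_objects=3)   # placeholder points
# word_to_object stands in for the result of matching noun phrases to point clusters
# (e.g., via per-object binary maps and semantic similarity); the values here are hypothetical.
adj = build_adjacency(labels, word_to_object=[0, 0, 1, 1, 1, 1, 2, 2])
```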
Under the setting of this cross-modal graph, we can individually perform cross-attention between the surface point nodes and their neighboring word nodes, where the parsed surface point features are used as queries and the parsed text features serve as keys and values. The enhancement of a surface point node v_i ∈ R^{dim} can be formulated as:

\hat{v}_i = \sum_{v_j \in Adj(v_i)} \alpha_{ij} \, \mathrm{Linear}(v_j),   (3)

\alpha_{ij} = \frac{e^{W_{ij}}}{\sum_{v_k \in Adj(v_i)} e^{W_{ik}}},   (4)

W_{ij} = \frac{\mathrm{Linear}(v_i) \, \mathrm{Linear}(v_j)^{T}}{\sqrt{d_l}},   (5)

where Adj(v_i) denotes the adjacent nodes of v_i and Linear(·) represents a linear transformation. With this attention mechanism, the surface point features of different objects in the mesh can be distinguishably reinforced under the guidance of the word features in the parsed text.
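A compact way to realize Eqs. (3)-(5) is masked single-head cross-attention, where the adjacency matrix of the cross-modal graph masks out words that belong to other objects. The sketch below is an assumed minimal implementation (single head, no residual), not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledGraphAttention(nn.Module):
    """Sketch of Eqs. (3)-(5): cross-attention from surface-point nodes to their
    neighboring word nodes, restricted by the graph adjacency."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # Linear(v_i) in Eq. (5)
        self.k = nn.Linear(dim, dim)   # Linear(v_j) in Eq. (5)
        self.v = nn.Linear(dim, dim)   # Linear(v_j) in Eq. (3)
        self.scale = dim ** -0.5       # 1 / sqrt(d_l)

    def forward(self, point_feats, word_feats, adj):
        # point_feats: (P, dim), word_feats: (W, dim), adj: (P, W) boolean edge matrix.
        # Assumes every point has at least one word neighbor; otherwise its softmax row is NaN.
        logits = self.q(point_feats) @ self.k(word_feats).t() * self.scale   # W_ij, Eq. (5)
        logits = logits.masked_fill(~adj, float("-inf"))                     # keep only Adj(v_i)
        alpha = F.softmax(logits, dim=-1)                                    # Eq. (4)
        return alpha @ self.v(word_feats)                                    # Eq. (3): enhanced point features
```

How the enhanced features are fused back into the original point features (e.g., through a residual connection) before the attribute heads is not spelled out in the text, so any such fusion should be treated as a design choice.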
3.3. Cross-Grained Contrast Supervision

To guide the optimization of the neural network for 3D stylization, the first step is to render the stylized 3D mesh from multiple 2D views. Most existing methods employ the visual encoder and text encoder of CLIP [25] to extract global feature vectors for the rendered images and the target text, respectively, which are contrasted to perform cross-modal supervision via cosine similarity:

L_{coarse} = -\frac{F_I \cdot F_T}{\|F_I\|_2 \, \|F_T\|_2},   (6)

where F_I ∈ R^{512} is the averaged feature vector of the images rendered from different views, F_T ∈ R^{512} denotes the global feature vector of the target text, and ∥·∥_2 represents the Euclidean norm.

Although achieving impressive results for stylizing a single 3D object, these methods still have limitations in multi-object scenes. Since a single feature vector must represent a sentence describing multiple objects, a large amount of object detail may be lost. Therefore, such coarse-grained contrast supervision is insufficient to guide the neural network in synthesizing photorealistic stylized content for multiple 3D objects. To solve this issue, we construct a fine-grained contrast supervision to complement the coarse-grained one. Specifically, we first calculate the correlation map S ∈ R^{n×m} between the word features of the text and the visual features of the rendered images, which are also extracted by the text encoder and visual encoder of CLIP:

S = \frac{I \cdot T^{T}}{\|I\|_2 \, \|T\|_2},   (7)

where I ∈ R^{n×512} represents the features of the images rendered from n views and T ∈ R^{m×512} denotes the features of the m words in the text. Then, we normalize the correlation matrix along the image axis and the text axis, respectively, to retrieve the text of interest and the visual components:

S_I(i) = \frac{1}{m} \sum_{k=1}^{m} S(i, k),   (8)

S_T(j) = \frac{1}{n} \sum_{k=1}^{n} S(k, j).   (9)

Inspired by [19], we further calculate an image-centered fine-grained contrast score and a text-centered fine-grained contrast score by a weighted summation of the similarity vectors:

L_I = \sum_{i=1}^{n} \frac{e^{S_I(i)}}{\sum_{k=1}^{n} e^{S_I(k)}} \, S_I(i),   (10)

L_T = \sum_{j=1}^{m} \frac{e^{S_T(j)}}{\sum_{k=1}^{m} e^{S_T(k)}} \, S_T(j),   (11)

where the weights are defined by the degree of correlation between the central modality and the other modality. Finally, we adopt the average of these two scores as the fine-grained contrast loss:

L_{fine} = -\frac{L_I + L_T}{2}.   (12)

The coarse-grained and fine-grained contrast supervisions complement each other to build a cross-grained contrast supervision system: the former aligns the global semantic information of the target text with the 3D objects, and the latter achieves local semantic alignment. The overall loss is defined as:

L_{cgcs} = \lambda_c L_{coarse} + \lambda_f L_{fine},   (13)

where \lambda_c and \lambda_f are two hyper-parameters that balance the coarse-grained and fine-grained losses, set to 1.0 and 0.33, respectively.
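For reference, the following is a minimal PyTorch sketch of Eqs. (6)-(13), assuming the per-view image features, the sentence feature, and the per-word features have already been extracted with the frozen CLIP encoders; the function names and feature shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def coarse_loss(img_feats, txt_feat):
    # Eq. (6): cosine distance between the view-averaged image feature and the sentence feature.
    f_i = F.normalize(img_feats.mean(dim=0), dim=-1)          # (512,)
    f_t = F.normalize(txt_feat, dim=-1)                       # (512,)
    return -(f_i * f_t).sum()

def fine_loss(img_feats, word_feats):
    # Eqs. (7)-(12): word/view correlation map and importance-weighted scores.
    I = F.normalize(img_feats, dim=-1)                        # (n, 512), one row per rendered view
    T = F.normalize(word_feats, dim=-1)                       # (m, 512), one row per word token
    S = I @ T.t()                                             # Eq. (7): (n, m) correlation map
    S_I = S.mean(dim=1)                                       # Eq. (8): per-view score
    S_T = S.mean(dim=0)                                       # Eq. (9): per-word score
    L_I = (F.softmax(S_I, dim=0) * S_I).sum()                 # Eq. (10)
    L_T = (F.softmax(S_T, dim=0) * S_T).sum()                 # Eq. (11)
    return -(L_I + L_T) / 2                                   # Eq. (12)

def cgc_loss(img_feats, txt_feat, word_feats, lambda_c=1.0, lambda_f=0.33):
    # Eq. (13): cross-grained contrast supervision.
    return lambda_c * coarse_loss(img_feats, txt_feat) + lambda_f * fine_loss(img_feats, word_feats)
```

Both terms are negated similarities, so minimizing L_cgcs pushes the rendered views toward the prompt at both the sentence and word levels.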
4. Experiments

4.1. Experiment Setup

Datasets. To examine our method across a diverse set of 3D scenes, we first collect 3D object meshes from a variety of sources, i.e., COSEG [30], Thingi10K [49], ShapeNet [5], TurboSquid [32], and ModelNet [37]. Then, we randomly place several objects from the collected set into one mesh using Blender. Note that we down-sample the number of vertices and faces of the meshes to ensure the robustness of our TeMO to low-quality meshes and to reduce the GPU burden during stylization. The meshes used in this paper contain an average of 79,303 faces, 16% non-manifold edges, 0.2% non-manifold vertices, and 12% boundaries.

Implementation Details. Following the TANGO [14] network, we adopt 3 linear layers with 256 dimensions to build the normal estimation branch. In the reflectance branch, the point features are extracted by 2 shared layers with 256 dimensions, followed by 3 exclusive heads to predict diffuse, specular, and roughness. The dimension of our DGA module is also set to 256. The word features in our DGA module are extracted from the text encoder of CLIP, and so are the ones used in our CGC loss. We choose ViT-B/32 as the backbone of the pre-trained CLIP model, which is consistent with previous works [14, 20, 22]. We also process the rendered images with 2D augmentation strategies [11, 14] before feeding them into the pre-trained CLIP model. Our TeMO model is optimized with the AdamW [16] optimizer for 1,500 iterations, where the learning rate is initialized to 5 × 10^{-4} and decayed by a factor of 0.7 every 500 iterations. The entire training process takes about 10 minutes on a single NVIDIA RTX 3090 GPU.
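The optimization schedule described above corresponds to a few lines of standard PyTorch; the sketch below wires up the assumed optimizer and step decay with a stand-in network and loss, since the full TeMO pipeline (renderer, CLIP encoders, CGC loss) is omitted here.

```python
import torch
import torch.nn as nn

# Stand-in for the TeMO networks, used only to illustrate the assumed training schedule.
model = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, 7))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.7)  # decay 0.7 every 500 iters

for iteration in range(1500):
    x = torch.randn(1024, 3)              # placeholder surface points
    loss = model(x).pow(2).mean()         # placeholder; in TeMO this would be the CGC loss of Eq. (13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```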
4.2. Qualitative Evaluation

We conduct visualization experiments on a wide spectrum of multi-object scenes to verify the effectiveness of our TeMO. However, we observe that the 3D symmetry prior widely used in previous works [14, 22] can cause interference between different parts during the stylization of multiple objects. We attribute this to the fact that the multiple objects in our meshes are randomly placed to simulate a real 3D scene rather than aligned along the z-axis. To avoid this issue, we remove this prior in our TeMO and in the previous methods involved in the comparison.

Neural Stylization and Controls. We present the stylization results of our TeMO driven by different text prompts for the same multi-object mesh in Fig. 4. As shown in the 1st row, where the 3D scene is composed of a person and a dragon, our TeMO can accurately distinguish between the person object and the dragon object and appropriately stylize their different body parts according to the semantic roles described in each text prompt. Meanwhile, our TeMO also synthesizes the desired stylizations for the 3D objects in the cat-horse mesh and the vase-candle mesh, as shown in the 2nd and 3rd rows. Moreover, Fig. 7, Fig. 8, and Fig. 9 of the supplementary material show that our TeMO stylizes the entire 3D scene and generates renderings with accurate details. These experimental results demonstrate that our TeMO method can generate photorealistic details with fine granularity while maintaining a global semantic understanding of the given multi-object 3D scene.

Figure 4 (rows: person & dragon, "an iron man and an ice dragon" / "a superman and a fire dragon"; cat & horse, "a Garfield cat and a brown horse" / "a ginger cat and an astronaut horse"; vase & candle, "a cactus vase and a silver candle" / "a wicker vase and a candle in jeans"). Given the same bare mesh, our TeMO produces stylized contents of various types for multi-object scenes to conform to the different text prompts. Please refer to the supplementary materials for a detailed version.

Qualitative Comparisons. We first provide visual comparisons between our TeMO and previous pioneering works in text-driven 3D stylization, including Text2Mesh [22], TANGO [14], and X-Mesh [20]. To ensure a fair comparison, we adopt the official implementations of these methods and train them with their default settings without the symmetry prior. The experimental results show that Text2Mesh [22] and TANGO [14] struggle to understand the detailed semantics of a text prompt with multiple objects. As shown in the 1st row of Fig. 5, where the 3D scene contains two objects of the same category, given the text prompt "a fire dragon and an ice dragon", they tend to capture the "ice" property while missing the "fire" property. For a 3D scene containing two objects of different categories, they are prone to mixing the properties of these objects, as shown in the 2nd row, where the text prompt is "a wood vase and a brick candle". Therefore, the stylized assets they generate for these multi-object scenes are unsatisfactory. X-Mesh generates more accurate results that align with the text prompts, as shown in the 1st and 2nd rows, which can be attributed to incorporating the text vector while extracting vertex features. However, it can produce semantic noise because it uses a text vector containing attributes of multiple objects to process all vertex features. With an increasing number of objects, it also encounters challenges in comprehending text details and in aligning the text with the 3D objects, as shown in the 3rd row. Besides, we compare our TeMO with recent representative 3D stylization methods based on diffusion strategies [7, 21, 27], as shown in Fig. 10 of the supplementary materials. These methods still fail to generate stylized assets without mixed properties, which can also be attributed to the global semantic guidance of the target text during the stylization of multiple 3D objects. In contrast, our TeMO, equipped with 3D scene parsing and multi-grained supervision, can generate photorealistic stylized content for each object in these 3D scenes to conform to the descriptions in the text prompts.

Figure 5 (columns: Bare Mesh, Text2Mesh [22], TANGO [14], X-Mesh [20], Our TeMO; (a) Mesh: two dragons, Text Prompt: "a fire dragon and an ice dragon"; (b) Mesh: vase & candle, Text Prompt: "a wood vase and a brick candle"; (c) Mesh: person & dragon & whale, Text Prompt: "a superman, a fire dragon, and an ice whale"). Visual comparisons of our TeMO with previous text-driven 3D stylization methods on several multi-object scenes, including two objects of the same or different categories, and three different objects. See the supplementary materials for more comparisons.

4.3. Quantitative Evaluation

Objective Metric. We adopt the CLIP score to objectively evaluate the semantic alignment achieved by our TeMO and recent 3D stylization methods. Specifically, 8 views spaced 45° apart around the stylized meshes are chosen to obtain the rendered 2D images. Then, the visual objects are compared with the textual objects in CLIP's embedding space via the cosine function. As shown in the 2nd column of Tab. 1, our TeMO surpasses previous methods by a large margin. These results demonstrate the superiority of our TeMO over existing methods on multi-object stylization. A sketch of this evaluation protocol is given below.
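A minimal sketch of this CLIP-score protocol with the official OpenAI CLIP package could look as follows; the rendered views are assumed to be provided as PIL images, and averaging the per-view cosine similarities is our assumption about how the single score is aggregated.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def clip_score(rendered_views, prompt):
    # rendered_views: 8 PIL images rendered at 45-degree intervals around the stylized mesh.
    images = torch.stack([preprocess(v) for v in rendered_views]).to(device)
    img_feats = model.encode_image(images).float()
    txt_feat = model.encode_text(clip.tokenize([prompt]).to(device)).float()
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feats @ txt_feat.t()).mean().item()   # assumed: mean cosine similarity over the views
```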
User Study. We further conduct a user study to subjectively evaluate these 3D stylization methods. We randomly select 10 mesh-text pairs and recruit 60 users to evaluate the quality of the stylization assets generated by our TeMO and the previous methods. The participants include experts in the field and individuals without specific background knowledge. Each of them is asked three questions [22]: (Q1) "How natural are the output results?" (Q2) "How well does the output match the original content?" (Q3) "How well does the output match the target style?", and then assigns a score (1-5). We report the mean opinion score for each factor averaged across all style outputs. As shown in Tab. 1, our TeMO outperforms the other methods across all questions. Therefore, the 3D assets generated by our method are more in line with people's understanding of the text prompts.

Table 1. Quantitative comparisons of our TeMO and previous text-driven 3D stylization methods in multi-object scenes, including an objective alignment score (0-1) and three subjective opinion scores (1-5). Higher is better for all metrics.

Method           Alignment  User-Q1  User-Q2  User-Q3
Text2Mesh [22]   0.262      1.750    1.506    1.472
TANGO [14]       0.274      2.406    2.450    2.539
X-Mesh [20]      0.265      1.839    1.722    1.761
Our TeMO         0.285      3.344    3.311    3.261

4.4. Ablation Studies

To verify the effectiveness of the proposed designs in our TeMO, we conduct ablation studies by gradually adding them to our baseline model, i.e., TANGO [14]. We choose the two-dragon mesh with the text prompt "a fire dragon and an ice dragon", and the experimental results are shown in Fig. 6. Compared to the baseline model, introducing our DGA module enables the model to distinguish the two dragons, yet it falls short in endowing them with precise texture details. Meanwhile, incorporating our CGC loss helps the model capture more semantic details, e.g., "fire" and "ice", but it fails to distinguish the two objects. Notably, the model equipped with both designs not only accurately distinguishes between the two objects but also synthesizes high-quality texture details for them. These experiments indicate that our DGA module and CGC loss can effectively assist the model in generating the desired stylized content for multiple 3D objects to conform to the target text.

Figure 6 (columns: Input Mesh, Baseline, Baseline + DGA module, Baseline + CGC loss, Our TeMO). Ablation experiments on the proposed designs of our TeMO. Mesh: two dragons; Text Prompt: "a fire dragon and an ice dragon".

5. Limitation and Future Work

Despite achieving excellent results on text-driven multi-object stylization, our TeMO framework still has a few limitations, which can also facilitate future research: 1) 3D Symmetry Prior. As stated in Sec. 4.2, our TeMO does not incorporate the 3D symmetry prior, whose important role in promoting the style consistency of a single object has been demonstrated by Text2Mesh [22]. To generate more photorealistic stylization assets for multi-object scenes, it would be valuable to calculate a symmetry plane for each object and apply the symmetry prior per object. 2) Diffusion Model. We observe that current diffusion technologies struggle to generate multi-object images according to the text prompt, which hinders the application of diffusion-based stylization methods in multi-object 3D scenes. We argue it would be interesting to extend the concept of scene parsing to diffusion models to release their potential in multi-object editing or generation.

6. Conclusion

In this paper, we present TeMO, an innovative framework that introduces scene parsing and multi-grained cross-modal supervision to achieve text-driven multi-object 3D stylization for the first time. Specifically, we first develop a DGA module to precisely align the objects in the 3D mesh and the text prompt and to enhance the 3D point features with the word features belonging to the same objects. Then, we design a CGC loss, in which a fine-grained loss at the local level and a coarse-grained contrast loss at the global level are constructed to complement each other. Further, extensive experiments demonstrate the effectiveness and superiority of our method over existing methods across a wide range of multi-object 3D scenes. We believe it is promising to achieve simultaneous content editing of multiple objects in 3D scenes, and we hope the scene-parsing perspective provided by the proposed TeMO framework will inspire future works.

References
position. NeurIPS, 35:30923–30936, 2022. 1, 2, 3, 4, 5, 6, 7, 8
[15] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021. 2, 3
[16] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 5
[17] Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, Chia-Wen Lin, and Rongrong Ji. Dual-level collaborative transformer for image captioning. In AAAI, pages 2286–2293, 2021. 2
[18] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015. 2
[19] Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, and Rongrong Ji. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In ACM MM, pages 638–647, 2022. 2, 5
[20] Yiwei Ma, Xiaioqing Zhang, Xiaoshuai Sun, Jiayi Ji, Haowei Wang, Guannan Jiang, Weilin Zhuang, and Rongrong Ji. X-mesh: Towards fast and accurate text-driven 3d stylization via dynamic textual guidance. arXiv preprint arXiv:2303.15764, 2023. 1, 2, 3, 4, 5, 6, 7, 8
[21] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. In CVPR, pages 12663–12673, 2023. 2, 7, 1, 3
[22] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven neural stylization for meshes. In CVPR, pages 13492–13502, 2022. 1, 2, 5, 6, 7, 8
[23] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. Clip-mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia, pages 1–8, 2022. 2
[24] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
2 [25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021. 1, 2, 3, 5 [26] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 2 [27] Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. Texture: Text-guided texturing of 3d shapes. arXiv preprint arXiv:2302.01721, 2023. 2, 7, 1, 3 [28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684– 10695, 2022. 2, 3 [29] Scott D Roth. Ray casting for modeling solids. Computer graphics and image processing, 18(2):109–144, 1982. 3 [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. 2 [2] Edward Loper Bird, Steven and Ewan Klein. Natural language processing with python. O’Reilly Media Inc, 2009. 4 [3] Mark Boss, Raphael Braun, Varun Jampani, Jonathan T Barron, Ce Liu, and Hendrik Lensch. Nerd: Neural reflectance decomposition from image collections. In ICCV, pages 12684–12694, 2021. 3 [4] Arthur Caetano and Misha Sra. Arfy: A pipeline for adapting 3d scenes to augmented reality. In Adjunct Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1–3, 2022. 1 [5] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015. 5 [6] Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. Text2tex: Text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396, 2023. 2 [7] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873, 2023. 1, 2, 7, 3 [8] Shaoyu Chen, Budmonde Duinkharjav, Xin Sun, Li-Yi Wei, Stefano Petrangeli, Jose Echevarria, Claudio Silva, and Qi Sun. Instant reality: Gaze-contingent perceptual optimization for 3d virtual reality streaming. IEEE TVCG, 28(5): 2157–2167, 2022. 1 [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 2 [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3 [11] Kevin Frans, Lisa Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35:5207–5218, 2022. 5 [12] Mengdi Han, Xiaogang Guo, Xuexian Chen, Cunman Liang, Hangbo Zhao, Qihui Zhang, Wubin Bai, Fan Zhang, Heming Wei, Changsheng Wu, et al. Submillimeter-scale multimaterial terrestrial robots. Science Robotics, 7(66):eabn0602, 2022. 1 [13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. 
In CVPR, pages 7132–7141, 2018. 2 [14] Jiabao Lei, Yabin Zhang, Kui Jia, et al. Tango: Text-driven photorealistic and robust 3d stylization via lighting decom- 9 [45] Kai Zhang, Nick Kolkin, Sai Bi, Fujun Luan, Zexiang Xu, Eli Shechtman, and Noah Snavely. Arf: Artistic radiance fields. In ECCV, pages 717–733. Springer, 2022. 1 [46] Xiuming Zhang, Pratul P Srinivasan, Boyang Deng, Paul Debevec, William T Freeman, and Jonathan T Barron. Nerfactor: Neural factorization of shape and reflectance under an unknown illumination. ACM TOG, 40(6):1–18, 2021. 3 [47] Xuying Zhang, Xiaoshuai Sun, Yunpeng Luo, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Feiyue Huang, and Rongrong Ji. Rstnet: Captioning with adaptive attention on visual and non-visual words. In CVPR, pages 15465–15474, 2021. 2 [48] Xuying Zhang, Bowen Yin, Zheng Lin, Qibin Hou, DengPing Fan, and Ming-Ming Cheng. Referring camouflaged object detection. arXiv preprint arXiv:2306.07532, 2023. 2 [49] Qingnan Zhou and Alec Jacobson. Thingi10k: A dataset of 10,000 3d-printing models. arXiv preprint arXiv:1605.04797, 2016. 5 [50] Zoran Zivkovic. Improved adaptive gaussian mixture model for background subtraction. In ICPR, pages 28–31. IEEE, 2004. 4 [30] Oana Sidi, Oliver Van Kaick, Yanir Kleiman, Hao Zhang, and Daniel Cohen-Or. Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. In SIGGRAPH Asia, pages 1–10, 2011. 5 [31] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 33:7537–7547, 2020. 4 [32] TurboSquid. Turbosquid 3d model repository. In https://www.turbosquid.com/, 2021. 5 [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017. 2 [34] Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In CVPR, pages 3835–3844, 2022. 1 [35] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794– 7803, 2018. 3 [36] Mingrui Wu, Xuying Zhang, Xiaoshuai Sun, Yiyi Zhou, Chao Chen, Jiaxin Gu, Xing Sun, and Rongrong Ji. Difnet: Boosting visual information flow for image captioning. In CVPR, pages 18020–18029, 2022. 2 [37] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–1920, 2015. 5 [38] Jianwei Yang, Yonatan Bisk, and Jianfeng Gao. Taco: Token-aware cascade contrastive learning for video-text alignment. In ICCV, pages 11562–11572, 2021. 3 [39] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021. 3 [40] Bowen Yin, Xuying Zhang, Qibin Hou, Bo-Yuan Sun, DengPing Fan, and Luc Van Gool. Camoformer: Masked separable attention for camouflaged object detection. arXiv preprint arXiv:2212.06570, 2022. 2 [41] Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, and Qibin Hou. Dformer: Rethinking rgbd representation learning for semantic segmentation. arXiv preprint arXiv:2309.09668, 2023. 2 [42] Kangxue Yin, Jun Gao, Maria Shugrina, Sameh Khamis, and Sanja Fidler. 
3dstylenet: Creating 3d shapes with geometric and texture style variations. In ICCV, pages 12456–12465, 2021. 1 [43] Bo Zhang, Lizbeth Goodman, and Xiaoqing Gu. Novel 3d contextual interactive games on a gamified virtual environment support cultural learning through collaboration among intercultural students. SAGE Open, 12(2): 21582440221096141, 2022. 1 [44] Kai Zhang, Fujun Luan, Qianqian Wang, Kavita Bala, and Noah Snavely. Physg: Inverse rendering with spherical gaussians for physics-based material editing and relighting. In CVPR, pages 5453–5462, 2021. 3 10 TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes Supplementary Material 7. Neural Stylization and Controls In this section, we provide more details on the neural stylization and controls of our TeMO method. We first render several stylized multi-object 3D assets of our TeMO from 4 views around them. As shown in Fig. 7, the rendered images exhibit natural variation in texture and a high degree of consistency across different viewpoints. Take the cathorse mesh coupled with a text prompt “a Garfield cat and a brown horse” as an example, our TeMO not only synthesize “Garfield” and “brown” property for the cat and horse separately but also generate visually plausible 3D content in different angles of view. Then, we report more stylized results generated by our TeMO, in which each multiobject mesh is stylized according to several text prompts. As shown in Fig. 8, our TeMO is able to synthesize stylized content faithful to different text prompts for the given mesh, which proves the robustness of our method for a variety of multi-object 3D scenes. Furthermore, we also give details of the bare mesh and stylized 3D assets by zooming in their local regions. As shown in Fig. 9, our stylization results can capture both global semantics and part-aware details, conforming to the text prompts. These experimental results indicate our TeMO is able to stylize the entire mesh in an object-consistent manner and flexibly generate results with accurate details as well as high fidelity. (a) a fire dragon and an ice dragon (b) a Garfield cat and a brown horse 8. Qualitative Evaluation Comparison with Diffusion-Based Methods. In this part, we compare the proposed TeMO with recent representative 3D stylization methods based on diffusion strategies, including Latent-NeRF [21] (CVPR 2023), TEXTure [27] (SIGGRAPH 2023), and Fantasia3D [7] (ICCV 2023). As shown in Fig. 10, the existing diffusion-based methods are prone to interference between different properties of the objects for a scene with multiple objects of the same/different categories. For the two-dragon mesh with a text prompt “a fire dragon and an ice dragon”, these methods pay more attention to the “fire” or “ice” property in each object. As far as the 3D scenes containing two or more different objects, these methods tend to focus on a certain property or mix multiple different properties together. Differently, our TeMO equipped with 3D scene parsing and multi-grained supervision is able to accurately synthesize the desired stylized content for each object in all multi-object scenes. (c) a superman, an ice whale, and a fire dragon Figure 7. Given several multi-object meshes, our TeMO stylizes entire 3D content on them to adhere to the text prompts. based methods fail to stylize multi-object scenes without misunderstanding various properties. 
Note that the existing diffusion-based methods utilize the priors in the pre-trained 2D text-to-image diffusion model by performing inference, on which basis the optimization of the 3D representation is achieved via differentiable rendering. This raises a straightforward question: can the pre-trained diffusion model generate accurate representations and images for textual descriptions containing multiple objects? Normally, the diffusion model employs the text encoder of the CLIP model [25] to extract global semantic features of the text prompt to guide image generation. As discussed in Sec. 1 of the main text, it is difficult for the CLIP model to encode a text description containing multiple objects into a single global semantic representation. We argue that this issue inevitably creates obstacles for diffusion methods in generating multi-object 2D scenes. The poor multi-object results produced by current cutting-edge text-to-image diffusion methods such as Stable Diffusion [28], shown in Fig. 11, support this hypothesis (a minimal probe of this hypothesis is sketched at the end of this supplement). To address this issue, a foreseeable solution is to extend the concept of scene parsing proposed in this paper to the diffusion model. We hope this perspective can inspire future work on the content editing of 2D/3D multi-object scenes.

Figure 8 (person & dragon: "an iron man and an ice dragon", "a superman and a fire dragon", "an astronaut and a gold dragon", "a Yeti and a green dragon"; cat & horse: "a Garfield cat and a brown horse", "a ginger cat and an astronaut horse", "an embroidered cat and a horse with spotted fur", "a silver cat and a gold dragon"). Given the same bare mesh, our TeMO method is able to produce stylized contents of high fidelity and various types for multi-object scenes to conform to the different text prompts.

Figure 9 (vase & candle: "a wicker vase and a candle in jeans", "a cactus vase and a silver candle", "a wood vase and a brick candle", "a chainmail vase and a gold candle"). TeMO produces accurate and photorealistic details over a variety of multi-object scenes, driven by a series of text prompts. The local stylization results in the red rectangle regions are zoomed in for better viewing.

Figure 10 (columns: Bare Mesh, Latent-NeRF [21], TEXTure [27], Fantasia3D [7], Our TeMO; rows: two dragons, "a fire dragon and an ice dragon"; vase & candle, "a wood vase and a brick candle"; person & dragon & whale, "a superman, a fire dragon, and an ice whale"). Visual comparisons of our TeMO with recent representative 3D stylization methods based on diffusion strategies on several multi-object scenes, including two objects of the same or different categories, and three different objects.

Figure 11 ((a) "a black cat and a white cat"; (b) "a white cat and a brown dog"). Examples of multi-object 2D scenes generated by the cutting-edge text-to-image method, i.e., Stable Diffusion [28]. For a scene with multiple objects of the same category or different categories, the diffusion model is prone to interference between different properties of the objects.
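As a rough, hedged way to probe this hypothesis, one can compare the CLIP text embedding of a composite multi-object prompt with the embeddings of its single-object parts; the snippet below (using the official OpenAI CLIP package) only shows the measurement and makes no claim about the resulting numbers.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

prompts = ["a black cat and a white cat", "a black cat", "a white cat"]
with torch.no_grad():
    feats = model.encode_text(clip.tokenize(prompts).to(device)).float()
feats = feats / feats.norm(dim=-1, keepdim=True)

# Cosine similarities between the composite prompt and each single-object prompt.
# If the global embedding blends the two objects, the two values are hard to separate,
# which would be consistent with the mixed-attribute generations in Fig. 11.
print((feats[0] @ feats[1]).item(), (feats[0] @ feats[2]).item())
```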