iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image Diffusion Model for Interior Design

arXiv:2312.04326v1 [cs.CV] 7 Dec 2023

Ruyi Gan♥♠, Xiaojun Wu♥, Junyu Lu♥, Yuanhe Tian♠, Dixiang Zhang♥♦, Ziwei Wu♥, Renliang Sun♥, Chang Liu♠, Jiaxing Zhang♥*, Pingjian Zhang♦, Yan Song♠*
♥ International Digital Economy Academy  ♦ South China University of Technology  ♠ University of Science and Technology of China
{ganruyi, wuxiaojun, lujunyu, zhangdixiang, zhangjiaxing, wuziwei, sunrenliang}@idea.edu.cn
yhtian@uw.edu  lc980413@mail.ustc.edu.cn  pjzhang@scut.edu.cn  clksong@gmail.com
* Corresponding Authors.

Abstract

With the open-sourcing of text-to-image (T2I) models such as Stable Diffusion (SD) and Stable Diffusion XL (SD-XL), there has been an influx of models fine-tuned from the open-source SD model for specific domains, such as anime and character portraits. However, specialized models remain rare in domains such as interior design, owing to the complex textual descriptions and detailed visual elements inherent in design, alongside the need for adaptable resolution. Text-to-image models for interior design therefore require outstanding prompt-following capabilities, as well as iterative collaboration with design professionals to achieve the desired outcome. In this paper, we collect and optimize text-image data in the design field and continue training in both English and Chinese on the basis of the open-source CLIP model. We also propose a fine-tuning strategy that combines curriculum learning with reinforcement learning from CLIP feedback to enhance the prompt-following capability of our approach and thereby improve the quality of image generation. Experimental results on the collected dataset demonstrate the effectiveness of the proposed approach, which achieves impressive results and outperforms strong baselines. We will release the code and model soon.

1. Introduction

Interior design synthesizes form, function, and aesthetics within physical spaces, requiring meticulous attention to detail and a deep understanding of cultural and contextual elements [6]. While text-to-image (T2I) models have made strides in general media generation [5, 21, 26], they have not been tailored to the unique demands of interior design. This disconnect stems from the models' limited ability to process the rich and often complex vocabulary of the interior design lexicon. As a result, the generated images frequently lack the sophistication and exactitude that professional design work demands. In this paper, we propose iDesigner, a model designed to generate visually rich and contextually accurate interior design images directly from descriptive text prompts.

iDesigner is crafted to address the unique and complex demands of interior design prompts, a challenge that general-purpose models often fail to meet, resulting in images that fall short of capturing the true essence of the designer's vision. Our model backbone is Stable Diffusion XL (SD-XL), currently the most widely used open-source T2I model [21, 26]. The methodological core of iDesigner is its use of curriculum learning [3], a pedagogically inspired approach that increases task complexity gradually. Training begins with the generation of basic, low-resolution images to ground the model in the foundational aspects of design aesthetics and function. As the model's proficiency grows, the curriculum progresses to sophisticated, high-resolution image creation, paying meticulous attention to the fine details that define the quality and accuracy of professional interior design imagery. Complementing this, iDesigner incorporates a specialized captioner module [14, 20] designed to parse and optimize complex textual prompts.
This enables the model to produce images with higher fidelity to the designer's intent. The integration of the Reinforcement Learning from CLIP Feedback (RLCF) method [4, 15, 32, 38] further refines the model's ability to follow prompts, establishing a reinforcing loop between textual instructions and image content that enhances the precision and relevance of the generated images. Figure 1 presents images generated by iDesigner, which demonstrate its effectiveness; more generated images are presented in Appendix A.

Figure 1. Example images generated by the proposed iDesigner model.

The contributions of this paper are summarized as follows:
• A novel text-to-image model, iDesigner, tailored for the interior design domain and made available to the community for collaborative enhancement and research.
• A strategic application of prompt engineering and large language models (LLMs) to produce more detailed and vivid captions, which markedly enhances the quality of the generated images.
• A pioneering integration of a curriculum learning framework with a diffusion model that progressively sharpens the model's generative capabilities, ensuring superior image quality. While diffusion models are known for their proficiency in generating detailed images, our curriculum-based approach elevates this to a new level.
• An RLCF method that reinforces the model's prompt adherence, fine-tuning image generation in accordance with detailed textual instructions.

2. The Approach

The optimization process of our iDesigner model comprises several key components. Initially, we annotate interior design renderings across multiple dimensions and employ GPT-3.5 [20] for caption rewriting, facilitating the training of our text-to-image model. Subsequently, we conduct a two-stage training process on the CLIP model, yielding a foundation model that is both universally applicable and enhanced for the design domain. This foundation model then replaces the textual component of the SD-XL model. We next progressively increase image resolution in a curriculum learning fashion during fine-tuning, augmenting the effectiveness of text-to-image generation. Finally, our trained CLIP model is utilized as a feedback mechanism in reinforcement learning, significantly improving the model's ability to follow instructions within the interior design domain.

2.1. Dataset

Data in Interior Design Domain. Due to the current scarcity of high-quality datasets in the field of interior design in the academic community, we established a collaborative partnership with an internationally acclaimed interior design company, leveraging their extensive historical design records. Our annotation effort involved over 1,000 seasoned interior design professionals who meticulously labeled the data. This process resulted in a curated, high-quality dataset comprising 3,600 image-text pairs. To ensure robust model training and evaluation, we divided the dataset into training and testing sets at a 9:1 ratio. This data collection and partitioning strategy laid the foundation for our research, enabling us to explore the intricate relationships between text and images in the domain of interior design.
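To make the data format concrete, below is a minimal sketch, not the authors' actual pipeline, of how an image-tag record and the 9:1 train/test split could be represented in Python; the `DesignRecord` class and its field names are hypothetical.

```python
import random
from dataclasses import dataclass, field

@dataclass
class DesignRecord:
    """One (X, Y) pair: an image X and its parallel tag labels Y."""
    image_path: str                              # X: rendering of the designed space
    tags: list = field(default_factory=list)     # Y: discretized designer tags
    recaption: str = ""                          # filled in later by the captioner

def split_dataset(records, train_ratio=0.9, seed=0):
    """Shuffle and split records into train/test sets at the given ratio (9:1 here)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

records = [DesignRecord("living_room_001.jpg",
                        ["modern minimalist style", "wooden floor", "dark sofa"])]
train_set, test_set = split_dataset(records)
```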
Data Recaptioning. Our interior design dataset is composed of high-quality pairings (X, Y), where X is an image and Y is a text composed of multiple parallel labels that describe the image. In interior design, Y generally comes from discretized tags annotated by designers or derived from web-crawled sources, focusing on simple descriptions of materials, styles, and colors. Worse, the web-crawled resources often contain irrelevant tags that cannot accurately describe the images.

Since the discretized tags inevitably overlook the spatial layout and local details of interior design, and irrelevant tags may mislead the model, we theorize that such shortcomings can be addressed with synthetically generated captions. For this purpose, we first collect and structure various interior design datasets. Then, following DALL-E 3 [14], we formulate a set of carefully designed system prompts to invoke the GPT-3.5 interface and rewrite the alt text into descriptive synthetic captions.
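The paper does not release its system prompts, so the following is only a hedged sketch of how discretized tags could be rewritten into descriptive captions with an LLM; `call_llm` and the prompt wording are placeholders for the GPT-3.5 interface and prompts the authors actually used.

```python
# Hypothetical recaptioning sketch: rewrite discretized design tags into a
# fluent, detailed caption. `call_llm` stands in for a GPT-3.5-style chat API.
SYSTEM_PROMPT = (
    "You are an interior design captioner. Rewrite the given list of design "
    "tags into one fluent, detailed paragraph describing the room's materials, "
    "layout, lighting, colors, and style. Do not invent elements that are not "
    "implied by the tags."
)

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for an actual LLM chat-completion call."""
    raise NotImplementedError

def recaption(tags):
    user_prompt = "Tags: " + "; ".join(tags)
    return call_llm(SYSTEM_PROMPT, user_prompt)

# Example:
# recaption(["modern minimalist style", "wooden floor", "dark sofa",
#            "floor-to-ceiling windows", "natural light"])
```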
2.2. CLIP Training

Vision-language foundation models such as CLIP [22] are a crucial component for aligning image and text representations, as they capture correlations between cross-modal features. Since the open-source CLIP model cannot meet the requirements of bilingual adaptation and multi-element cognition in interior design, we initialize from the pre-trained English-only CLIP and continue training in two stages. In the first stage, we collect a large-scale, web-crawled set of bilingual image-text pairs, including Laion [29] and Wukong [12], and make an effort to clean the data. We take the contrastive loss as the training objective and utilize a distributed, memory-efficient CLIP training approach to reduce memory consumption [7]. In the second stage, we continue training our CLIP on high-quality general and interior design image-text pairs preprocessed by the image captioner. In the interior design domain, the same image may match multiple distinctly different texts, each observing the image from a different perspective and level of detail.
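For reference, the contrastive objective mentioned above is, in standard CLIP training, the symmetric image-text InfoNCE loss sketched below; this assumes the authors follow the usual CLIP recipe and is not their released code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from the two CLIP encoders.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```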
2.3. iDesigner Training

This section discusses the core modules involved in training iDesigner. First, we introduce the fundamental text-to-image generation process and describe how we replace the text encoder of the open-source SD-XL with our previously trained CLIP model. Second, we detail our curriculum learning approach and the reinforcement learning method based on CLIP feedback. The overall training process, including data recaptioning, curriculum learning, and RLCF, is illustrated in Figure 2.

Figure 2. An illustration of the overall training process of iDesigner, which includes data recaptioning, curriculum learning, and reinforcement learning from CLIP feedback (RLCF).

2.3.1 Text-to-image Generation Process

In text-to-image generation with diffusion models, the methodology can be broadly divided into two phases.

Text Encoding. Traditional models employ the CLIP text encoder for feature extraction from textual descriptions. For Chinese-specific applications, the CLIP text encoder is replaced with a dedicated Chinese encoder. This adaptation ensures better alignment with Chinese linguistic structures and semantics.

Text-to-image Generation. Once textual features are extracted, they are incorporated into the latent diffusion process. Running the diffusion process in latent space offers computational efficiency, reducing both processing time and memory requirements. The final phase involves training on specific text-image datasets, with the aim of refining the model's capability to generate images that closely match the input textual descriptions.
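A hedged sketch of the text-encoder replacement is shown below using the Hugging Face diffusers interface; the checkpoint identifiers are placeholders, SD-XL actually conditions on two text encoders whose handling the paper does not detail, and the swapped encoder only becomes usable after the fine-tuning described in the following subsections.

```python
from diffusers import StableDiffusionXLPipeline
from transformers import CLIPTextModel, CLIPTokenizer

# Load the SD-XL backbone; the public base checkpoint is used only for illustration.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
)

# Hypothetical path to the bilingual CLIP text encoder trained in Sec. 2.2.
BILINGUAL_CLIP = "path/to/bilingual-clip"

# Swap the primary tokenizer and text encoder. SD-XL conditions on two text
# encoders, and the new encoder's hidden size must match what the UNet
# cross-attention expects, which is why the model is subsequently fine-tuned
# (Sec. 2.3.2) rather than used as-is.
pipe.tokenizer = CLIPTokenizer.from_pretrained(BILINGUAL_CLIP)
pipe.text_encoder = CLIPTextModel.from_pretrained(BILINGUAL_CLIP)
```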
2.3.2 Curriculum Learning

In the context of iDesigner, the nuanced domain of interior design requires a model that is not only proficient in generating images but also exceptional at rendering high-resolution images in which even the smallest design elements are crisply defined. To this end, iDesigner employs a curriculum learning (CL) approach, inspired by the pioneering concept introduced by [3], which mirrors the human learning process of progressing from simple to complex levels of understanding. In non-convex optimization problems, curriculum learning has been shown to substantially enhance performance and generalization: it speeds up convergence and facilitates the discovery of superior local optima.

Let G_θ denote the mapping from textual descriptions to image outputs, parameterized by θ, the set of model parameters, and let (x_i, y_i) denote the paired association of an image x_i and its corresponding textual descriptor y_i, which together constitute the i-th instance of the training corpus. The curriculum comprises two steps, each with a dedicated loss function. For the initial step (Step 1) at resolution 1024 × 1024, we define a loss L_1 that focuses on the global structure and basic elements of the design space:

L_1(θ) := E_{E(x_low), y, ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, τ_θ(y))‖₂² ],   (1)

where x_low is the version of x_i downsampled to the lower resolution, and L_1 measures the difference between the generated low-resolution image and the ground truth. The model can be interpreted as an equally weighted sequence of denoising latent autoencoders ε_θ(z_t, t), t = 1, …, T, where ε_θ is realized as a time-conditional UNet [27]. Since the forward process is fixed, z_t can be obtained efficiently from E(x) during training and can be decoded back to image space with a single pass through the VAE decoder [16]. The text conditioning τ_θ is parameterized with a transformer text encoder, and both τ_θ and ε_θ are jointly optimized via Eq. (1).

This foundational step allows the model to establish an understanding of the broader aesthetic and functional principles of interior design without being overwhelmed by the intricacies of high-resolution details. Once the model has demonstrated proficiency in generating coherent and contextually accurate images at this resolution, the curriculum progresses to Step 2, where the resolution is elevated to 2048 × 2048. We introduce a loss function L_2 that aims at refining the generated images to capture detailed design elements:

L_2(θ) := E_{E(x_high), y, ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, τ_θ(y))‖₂² ],   (2)

where L_2 quantifies the fidelity of the generated high-resolution image in terms of texture, pattern detail, and local design-element accuracy. This higher-resolution phase challenges the model to refine its generative capabilities, focusing on minute details such as the texture of fabrics, the play of light on different surfaces, and the precise appearance of small furniture and objects that are pivotal in a realistic interior design rendering.

The overall curriculum can be described by a compound loss L over the course of training epochs, combining L_1 and L_2 with a weighting function α(e) that adjusts the contribution of each loss over time:

L(θ, e) = α(e) · L_1(θ) + (1 − α(e)) · L_2(θ),   (3)

where α(e) is a monotonically decreasing function of the epoch e, with α(0) = 1 at the beginning of training, gradually decreasing to 0 as training proceeds. This effectively shifts the training focus from the global structure to the intricate details as the model's capacity increases. The model parameters θ are updated iteratively using stochastic gradient descent or one of its variants to minimize L(θ, e) over epochs:

θ_{e+1} = θ_e − η · ∇_θ L(θ_e, e),   (4)

where η is the learning rate and ∇_θ denotes the gradient with respect to the parameters θ.

This incremental approach is crucial: our experiments show that directly training a model at 2048 × 2048 pixels on interior design data, when the underlying base model is trained at 1024 × 1024, leads to structural imbalances in the generated images. These imbalances manifest as discrepancies in the spatial arrangement of furniture, inconsistencies in texture and pattern detail, and a general loss of image cohesion. The curriculum learning strategy in iDesigner prevents such imbalances by allowing the model to develop a hierarchical understanding of interior design elements: it first masters the global layout and composition before delving into local details. This staged approach ensures that generated images maintain structural integrity at higher resolutions and improves the model's generalization capability and convergence rate, guiding iDesigner through a structured learning pathway that handles increasing levels of complexity in a controlled manner, reflective of the staged learning process in human education, and demonstrating the efficacy of curriculum learning strategies in the complex domain of interior design text-to-image synthesis. The overall training procedure of iDesigner is summarized in Algorithm 1.

Algorithm 1: Curriculum Learning for iDesigner
1: Initialize iDesigner G_θ with noise predictor ε_θ, text encoder τ_θ, latent encoder E, and dataset D
2: for epoch in range(n_epochs) do
3:   Sample a batch of data B from D
4:   for (x_i, y_i) ∈ B do
5:     Generate text embeddings τ_θ(y_i)
6:     Generate latent embeddings z_i = E(x_i)
7:     t ∼ Uniform({1, …, T})
8:     ε ∼ N(0, I)
9:     Obtain z_i^t by adding noise ε to z_i
10:    Feed (z_i^t, τ_θ(y_i)) to ε_θ to generate noise predictions ε_pred
11:    Compute L over the batch B between ε and ε_pred using Eq. (3)
12:    Update θ by backpropagating L using Eq. (4)
13:  end for
14: end for
15: return the fine-tuned model G_θ
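To make the compound objective of Eqs. (1)-(4) and Algorithm 1 concrete, here is a minimal PyTorch-style sketch of one curriculum training step; the callables `unet`, `text_encoder`, `vae_encode`, and `add_noise` are placeholders for the corresponding SD-XL components, and the linear form of α(e) is an assumption, since the paper only requires that it decrease monotonically from 1 to 0.

```python
import torch
import torch.nn.functional as F

def alpha_schedule(epoch, n_epochs):
    """Monotonically decreasing weight with alpha(0) = 1, decaying to 0 (Eq. 3).
    A simple linear schedule is assumed; the paper only requires monotonicity."""
    return max(0.0, 1.0 - epoch / n_epochs)

def denoising_loss(unet, text_encoder, vae_encode, add_noise, images, captions, T=1000):
    """Eqs. (1)/(2): MSE between sampled noise and the UNet's noise prediction."""
    cond = text_encoder(captions)                      # tau_theta(y)
    z0 = vae_encode(images)                            # E(x), latent of the image
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                         # epsilon ~ N(0, I)
    zt = add_noise(z0, eps, t)                         # fixed forward diffusion to step t
    eps_pred = unet(zt, t, cond)                       # epsilon_theta(z_t, t, tau_theta(y))
    return F.mse_loss(eps_pred, eps)

def curriculum_step(unet, text_encoder, vae_encode, add_noise,
                    low_res_batch, high_res_batch, epoch, n_epochs, optimizer):
    """One update of Eq. (4) on the compound loss of Eq. (3)."""
    a = alpha_schedule(epoch, n_epochs)
    l1 = denoising_loss(unet, text_encoder, vae_encode, add_noise, *low_res_batch)
    l2 = denoising_loss(unet, text_encoder, vae_encode, add_noise, *high_res_batch)
    loss = a * l1 + (1.0 - a) * l2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```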
2.3.3 RLCF: Reinforcement Learning from CLIP Feedback

Since prompts in the interior design field consist of many tags, even after the data recaptioning and the two-stage SFT in the design field described above, the images generated by iDesigner show increased relevance to the text but still exhibit mismatches between textual details and image content. We therefore introduce reinforcement learning, which has achieved tremendous success for LLMs, to enhance prompt following. Unlike ChatGPT [20] and LLaMA 2 [31], which use human feedback, we directly employ the CLIP model fine-tuned on the design domain as the feedback signal, and iterate using rejection sampling. In each iteration, we score the images generated by iDesigner against the original text with CLIP; the higher the score, the greater the relevance between image and text. We select the top-K images by CLIP score and, together with the original image-text pair, form a new set of (K+1) image-text pairs for fine-tuning iDesigner.

We now delineate the RLCF (Reinforcement Learning from CLIP Feedback) algorithm, which proceeds in three steps at each stage t+1:

Step 1: Data collection. A batch of prompts D_t = {y_1^t, …, y_b^t} is sampled from the text domain, and for each prompt y_i^t ∈ D_t, a set of images x_1, …, x_K is generated by the image synthesis model iDesigner G.

Step 2: Data ranking. Using the reward model CLIP, we compute a set of rewards {r(y, x_1), …, r(y, x_K)} for each prompt y ∈ D_t. We then select the image with the highest reward, x := argmax_{x_j ∈ {x_1, …, x_K}} r(y, x_j), and repeat this for all b prompts to form a subset B of size b.

Step 3: Model fine-tuning. The iDesigner model G_θ is fine-tuned on the subset B, after which the next stage of the learning process commences.

This iterative process continues until the reward, as determined by the CLIP model, converges. The RLCF algorithm requires minimal hyperparameter tuning and is straightforward to implement. It capitalizes on a best-of-K policy, in which the model iteratively learns to produce image samples that are increasingly aligned with the highest rewards gauged by CLIP, thereby refining iDesigner's generation capabilities.
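The following sketch illustrates one RLCF stage (sample, score with CLIP, keep the best of K, fine-tune); `generate_images`, `clip_score`, and `finetune` are placeholders for iDesigner's sampler, the design-domain CLIP reward model, and the supervised fine-tuning routine, and the single best-of-K selection follows Step 2 above.

```python
def rlcf_stage(prompts, generate_images, clip_score, finetune, k=4):
    """One stage of RLCF: best-of-K rejection sampling guided by CLIP rewards.

    prompts:          list of text prompts sampled from the design domain (D_t)
    generate_images:  fn(prompt, n) -> list of n candidate images from iDesigner
    clip_score:       fn(prompt, image) -> scalar reward r(y, x)
    finetune:         fn(list of (prompt, image) pairs) -> None, updates iDesigner
    """
    selected = []
    for y in prompts:
        candidates = generate_images(y, k)                 # Step 1: data collection
        rewards = [clip_score(y, x) for x in candidates]   # Step 2: data ranking
        best = candidates[max(range(k), key=lambda j: rewards[j])]
        selected.append((y, best))
    finetune(selected)                                      # Step 3: model fine-tuning
    return selected
```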
3. Experiment and Result

Training Settings. For our iDesigner model, we employ the pre-trained checkpoint of Stable Diffusion XL (SD-XL) [21] as the foundational backbone, ensuring a robust starting point for image generation tasks. To optimize resource utilization and expedite training, we use the BFLOAT16 format, which significantly reduces GPU memory requirements while maintaining training efficiency. Our training regimen adopts a learning rate of 1e-5, stabilized initially through a warm-up phase and followed by a cosine decay schedule that gradually reduces the learning rate, facilitating fine-tuning and convergence to a more precise model state. These settings are critical in balancing training speed against model performance.

Baselines. In our comparative analysis, we consider two strong baselines: DALL-E 3 [14] and SD-XL [21]. DALL-E 3 is renowned for its innovative text-to-image capabilities, generating high-quality images from textual descriptions, and serves as a benchmark for cutting-edge generative models. SD-XL is a variant of the Stable Diffusion model known for its extended capabilities in handling complex image synthesis tasks. By comparing iDesigner with these established models, we aim to demonstrate the effectiveness and advancements of our approach, particularly in terms of bilingual image generation and adherence to textual prompts.

Evaluation Protocols. Our evaluation framework encompasses both machine and human assessments. Machine evaluation metrics include CLIP retrieval performance (image-to-text and text-to-image retrieval); CLIP Similarity (CLIP Sim), which measures the semantic alignment between generated images and text descriptions; Inception Score (IS), assessing the quality and diversity of the images; and Fréchet Inception Distance (FID), evaluating the distance between the distributions of generated and real images. Human evaluation involves subjective assessments by a group of evaluators, who rate the images based on visual appeal, relevance to the provided prompts, and overall aesthetic quality. This dual approach ensures a well-rounded evaluation, combining objective computational assessments with human perceptual judgments.
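As a reference for the machine metrics above, CLIP Sim can be computed as the cosine similarity between CLIP image and text embeddings, as sketched below; the public checkpoint name is used only for illustration and stands in for the authors' fine-tuned bilingual CLIP.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# A generic public checkpoint is used here for illustration; the paper scores
# images with its own bilingual CLIP fine-tuned on interior design data.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_sim(image, text):
    """Cosine similarity between CLIP embeddings of a generated image and its prompt."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1).item()
```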
3.1. Machine Evaluation

Our CLIP model achieves the best performance on both English and Chinese datasets; detailed retrieval results are reported in Appendix B. On the image-text retrieval tasks of Flickr30K [36] and MSCOCO [19] and their Chinese counterparts, the original CLIP model demonstrates a foundational understanding, with modest retrieval rates that highlight the challenge of transferring learning across languages. In contrast, AltCLIP [8] and our-CLIP exhibit remarkable improvements, with our-CLIP attaining the highest recall rates on most metrics. Notably, in the text-to-image retrieval task, our-CLIP achieves 88.1% and 69.7% recall@1 on the Flickr30K-CN [36] and MSCOCO-CN [18] datasets, respectively, indicating a robust alignment between text prompts and visual content. These results underscore the efficacy of tailored modifications in enhancing CLIP's cross-lingual performance and the potential of specialized models in handling diverse linguistic contexts within multimodal applications.

Table 1 offers a comprehensive overview of the performance of the compared models on the English and Chinese datasets, evaluated with CLIP Similarity (CLIP Sim), Inception Score (IS), and Fréchet Inception Distance (FID).

Table 1. Comparison of different models based on CLIP Sim, IS, and FID across the English and Chinese datasets. The best results, excluding the original Test Set, are marked in bold.

English Dataset
| Model          | CLIP Sim (↑) | IS (↑) | FID (↓) |
|----------------|--------------|--------|---------|
| Test Set       | 0.205        | 4.838  | 0       |
| SD-XL [21]     | 0.112        | 3.450  | 95.867  |
| DALL-E 3 [14]  | 0.118        | 3.832  | 89.906  |
| iDesigner 1024 | 0.135        | 4.562  | 79.340  |
| iDesigner 2048 | 0.137        | 4.559  | 79.262  |
| iDesigner RLCF | 0.145        | 4.690  | 76.832  |

Chinese Dataset
| Model          | CLIP Sim (↑) | IS (↑) | FID (↓) |
|----------------|--------------|--------|---------|
| Test Set       | 0.181        | 4.838  | 0       |
| SD-XL [21]     | 0.096        | 3.007  | 95.439  |
| DALL-E 3 [14]  | 0.106        | 3.201  | 90.236  |
| iDesigner 1024 | 0.129        | 4.004  | 80.172  |
| iDesigner 2048 | 0.126        | 4.309  | 79.816  |
| iDesigner RLCF | 0.136        | 4.315  | 78.102  |

Notably, the iDesigner RLCF model demonstrates superior performance in both linguistic contexts, achieving the highest CLIP Sim and IS and the lowest FID, distinctly outperforming SD-XL and DALL-E 3. On the English dataset, iDesigner RLCF attains a CLIP Sim of 0.145, indicating a more refined alignment between text and image than SD-XL and DALL-E 3, which score 0.112 and 0.118, respectively. This enhancement in semantic consistency is further corroborated by its IS of 4.690, surpassing the iDesigner 1024 and iDesigner 2048 variants and suggesting a more accurate capture of nuanced image details. The model's proficiency in generating high-quality images is also reflected in its FID of 76.832, the lowest among the compared models, indicating closer proximity to the real image distribution.

The trend is consistent on the Chinese dataset, where iDesigner RLCF again leads with a CLIP Sim of 0.136, an IS of 4.315, and an FID of 78.102. Compared with the English dataset, a slight variation in scores is observed, possibly attributable to the linguistic and cultural differences inherent in the datasets. Nevertheless, the model's robustness across languages is evident, marking a significant advancement in bilingual image generation capabilities. These results collectively underline the effectiveness of the RLCF approach in iDesigner, particularly in enhancing the model's ability to comprehend and respond accurately to complex textual prompts, thereby generating images that are not only visually appealing but also semantically coherent across diverse linguistic contexts.
3.2. Human Preference Evaluation

In addition to the automated metrics presented in Table 1, we perform a human preference evaluation to directly assess the perceptual quality of the generated images. Participants were asked to compare images generated by iDesigner, DALL-E 3, and SD-XL in response to a variety of prompts, and were blinded to the model origins of the images to prevent potential bias. The results, depicted in Figure 3, clearly indicate a preference for iDesigner, with a win rate of over 58% against both the DALL-E 3 and SD-XL baselines. This substantial margin not only reinforces the quantitative findings from the CLIP scores but also illustrates the qualitative leap in generation fidelity that iDesigner represents. The generated images were frequently cited as more coherent, aesthetically pleasing, and true to the textual descriptions provided.

Figure 3. Results of the human preference evaluation, illustrating the win/lose/tie rates of our iDesigner method against the other competing models. The iDesigner model demonstrates a clear preference among human evaluators, substantiating its superior image generation capabilities.

This subjective evaluation underscores the effectiveness of iDesigner in preserving the semantic essence of the original prompts while producing images that resonate more strongly with human judges. The ability of iDesigner to maintain high CLIP scores while also securing human preference attests to its advanced capability in generating semantically and visually compelling images; the feedback from human evaluators provides insights that go beyond numerical scores, highlighting the nuanced improvements iDesigner brings to text-to-image synthesis. In Figure 4, we present images generated from the same prompt by iDesigner, SD-XL, and DALL-E 3. Compared with the other T2I models, our results are more closely aligned with the stylistic essence of human designer output and exhibit superior image quality.

Origin caption: "The specific space description is a star hotel, the theme is Nordic skiing family vacation, the main material of the ground is wooden floor + stone, the main material of the wall is cloth + wood veneer, the soft decoration material of the space is beige fabric + light wood veneer, the picture perspective is one-point perspective, the space color is warm, the light is floor-to-ceiling panoramic windows + daytime natural light, the space is 45 square meters, the style is Nordic style, the space type is hotel, the main material of the top surface is latex paint, the main material of the space is art glass sliding door + cloth wall, and the space function is the guest room reception area."

Recaption: "A hotel guest room meeting area, designed in a Nordic style for a ski family vacation, spanning 45 square meters. The space features wooden flooring and stone materials, with walls adorned in textured fabric and wood finishes. The soft furnishings include beige fabric and light wood finishes. The perspective is one-point, showcasing a warm color palette. Natural light floods in through floor-to-ceiling panoramic windows. The ceiling is finished with latex paint, and the space is enhanced with artistic glass sliding doors and fabric wall coverings, embodying the cozy and functional Nordic design ethos."

Figure 4. The images, from left to right, are generated respectively by a human designer, iDesigner, SD-XL, and DALL-E 3.

3.3. Ablation Study

In our ablation study, shown in Table 2, we systematically evaluate the contribution of each component of the iDesigner model. The complete iDesigner model with RLCF achieves the highest scores across all metrics. Removing RLCF results in a notable decrease in CLIP Sim to 0.120, highlighting its significant role in improving text-image semantic alignment. Removing curriculum learning impacts both the Inception Score and FID, with a decrease to 4.200 and an increase to 85.000, respectively, indicating its importance for image quality and diversity. The absence of the captioner has a milder effect, evidenced by a slight decrease in the Inception Score to 4.500 and a marginal increase in FID to 76.900.

Table 2. Ablation study on iDesigner. Cap: Captioner, CL: Curriculum Learning, RLCF: Reinforcement Learning from CLIP Feedback.

| Model Variant      | CLIP Sim (↑) | IS (↑) | FID (↓) |
|--------------------|--------------|--------|---------|
| iDesigner          | 0.145        | 4.690  | 76.832  |
| iDesigner w/o Cap  | 0.142        | 4.500  | 76.900  |
| iDesigner w/o CL   | 0.140        | 4.200  | 85.000  |
| iDesigner w/o RLCF | 0.120        | 4.650  | 77.100  |

These results collectively illustrate the synergistic effect of these components in optimizing the performance of the iDesigner model, with each playing a critical role in achieving high-quality, semantically coherent image generation. Overall, the ablation study underscores the integral role of each component in the final performance of iDesigner: the harmonized interplay between the captioner, curriculum learning, and rejection sampling endows iDesigner with its ability to generate images that are not only visually compelling but also deeply resonant with the textual prompts provided by users.
4. Related Work

Image Generation and Diffusion Models. The field of text-to-image generation has witnessed significant advances in recent years. Compared with earlier approaches such as GANs [2, 11], VAEs [16], flow-based models [25], and autoregressive models [9, 10, 23], this work places greater emphasis on diffusion models. With the advancement and maturation of diffusion theory and techniques [5, 13, 30, 33], diffusion models have become one of the mainstream technologies in image generation. Notable developments include DALL-E 2 [24], which introduces a hierarchical approach to generating images conditioned on textual descriptions using CLIP latents, while [14] pointed out that better captions can improve image generation quality. Imagen [28] and DeepFloyd IF [1] present diffusion models that generate photorealistic images from textual descriptions with an emphasis on deep language understanding. The currently most popular diffusion models are latent diffusion models [26], including a series of works such as stable-diffusion-v1-5, stable-diffusion-2-1, and stable-diffusion-xl [21]. These models primarily extract textual features with the CLIP text model and incorporate them into the latent diffusion process; conducting the diffusion process in the latent space reduces computational overhead and memory requirements. Moreover, owing to significant advances in reinforcement learning for large language models, there have been attempts [4, 15, 32, 38] to integrate reinforcement learning into diffusion models, aiming to enhance generation quality and the degree of textual control. Although diffusion models have been employed in various design fields such as character design, scene design, and architectural design, most applications are based on simple fine-tuning of the stable diffusion model. Given the multitude of elements and the complexity of scenes characteristic of interior design, there remains a lack of a customized text-to-image model tailored specifically to this domain.

Bilingual Text-to-image Models. To better serve text-to-image needs in Chinese scenarios, Chinese researchers have proposed numerous works. Mainstream Chinese diffusion models are mostly derived from further training of stable diffusion, typically in two steps. The first step replaces the CLIP text encoder with a bilingual or Chinese encoder, followed by pre-training for text-image matching on a Chinese text-image dataset; representative works include Taiyi-CLIP [37], Chinese-CLIP [34], and AltCLIP [8]. The second step replaces the text encoder in stable diffusion and continues training for text-to-image generation on a Chinese text-image dataset, yielding Chinese diffusion models such as Taiyi-Diffusion [37] and AltDiffusion [35]. However, replacing the CLIP text encoder often causes the text-to-image model to lose its English capability, and the training cost can be relatively high.

Text-image Datasets. Whether for text-image matching or text-to-image generation, datasets play a crucial role. Traditional image caption datasets, such as MSCOCO [19] and Flickr30K [36] in English and MSCOCO-CN [18] and Flickr30K-CN [17] in Chinese, are suitable for training but are relatively small, typically below one million pairs. As a result, web-crawled datasets such as Laion [29] (primarily English) and Wukong [12] (primarily Chinese), which have reached scales of 100 million or even 5 billion pairs, have become more critical sources of training data for diffusion text-to-image models.
5. Conclusion

In this paper, we propose the first Chinese-English bilingual text-to-image model for the interior design field and optimize the data processing and model training methods to address the needs of this domain, including data scarcity, complex textual descriptions and image elements, and diverse image resolutions, ultimately obtaining strong generation results in this field. In the future, we will continue to optimize the model in several directions, including increasing the quantity and quality of text-image data, incorporating the knowledge of large language models, and employing multi-dimensional feedback reinforcement learning.

References

[1] Alex Shonenkov, Misha Konstantinov, Daria Bakshandaeva, Christoph Schuhmann, Ksenia Ivanova, and Nadiia Klokova. DeepFloyd IF, 2023.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
[3] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
[4] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.
[5] Hanqun Cao, Cheng Tan, Zhangyang Gao, Guangyong Chen, Pheng-Ann Heng, and Stan Z Li. A survey on generative diffusion model. arXiv preprint arXiv:2209.02646, 2022.
[6] Junming Chen, Zichun Shao, and Bin Hu. Generating interior design from text: A new diffusion model-based method for efficient creative design. Buildings, 13(7):1861, 2023.
[7] Yihao Chen, Xianbiao Qi, Jianan Wang, and Lei Zhang. Disco-CLIP: A distributed contrastive loss for memory efficient CLIP training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22648–22657, 2023.
[8] Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. AltCLIP: Altering the language encoder in CLIP for extended language capabilities. arXiv preprint arXiv:2211.06679, 2022.
[9] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
[10] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. CogView2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems, 35:16890–16902, 2022.
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[12] Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431, 2022.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[14] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. OpenAI, cdn.openai.com/papers/dall-e3.pdf, 2023.
[15] Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
[16] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[17] Xirong Li, Weiyu Lan, Jianfeng Dong, and Hailong Liu. Adding Chinese captions to images. In Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, pages 271–275, 2016.
[18] Xirong Li, Chaoxi Xu, Xiaoxu Wang, Weiyu Lan, Zhengxiong Jia, Gang Yang, and Jieping Xu. COCO-CN for cross-lingual image tagging, captioning, and retrieval. IEEE Transactions on Multimedia, 21(9):2347–2360, 2019.
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer, 2014.
[20] OpenAI. Introducing ChatGPT, 2022.
[21] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[23] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[24] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[25] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
[26] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
[27] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, pages 234–241. Springer, 2015.
[28] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[29] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
[30] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[31] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[32] Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, and Glen Berseth. Reasoning with latent diffusion in offline reinforcement learning. arXiv preprint arXiv:2309.06599, 2023.
[33] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
[34] An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. Chinese CLIP: Contrastive vision-language pretraining in Chinese. arXiv preprint arXiv:2211.01335, 2022.
[35] Fulong Ye, Guangyi Liu, Xinya Wu, and Ledell Yu Wu. AltDiffusion: A multilingual text-to-image diffusion model. arXiv preprint arXiv:2308.09991, 2023.
[36] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[37] Jiaxing Zhang, Ruyi Gan, Junjie Wang, Yuxiang Zhang, Lin Zhang, Ping Yang, Xinyu Gao, Ziwei Wu, Xiaoqun Dong, Junqing He, et al. Fengshenbang 1.0: Being the foundation of Chinese cognitive intelligence. arXiv preprint arXiv:2209.02970, 2022.
[38] Zhengbang Zhu, Hanye Zhao, Haoran He, Yichao Zhong, Shenyu Zhang, Yong Yu, and Weinan Zhang. Diffusion models for reinforcement learning: A survey. arXiv preprint arXiv:2311.01223, 2023.

A. Example Images Generated by iDesigner

Figure 5 presents example images generated by the proposed iDesigner, illustrating the effectiveness of our approach in generating high-quality images.

Figure 5. Examples of image comparisons at different resolutions. The images on top are generated by a human designer; the images below are generated by iDesigner.
B. CLIP Results on General Datasets

Tables 3 and 4 present the performance of our CLIP model on datasets with English and Chinese captions, respectively. A CLIP model endowed with robust bilingual comprehension can significantly enhance the ability of iDesigner to understand user-input prompts and subsequently generate images that accurately conform to the given prompts.

Table 3. Zero-shot image-text retrieval results on the Flickr30K [36] and MSCOCO [19] datasets.

Flickr30K
| Model       | Image → Text R@1 | R@5  | R@10 | Text → Image R@1 | R@5  | R@10 |
|-------------|------------------|------|------|------------------|------|------|
| CLIP [22]   | 85.1             | 97.3 | 99.2 | 65.0             | 87.1 | 92.2 |
| AltCLIP [8] | 86.0             | 98.0 | 99.1 | 72.5             | 91.6 | 95.4 |
| our-CLIP    | 88.4             | 98.8 | 99.9 | 75.7             | 93.8 | 96.9 |

MSCOCO
| Model       | Image → Text R@1 | R@5  | R@10 | Text → Image R@1 | R@5  | R@10 |
|-------------|------------------|------|------|------------------|------|------|
| CLIP [22]   | 56.4             | 79.5 | 86.5 | 36.5             | 61.1 | 71.1 |
| AltCLIP [8] | 58.6             | 80.6 | 87.8 | 42.9             | 68.0 | 77.4 |
| our-CLIP    | 61.2             | 84.8 | 90.3 | 49.2             | 70.3 | 79.6 |

Table 4. Zero-shot image-text retrieval results on the Flickr30K-CN [36] and MSCOCO-CN [18] datasets.

Flickr30K-CN
| Model       | Image → Text R@1 | R@5  | R@10 | Text → Image R@1 | R@5  | R@10 |
|-------------|------------------|------|------|------------------|------|------|
| CLIP [22]   | 2.3              | 8.1  | 12.6 | 0                | 2.4  | 4.0  |
| AltCLIP [8] | 69.8             | 89.9 | 94.7 | 84.8             | 97.4 | 98.8 |
| our-CLIP    | 73.2             | 90.3 | 96.5 | 88.1             | 98.2 | 99.1 |

MSCOCO-CN
| Model       | Image → Text R@1 | R@5  | R@10 | Text → Image R@1 | R@5  | R@10 |
|-------------|------------------|------|------|------------------|------|------|
| CLIP [22]   | 0.6              | 4.1  | 7.1  | 1.8              | 6.7  | 11.9 |
| AltCLIP [8] | 63.9             | 87.2 | 93.9 | 62.8             | 88.8 | 95.5 |
| our-CLIP    | 66.0             | 91.1 | 96.6 | 69.7             | 91.3 | 96.8 |
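For completeness, the recall@K numbers in Tables 3 and 4 can be computed with a routine along the following lines; this is a generic sketch that assumes precomputed, L2-normalized CLIP embeddings and a single ground-truth match per query, rather than the authors' evaluation code.

```python
import torch

def retrieval_recall_at_k(query_emb, gallery_emb, gt_index, ks=(1, 5, 10)):
    """Zero-shot retrieval recall@K.

    query_emb:   (N, d) normalized embeddings of the queries (e.g., captions)
    gallery_emb: (M, d) normalized embeddings of the gallery (e.g., images)
    gt_index:    (N,) tensor with the index of the ground-truth gallery item per query
    """
    sims = query_emb @ gallery_emb.t()                  # cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)       # (N, M) ranked gallery indices
    results = {}
    for k in ks:
        hits = (ranks[:, :k] == gt_index.unsqueeze(1)).any(dim=1)
        results[f"R@{k}"] = hits.float().mean().item() * 100
    return results
```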