iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image Diffusion Model for Interior Design

arXiv:2312.04326v1 [cs.CV] 7 Dec 2023

Ruyi Gan♥♠, Xiaojun Wu♥, Junyu Lu♥, Yuanhe Tian♠, Dixiang Zhang♥♦, Ziwei Wu♥, Renliang Sun♥, Chang Liu♠, Jiaxing Zhang♥*, Pingjian Zhang♦, Yan Song♠*
♥ International Digital Economy Academy  ♦ South China University of Technology  ♠ University of Science and Technology of China
{ganruyi, wuxiaojun, lujunyu, zhangdixiang, zhangjiaxing, wuziwei, sunrenliang}@idea.edu.cn
yhtian@uw.edu  lc980413@mail.ustc.edu.cn  pjzhang@scut.edu.cn  clksong@gmail.com
* Corresponding Authors.

Abstract

With the open-sourcing of text-to-image (T2I) models such as Stable Diffusion (SD) and Stable Diffusion XL (SD-XL), there has been an influx of models fine-tuned from the open-source SD model for specific domains, such as anime and character portraits. However, specialized models remain rare in domains such as interior design, owing to the complex textual descriptions and detailed visual elements inherent in design, alongside the need for adaptable resolution. Text-to-image models for interior design therefore require outstanding prompt-following capabilities, as well as iterative collaboration with design professionals to achieve the desired outcome. In this paper, we collect and optimize text-image data in the design field and continue training in both English and Chinese on the basis of the open-source CLIP model. We also propose a fine-tuning strategy that combines curriculum learning with reinforcement learning from CLIP feedback to enhance the prompt-following capability of our approach and thereby improve the quality of image generation. Experimental results on the collected dataset demonstrate the effectiveness of the proposed approach, which achieves impressive results and outperforms strong baselines. We will release the code and model soon.

1. Introduction

Interior design synthesizes form, function, and aesthetics within physical spaces, requiring meticulous attention to detail and a deep understanding of cultural and contextual elements [6]. While text-to-image (T2I) models have made strides in general media generation [5, 21, 26], they have not been tailored to the unique demands of interior design. This disconnect stems from the models' limited ability to process the rich and often complex vocabulary of the interior design lexicon. As a result, the generated images frequently lack the sophistication and exactitude that professional design work demands. In this paper, we propose iDesigner, a model designed to generate visually rich and contextually accurate interior design images directly from descriptive text prompts.

iDesigner is crafted to address the unique and complex demands of interior design prompts, a challenge that general-purpose models often fail to meet, resulting in images that fall short of capturing the true essence of the designer's vision. Our model backbone is Stable Diffusion XL (SD-XL), currently the most widely used open-source T2I model [21, 26]. The methodological core of iDesigner is its use of curriculum learning [3], a pedagogically inspired approach that increases task complexity gradually. Training begins with the generation of basic, low-resolution images to ground the model in the foundational aspects of design aesthetics and function. As the model's proficiency grows, the curriculum progresses to sophisticated, high-resolution image creation, paying meticulous attention to the fine details that define the quality and accuracy of professional interior design imagery. Complementing this, iDesigner incorporates a specialized captioner module [14, 20] designed to parse and optimize complex textual prompts.
This enables the model to produce images with higher fidelity to the designer's intent. The integration of the Reinforcement Learning from CLIP Feedback (RLCF) method [4, 15, 32, 38] further refines the model's ability to follow prompts, establishing a reinforcing loop between textual instructions and image content that enhances the precision and relevance of the generated images. Figure 1 presents images generated by iDesigner, which demonstrate its effectiveness; more generated images are presented in Appendix A.

Figure 1. Example images generated by the proposed iDesigner model.

The contributions of this paper are summarized as follows:
• A novel text-to-image model, iDesigner, tailored for the interior design domain and made available to the community for collaborative enhancement and research.
• A strategic application of prompt engineering and large language models (LLMs) to produce more detailed and vivid captions, which markedly enhances the quality of the generated images.
• A pioneering integration of a curriculum learning framework with a diffusion model that progressively sharpens the model's generative capabilities, ensuring superior image quality. While diffusion models are known for their proficiency in generating detailed images, our curriculum-based approach elevates this to a new level.
• An RLCF method that reinforces the model's prompt adherence, fine-tuning image generation in accordance with detailed textual instructions.

2. The Approach

The optimization process of our iDesigner model comprises several key components. Initially, we annotate interior design renderings across multiple dimensions and employ GPT-3.5 [20] for caption rewriting, facilitating the training of our text-to-image model. Subsequently, we conduct a two-stage training process on the CLIP model, yielding a foundation model that is both universally applicable and enhanced for the design domain. This foundation model then replaces the textual component of the SD-XL model. We next progressively increase image resolution in a curriculum learning fashion during fine-tuning, augmenting the effectiveness of text-to-image generation. Finally, our trained CLIP model is utilized as a feedback mechanism in reinforcement learning, significantly improving the model's ability to follow instructions within the interior design domain.

2.1. Dataset

Data in Interior Design Domain. Due to the current scarcity of high-quality datasets in the field of interior design in the academic community, we established a collaborative partnership with an internationally acclaimed interior design company, leveraging their extensive historical design records. Our annotation effort involved over 1,000 seasoned interior design professionals who meticulously labeled the data. This process resulted in a curated, high-quality dataset comprising 3,600 image-text pairs. To ensure robust model training and evaluation, we divided the dataset into training and testing sets at a 9:1 ratio. This data collection and partitioning strategy laid the foundation for our research, enabling us to explore the intricate relationships between text and images in the domain of interior design.
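To make the data format concrete, below is a minimal sketch, not the authors' actual pipeline, of how an image-tag record and the 9:1 train/test split could be represented in Python; the `DesignRecord` class and its field names are hypothetical.

```python
import random
from dataclasses import dataclass, field

@dataclass
class DesignRecord:
    """One (X, Y) pair: an image X and its parallel tag labels Y."""
    image_path: str                              # X: rendering of the designed space
    tags: list = field(default_factory=list)     # Y: discretized designer tags
    recaption: str = ""                          # filled in later by the captioner

def split_dataset(records, train_ratio=0.9, seed=0):
    """Shuffle and split records into train/test sets at the given ratio (9:1 here)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

records = [DesignRecord("living_room_001.jpg",
                        ["modern minimalist style", "wooden floor", "dark sofa"])]
train_set, test_set = split_dataset(records)
```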
Data Recaptioning. Our interior design dataset is composed of high-quality pairings (X, Y), where X is an image and Y is a text composed of multiple parallel labels that describe the image. In interior design, Y generally comes from discretized tags annotated by designers or derived from web-crawled sources, focusing on simple descriptions of materials, styles, and colors. Worse, the web-crawled resources often contain irrelevant tags that cannot accurately describe the images.

Since the discretized tags inevitably overlook the spatial layout and local details of interior design, and irrelevant tags may mislead the model, we theorize that such shortcomings can be addressed with synthetically generated captions. For this purpose, we first collect and structure various interior design datasets. Then, following DALL-E 3 [14], we formulate a set of carefully designed system prompts to invoke the GPT-3.5 interface and rewrite the alt text into descriptive synthetic captions.
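The paper does not release its system prompts, so the following is only a hedged sketch of how discretized tags could be rewritten into descriptive captions with an LLM; `call_llm` and the prompt wording are placeholders for the GPT-3.5 interface and prompts the authors actually used.

```python
# Hypothetical recaptioning sketch: rewrite discretized design tags into a
# fluent, detailed caption. `call_llm` stands in for a GPT-3.5-style chat API.
SYSTEM_PROMPT = (
    "You are an interior design captioner. Rewrite the given list of design "
    "tags into one fluent, detailed paragraph describing the room's materials, "
    "layout, lighting, colors, and style. Do not invent elements that are not "
    "implied by the tags."
)

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for an actual LLM chat-completion call."""
    raise NotImplementedError

def recaption(tags):
    user_prompt = "Tags: " + "; ".join(tags)
    return call_llm(SYSTEM_PROMPT, user_prompt)

# Example:
# recaption(["modern minimalist style", "wooden floor", "dark sofa",
#            "floor-to-ceiling windows", "natural light"])
```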
2.2. CLIP Training

Vision-language foundation models such as CLIP [22] are a crucial component for aligning image and text representations, as they capture correlations between cross-modal features. Since the open-source CLIP model cannot meet the requirements of bilingual adaptation and multi-element cognition in interior design, we initialize from the pre-trained English-only CLIP and continue training in two stages. In the first stage, we collect a large-scale, web-crawled set of bilingual image-text pairs, including Laion [29] and Wukong [12], and make an effort to clean the data. We take the contrastive loss as the training objective and utilize a distributed, memory-efficient CLIP training approach to reduce memory consumption [7]. In the second stage, we continue training our CLIP on high-quality general and interior design image-text pairs preprocessed by the image captioner. In the interior design domain, the same image may match multiple distinctly different texts, each observing the image from a different perspective and level of detail.
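For reference, the contrastive objective mentioned above is, in standard CLIP training, the symmetric image-text InfoNCE loss sketched below; this assumes the authors follow the usual CLIP recipe and is not their released code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from the two CLIP encoders.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```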
2.3. iDesigner Training

This section discusses the core modules involved in training iDesigner. First, we introduce the fundamental text-to-image generation process and describe how we replace the text encoder of the open-source SD-XL with our previously trained CLIP model. Second, we detail our curriculum learning approach and the reinforcement learning method based on CLIP feedback. The overall training process, including data recaptioning, curriculum learning, and RLCF, is illustrated in Figure 2.

Figure 2. An illustration of the overall training process of iDesigner, which includes data recaptioning, curriculum learning, and reinforcement learning from CLIP feedback (RLCF).

2.3.1 Text-to-image Generation Process

In text-to-image generation with diffusion models, the methodology can be broadly divided into two phases.

Text Encoding. Traditional models employ the CLIP text encoder for feature extraction from textual descriptions. For Chinese-specific applications, the CLIP text encoder is replaced with a dedicated Chinese encoder. This adaptation ensures better alignment with Chinese linguistic structures and semantics.

Text-to-image Generation. Once textual features are extracted, they are incorporated into the latent diffusion process. Running the diffusion process in latent space offers computational efficiency, reducing both processing time and memory requirements. The final phase involves training on specific text-image datasets, with the aim of refining the model's capability to generate images that closely match the input textual descriptions.
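A hedged sketch of the text-encoder replacement is shown below using the Hugging Face diffusers interface; the checkpoint identifiers are placeholders, SD-XL actually conditions on two text encoders whose handling the paper does not detail, and the swapped encoder only becomes usable after the fine-tuning described in the following subsections.

```python
from diffusers import StableDiffusionXLPipeline
from transformers import CLIPTextModel, CLIPTokenizer

# Load the SD-XL backbone; the public base checkpoint is used only for illustration.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
)

# Hypothetical path to the bilingual CLIP text encoder trained in Sec. 2.2.
BILINGUAL_CLIP = "path/to/bilingual-clip"

# Swap the primary tokenizer and text encoder. SD-XL conditions on two text
# encoders, and the new encoder's hidden size must match what the UNet
# cross-attention expects, which is why the model is subsequently fine-tuned
# (Sec. 2.3.2) rather than used as-is.
pipe.tokenizer = CLIPTokenizer.from_pretrained(BILINGUAL_CLIP)
pipe.text_encoder = CLIPTextModel.from_pretrained(BILINGUAL_CLIP)
```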
2.3.2 Curriculum Learning

In the context of iDesigner, the nuanced domain of interior design requires a model that is not only proficient in generating images but also exceptional at rendering high-resolution images in which even the smallest design elements are crisply defined. To this end, iDesigner employs a curriculum learning (CL) approach, inspired by the pioneering concept introduced by [3], which mirrors the human learning process of progressing from simple to complex levels of understanding. In non-convex optimization problems, curriculum learning has been shown to substantially enhance performance and generalization: it speeds up convergence and facilitates the discovery of superior local optima.

Let G_θ denote the mapping from textual descriptions to image outputs, parameterized by θ, the set of model parameters, and let (x_i, y_i) denote the paired association of an image x_i and its corresponding textual descriptor y_i, which together constitute the i-th instance of the training corpus. The curriculum comprises two steps, each with a dedicated loss function. For the initial step (Step 1) at resolution 1024 × 1024, we define a loss L_1 that focuses on the global structure and basic elements of the design space:

L_1(θ) := E_{E(x_low), y, ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, τ_θ(y))‖₂² ],   (1)

where x_low is the version of x_i downsampled to the lower resolution, and L_1 measures the difference between the generated low-resolution image and the ground truth. The model can be interpreted as an equally weighted sequence of denoising latent autoencoders ε_θ(z_t, t), t = 1, …, T, where ε_θ is realized as a time-conditional UNet [27]. Since the forward process is fixed, z_t can be obtained efficiently from E(x) during training and can be decoded back to image space with a single pass through the VAE decoder [16]. The text conditioning τ_θ is parameterized with a transformer text encoder, and both τ_θ and ε_θ are jointly optimized via Eq. (1).

This foundational step allows the model to establish an understanding of the broader aesthetic and functional principles of interior design without being overwhelmed by the intricacies of high-resolution details. Once the model has demonstrated proficiency in generating coherent and contextually accurate images at this resolution, the curriculum progresses to Step 2, where the resolution is elevated to 2048 × 2048. We introduce a loss function L_2 that aims at refining the generated images to capture detailed design elements:

L_2(θ) := E_{E(x_high), y, ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, τ_θ(y))‖₂² ],   (2)

where L_2 quantifies the fidelity of the generated high-resolution image in terms of texture, pattern detail, and local design-element accuracy. This higher-resolution phase challenges the model to refine its generative capabilities, focusing on minute details such as the texture of fabrics, the play of light on different surfaces, and the precise appearance of small furniture and objects that are pivotal in a realistic interior design rendering.

The overall curriculum can be described by a compound loss L over the course of training epochs, combining L_1 and L_2 with a weighting function α(e) that adjusts the contribution of each loss over time:

L(θ, e) = α(e) · L_1(θ) + (1 − α(e)) · L_2(θ),   (3)

where α(e) is a monotonically decreasing function of the epoch e, with α(0) = 1 at the beginning of training, gradually decreasing to 0 as training proceeds. This effectively shifts the training focus from the global structure to the intricate details as the model's capacity increases. The model parameters θ are updated iteratively using stochastic gradient descent or one of its variants to minimize L(θ, e) over epochs:

θ_{e+1} = θ_e − η · ∇_θ L(θ_e, e),   (4)

where η is the learning rate and ∇_θ denotes the gradient with respect to the parameters θ.

This incremental approach is crucial: our experiments show that directly training a model at 2048 × 2048 pixels on interior design data, when the underlying base model is trained at 1024 × 1024, leads to structural imbalances in the generated images. These imbalances manifest as discrepancies in the spatial arrangement of furniture, inconsistencies in texture and pattern detail, and a general loss of image cohesion. The curriculum learning strategy in iDesigner prevents such imbalances by allowing the model to develop a hierarchical understanding of interior design elements: it first masters the global layout and composition before delving into local details. This staged approach ensures that generated images maintain structural integrity at higher resolutions and improves the model's generalization capability and convergence rate, guiding iDesigner through a structured learning pathway that handles increasing levels of complexity in a controlled manner, reflective of the staged learning process in human education, and demonstrating the efficacy of curriculum learning strategies in the complex domain of interior design text-to-image synthesis. The overall training procedure of iDesigner is summarized in Algorithm 1.

Algorithm 1: Curriculum Learning for iDesigner
1: Initialize iDesigner G_θ with noise predictor ε_θ, text encoder τ_θ, latent encoder E, and dataset D
2: for epoch in range(n_epochs) do
3:   Sample a batch of data B from D
4:   for (x_i, y_i) ∈ B do
5:     Generate text embeddings τ_θ(y_i)
6:     Generate latent embeddings z_i = E(x_i)
7:     t ∼ Uniform({1, …, T})
8:     ε ∼ N(0, I)
9:     Obtain z_i^t by adding noise ε to z_i
10:    Feed (z_i^t, τ_θ(y_i)) to ε_θ to generate noise predictions ε_pred
11:    Compute L over the batch B between ε and ε_pred using Eq. (3)
12:    Update θ by backpropagating L using Eq. (4)
13:  end for
14: end for
15: return the fine-tuned model G_θ
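To make the compound objective of Eqs. (1)-(4) and Algorithm 1 concrete, here is a minimal PyTorch-style sketch of one curriculum training step; the callables `unet`, `text_encoder`, `vae_encode`, and `add_noise` are placeholders for the corresponding SD-XL components, and the linear form of α(e) is an assumption, since the paper only requires that it decrease monotonically from 1 to 0.

```python
import torch
import torch.nn.functional as F

def alpha_schedule(epoch, n_epochs):
    """Monotonically decreasing weight with alpha(0) = 1, decaying to 0 (Eq. 3).
    A simple linear schedule is assumed; the paper only requires monotonicity."""
    return max(0.0, 1.0 - epoch / n_epochs)

def denoising_loss(unet, text_encoder, vae_encode, add_noise, images, captions, T=1000):
    """Eqs. (1)/(2): MSE between sampled noise and the UNet's noise prediction."""
    cond = text_encoder(captions)                      # tau_theta(y)
    z0 = vae_encode(images)                            # E(x), latent of the image
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                         # epsilon ~ N(0, I)
    zt = add_noise(z0, eps, t)                         # fixed forward diffusion to step t
    eps_pred = unet(zt, t, cond)                       # epsilon_theta(z_t, t, tau_theta(y))
    return F.mse_loss(eps_pred, eps)

def curriculum_step(unet, text_encoder, vae_encode, add_noise,
                    low_res_batch, high_res_batch, epoch, n_epochs, optimizer):
    """One update of Eq. (4) on the compound loss of Eq. (3)."""
    a = alpha_schedule(epoch, n_epochs)
    l1 = denoising_loss(unet, text_encoder, vae_encode, add_noise, *low_res_batch)
    l2 = denoising_loss(unet, text_encoder, vae_encode, add_noise, *high_res_batch)
    loss = a * l1 + (1.0 - a) * l2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```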
2.3.3 RLCF: Reinforcement Learning from CLIP Feedback

Since prompts in the interior design field consist of many tags, even after the data recaptioning and the two-stage SFT in the design field described above, the images generated by iDesigner show increased relevance to the text but still exhibit mismatches between textual details and image content. We therefore introduce reinforcement learning, which has achieved tremendous success for LLMs, to enhance prompt following. Unlike ChatGPT [20] and LLaMA 2 [31], which use human feedback, we directly employ the CLIP model fine-tuned on the design domain as the feedback signal, and iterate using rejection sampling. In each iteration, we score the images generated by iDesigner against the original text with CLIP; the higher the score, the greater the relevance between image and text. We select the top-K images by CLIP score and, together with the original image-text pair, form a new set of (K+1) image-text pairs for fine-tuning iDesigner.

We now delineate the RLCF (Reinforcement Learning from CLIP Feedback) algorithm, which proceeds in three steps at each stage t+1:

Step 1: Data collection. A batch of prompts D_t = {y_1^t, …, y_b^t} is sampled from the text domain, and for each prompt y_i^t ∈ D_t, a set of images x_1, …, x_K is generated by the image synthesis model iDesigner G.

Step 2: Data ranking. Using the reward model CLIP, we compute a set of rewards {r(y, x_1), …, r(y, x_K)} for each prompt y ∈ D_t. We then select the image with the highest reward, x := argmax_{x_j ∈ {x_1, …, x_K}} r(y, x_j), and repeat this for all b prompts to form a subset B of size b.

Step 3: Model fine-tuning. The iDesigner model G_θ is fine-tuned on the subset B, after which the next stage of the learning process commences.

This iterative process continues until the reward, as determined by the CLIP model, converges. The RLCF algorithm requires minimal hyperparameter tuning and is straightforward to implement. It capitalizes on a best-of-K policy, in which the model iteratively learns to produce image samples that are increasingly aligned with the highest rewards gauged by CLIP, thereby refining iDesigner's generation capabilities.
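The following sketch illustrates one RLCF stage (sample, score with CLIP, keep the best of K, fine-tune); `generate_images`, `clip_score`, and `finetune` are placeholders for iDesigner's sampler, the design-domain CLIP reward model, and the supervised fine-tuning routine, and the single best-of-K selection follows Step 2 above.

```python
def rlcf_stage(prompts, generate_images, clip_score, finetune, k=4):
    """One stage of RLCF: best-of-K rejection sampling guided by CLIP rewards.

    prompts:          list of text prompts sampled from the design domain (D_t)
    generate_images:  fn(prompt, n) -> list of n candidate images from iDesigner
    clip_score:       fn(prompt, image) -> scalar reward r(y, x)
    finetune:         fn(list of (prompt, image) pairs) -> None, updates iDesigner
    """
    selected = []
    for y in prompts:
        candidates = generate_images(y, k)                 # Step 1: data collection
        rewards = [clip_score(y, x) for x in candidates]   # Step 2: data ranking
        best = candidates[max(range(k), key=lambda j: rewards[j])]
        selected.append((y, best))
    finetune(selected)                                      # Step 3: model fine-tuning
    return selected
```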
3. Experiment and Result

Training Settings. For our iDesigner model, we employ the pre-trained checkpoint of Stable Diffusion XL (SD-XL) [21] as the foundational backbone, ensuring a robust starting point for image generation tasks. To optimize resource utilization and expedite training, we use the BFLOAT16 format, which significantly reduces GPU memory requirements while maintaining training efficiency. Our training regimen adopts a learning rate of 1e-5, stabilized initially through a warm-up phase and followed by a cosine decay schedule that gradually reduces the learning rate, facilitating fine-tuning and convergence to a more precise model state. These settings are critical in balancing training speed against model performance.

Baselines. In our comparative analysis, we consider two strong baselines: DALL-E 3 [14] and SD-XL [21]. DALL-E 3 is renowned for its innovative text-to-image capabilities, generating high-quality images from textual descriptions, and serves as a benchmark for cutting-edge generative models. SD-XL is a variant of the Stable Diffusion model known for its extended capabilities in handling complex image synthesis tasks. By comparing iDesigner with these established models, we aim to demonstrate the effectiveness and advancements of our approach, particularly in terms of bilingual image generation and adherence to textual prompts.

Evaluation Protocols. Our evaluation framework encompasses both machine and human assessments. Machine evaluation metrics include CLIP retrieval performance (image-to-text and text-to-image retrieval); CLIP Similarity (CLIP Sim), which measures the semantic alignment between generated images and text descriptions; Inception Score (IS), assessing the quality and diversity of the images; and Fréchet Inception Distance (FID), evaluating the distance between the distributions of generated and real images. Human evaluation involves subjective assessments by a group of evaluators, who rate the images based on visual appeal, relevance to the provided prompts, and overall aesthetic quality. This dual approach ensures a well-rounded evaluation, combining objective computational assessments with human perceptual judgments.
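As a reference for the machine metrics above, CLIP Sim can be computed as the cosine similarity between CLIP image and text embeddings, as sketched below; the public checkpoint name is used only for illustration and stands in for the authors' fine-tuned bilingual CLIP.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# A generic public checkpoint is used here for illustration; the paper scores
# images with its own bilingual CLIP fine-tuned on interior design data.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_sim(image, text):
    """Cosine similarity between CLIP embeddings of a generated image and its prompt."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1).item()
```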
3.1. Machine Evaluation

Our CLIP model achieves the best performance on both English and Chinese datasets; detailed retrieval results are reported in Appendix B. On the image-text retrieval tasks of Flickr30K [36] and MSCOCO [19] and their Chinese counterparts, the original CLIP model demonstrates a foundational understanding, with modest retrieval rates that highlight the challenge of transferring learning across languages. In contrast, AltCLIP [8] and our-CLIP exhibit remarkable improvements, with our-CLIP attaining the highest recall rates on most metrics. Notably, in the text-to-image retrieval task, our-CLIP achieves 88.1% and 69.7% recall@1 on the Flickr30K-CN [36] and MSCOCO-CN [18] datasets, respectively, indicating a robust alignment between text prompts and visual content. These results underscore the efficacy of tailored modifications in enhancing CLIP's cross-lingual performance and the potential of specialized models in handling diverse linguistic contexts within multimodal applications.

Table 1 offers a comprehensive overview of the performance of the compared models on the English and Chinese datasets, evaluated with CLIP Similarity (CLIP Sim), Inception Score (IS), and Fréchet Inception Distance (FID).

Table 1. Comparison of different models based on CLIP Sim, IS, and FID across the English and Chinese datasets. The best results, excluding the original Test Set, are marked in bold.

English Dataset
| Model          | CLIP Sim (↑) | IS (↑) | FID (↓) |
|----------------|--------------|--------|---------|
| Test Set       | 0.205        | 4.838  | 0       |
| SD-XL [21]     | 0.112        | 3.450  | 95.867  |
| DALL-E 3 [14]  | 0.118        | 3.832  | 89.906  |
| iDesigner 1024 | 0.135        | 4.562  | 79.340  |
| iDesigner 2048 | 0.137        | 4.559  | 79.262  |
| iDesigner RLCF | 0.145        | 4.690  | 76.832  |

Chinese Dataset
| Model          | CLIP Sim (↑) | IS (↑) | FID (↓) |
|----------------|--------------|--------|---------|
| Test Set       | 0.181        | 4.838  | 0       |
| SD-XL [21]     | 0.096        | 3.007  | 95.439  |
| DALL-E 3 [14]  | 0.106        | 3.201  | 90.236  |
| iDesigner 1024 | 0.129        | 4.004  | 80.172  |
| iDesigner 2048 | 0.126        | 4.309  | 79.816  |
| iDesigner RLCF | 0.136        | 4.315  | 78.102  |

Notably, the iDesigner RLCF model demonstrates superior performance in both linguistic contexts, achieving the highest CLIP Sim and IS and the lowest FID, distinctly outperforming SD-XL and DALL-E 3. On the English dataset, iDesigner RLCF attains a CLIP Sim of 0.145, indicating a more refined alignment between text and image than SD-XL and DALL-E 3, which score 0.112 and 0.118, respectively. This enhancement in semantic consistency is further corroborated by its IS of 4.690, surpassing the iDesigner 1024 and iDesigner 2048 variants and suggesting a more accurate capture of nuanced image details. The model's proficiency in generating high-quality images is also reflected in its FID of 76.832, the lowest among the compared models, indicating closer proximity to the real image distribution.

The trend is consistent on the Chinese dataset, where iDesigner RLCF again leads with a CLIP Sim of 0.136, an IS of 4.315, and an FID of 78.102. Compared with the English dataset, a slight variation in scores is observed, possibly attributable to the linguistic and cultural differences inherent in the datasets. Nevertheless, the model's robustness across languages is evident, marking a significant advancement in bilingual image generation capabilities. These results collectively underline the effectiveness of the RLCF approach in iDesigner, particularly in enhancing the model's ability to comprehend and respond accurately to complex textual prompts, thereby generating images that are not only visually appealing but also semantically coherent across diverse linguistic contexts.
3.2. Human Preference Evaluation

In addition to the automated metrics presented in Table 1, we perform a human preference evaluation to directly assess the perceptual quality of the generated images. Participants were asked to compare images generated by iDesigner, DALL-E 3, and SD-XL in response to a variety of prompts, and were blinded to the model origins of the images to prevent potential bias. The results, depicted in Figure 3, clearly indicate a preference for iDesigner, with a win rate of over 58% against both the DALL-E 3 and SD-XL baselines. This substantial margin not only reinforces the quantitative findings from the CLIP scores but also illustrates the qualitative leap in generation fidelity that iDesigner represents. The generated images were frequently cited as more coherent, aesthetically pleasing, and true to the textual descriptions provided.

Figure 3. Results of the human preference evaluation, illustrating the win/lose/tie rates of our iDesigner method against the other competing models. The iDesigner model demonstrates a clear preference among human evaluators, substantiating its superior image generation capabilities.

This subjective evaluation underscores the effectiveness of iDesigner in preserving the semantic essence of the original prompts while producing images that resonate more strongly with human judges. The ability of iDesigner to maintain high CLIP scores while also securing human preference attests to its advanced capability in generating semantically and visually compelling images; the feedback from human evaluators provides insights that go beyond numerical scores, highlighting the nuanced improvements iDesigner brings to text-to-image synthesis. In Figure 4, we present images generated from the same prompt by iDesigner, SD-XL, and DALL-E 3. Compared with the other T2I models, our results are more closely aligned with the stylistic essence of human designer output and exhibit superior image quality.

Origin caption: "The specific space description is a star hotel, the theme is Nordic skiing family vacation, the main material of the ground is wooden floor + stone, the main material of the wall is cloth + wood veneer, the soft decoration material of the space is beige fabric + light wood veneer, the picture perspective is one-point perspective, the space color is warm, the light is floor-to-ceiling panoramic windows + daytime natural light, the space is 45 square meters, the style is Nordic style, the space type is hotel, the main material of the top surface is latex paint, the main material of the space is art glass sliding door + cloth wall, and the space function is the guest room reception area."

Recaption: "A hotel guest room meeting area, designed in a Nordic style for a ski family vacation, spanning 45 square meters. The space features wooden flooring and stone materials, with walls adorned in textured fabric and wood finishes. The soft furnishings include beige fabric and light wood finishes. The perspective is one-point, showcasing a warm color palette. Natural light floods in through floor-to-ceiling panoramic windows. The ceiling is finished with latex paint, and the space is enhanced with artistic glass sliding doors and fabric wall coverings, embodying the cozy and functional Nordic design ethos."

Figure 4. The images, from left to right, are generated respectively by a human designer, iDesigner, SD-XL, and DALL-E 3.

3.3. Ablation Study

In our ablation study, shown in Table 2, we systematically evaluate the contribution of each component of the iDesigner model. The complete iDesigner model with RLCF achieves the highest scores across all metrics. Removing RLCF results in a notable decrease in CLIP Sim to 0.120, highlighting its significant role in improving text-image semantic alignment. Removing curriculum learning impacts both the Inception Score and FID, with a decrease to 4.200 and an increase to 85.000, respectively, indicating its importance for image quality and diversity. The absence of the captioner has a milder effect, evidenced by a slight decrease in the Inception Score to 4.500 and a marginal increase in FID to 76.900.

Table 2. Ablation study on iDesigner. Cap: Captioner, CL: Curriculum Learning, RLCF: Reinforcement Learning from CLIP Feedback.

| Model Variant      | CLIP Sim (↑) | IS (↑) | FID (↓) |
|--------------------|--------------|--------|---------|
| iDesigner          | 0.145        | 4.690  | 76.832  |
| iDesigner w/o Cap  | 0.142        | 4.500  | 76.900  |
| iDesigner w/o CL   | 0.140        | 4.200  | 85.000  |
| iDesigner w/o RLCF | 0.120        | 4.650  | 77.100  |

These results collectively illustrate the synergistic effect of these components in optimizing the performance of the iDesigner model, with each playing a critical role in achieving high-quality, semantically coherent image generation. Overall, the ablation study underscores the integral role of each component in the final performance of iDesigner: the harmonized interplay between the captioner, curriculum learning, and rejection sampling endows iDesigner with its ability to generate images that are not only visually compelling but also deeply resonant with the textual prompts provided by users.
4. Related Work

Image Generation and Diffusion Models. The field of text-to-image generation has witnessed significant advances in recent years. Compared with earlier approaches such as GANs [2, 11], VAEs [16], flow-based models [25], and autoregressive models [9, 10, 23], this work places greater emphasis on diffusion models. With the advancement and maturation of diffusion theory and techniques [5, 13, 30, 33], diffusion models have become one of the mainstream technologies in image generation. Notable developments include DALL-E 2 [24], which introduces a hierarchical approach to generating images conditioned on textual descriptions using CLIP latents, while [14] pointed out that better captions can improve image generation quality. Imagen [28] and DeepFloyd IF [1] present diffusion models that generate photorealistic images from textual descriptions with an emphasis on deep language understanding. The currently most popular diffusion models are latent diffusion models [26], including a series of works such as stable-diffusion-v1-5, stable-diffusion-2-1, and stable-diffusion-xl [21]. These models primarily extract textual features with the CLIP text model and incorporate them into the latent diffusion process; conducting the diffusion process in the latent space reduces computational overhead and memory requirements. Moreover, owing to significant advances in reinforcement learning for large language models, there have been attempts [4, 15, 32, 38] to integrate reinforcement learning into diffusion models, aiming to enhance generation quality and the degree of textual control. Although diffusion models have been employed in various design fields such as character design, scene design, and architectural design, most applications are based on simple fine-tuning of the stable diffusion model. Given the multitude of elements and the complexity of scenes characteristic of interior design, there remains a lack of a customized text-to-image model tailored specifically to this domain.

Bilingual Text-to-image Models. To better serve text-to-image needs in Chinese scenarios, Chinese researchers have proposed numerous works. Mainstream Chinese diffusion models are mostly derived from further training of stable diffusion, typically in two steps. The first step replaces the CLIP text encoder with a bilingual or Chinese encoder, followed by pre-training for text-image matching on a Chinese text-image dataset; representative works include Taiyi-CLIP [37], Chinese-CLIP [34], and AltCLIP [8]. The second step replaces the text encoder in stable diffusion and continues training for text-to-image generation on a Chinese text-image dataset, yielding Chinese diffusion models such as Taiyi-Diffusion [37] and AltDiffusion [35]. However, replacing the CLIP text encoder often causes the text-to-image model to lose its English capability, and the training cost can be relatively high.

Text-image Datasets. Whether for text-image matching or text-to-image generation, datasets play a crucial role. Traditional image caption datasets, such as MSCOCO [19] and Flickr30K [36] in English and MSCOCO-CN [18] and Flickr30K-CN [17] in Chinese, are suitable for training but are relatively small, typically below one million pairs. As a result, web-crawled datasets such as Laion [29] (primarily English) and Wukong [12] (primarily Chinese), which have reached scales of 100 million or even 5 billion pairs, have become more critical sources of training data for diffusion text-to-image models.
5. Conclusion

In this paper, we propose the first Chinese-English bilingual text-to-image model for the interior design field and optimize the data processing and model training methods to address the needs of this domain, including data scarcity, complex textual descriptions and image elements, and diverse image resolutions, ultimately obtaining strong generation results in this field. In the future, we will continue to optimize the model in several directions, including increasing the quantity and quality of text-image data, incorporating the knowledge of large language models, and employing multi-dimensional feedback reinforcement learning.

References

[1] Alex Shonenkov, Misha Konstantinov, Daria Bakshandaeva, Christoph Schuhmann, Ksenia Ivanova, and Nadiia Klokova. DeepFloyd IF, 2023.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
[3] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
[4] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.
[5] Hanqun Cao, Cheng Tan, Zhangyang Gao, Guangyong Chen, Pheng-Ann Heng, and Stan Z Li. A survey on generative diffusion model. arXiv preprint arXiv:2209.02646, 2022.
[6] Junming Chen, Zichun Shao, and Bin Hu. Generating interior design from text: A new diffusion model-based method for efficient creative design. Buildings, 13(7):1861, 2023.
[7] Yihao Chen, Xianbiao Qi, Jianan Wang, and Lei Zhang. Disco-CLIP: A distributed contrastive loss for memory efficient CLIP training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22648–22657, 2023.
[8] Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. AltCLIP: Altering the language encoder in CLIP for extended language capabilities. arXiv preprint arXiv:2211.06679, 2022.
[9] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
[10] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. CogView2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems, 35:16890–16902, 2022.
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[12] Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431, 2022.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[14] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. OpenAI, cdn.openai.com/papers/dall-e3.pdf, 2023.
[15] Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
[16] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[17] Xirong Li, Weiyu Lan, Jianfeng Dong, and Hailong Liu. Adding Chinese captions to images. In Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, pages 271–275, 2016.
[18] Xirong Li, Chaoxi Xu, Xiaoxu Wang, Weiyu Lan, Zhengxiong Jia, Gang Yang, and Jieping Xu. COCO-CN for cross-lingual image tagging, captioning, and retrieval. IEEE Transactions on Multimedia, 21(9):2347–2360, 2019.
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer, 2014.
[20] OpenAI. Introducing ChatGPT, 2022.
[21] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[23] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[24] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[25] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
[26] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
[27] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, pages 234–241. Springer, 2015.
[28] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[29] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
[30] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[31] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[32] Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, and Glen Berseth. Reasoning with latent diffusion in offline reinforcement learning. arXiv preprint arXiv:2309.06599, 2023.
[33] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
[34] An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. Chinese CLIP: Contrastive vision-language pretraining in Chinese. arXiv preprint arXiv:2211.01335, 2022.
[35] Fulong Ye, Guangyi Liu, Xinya Wu, and Ledell Yu Wu. AltDiffusion: A multilingual text-to-image diffusion model. arXiv preprint arXiv:2308.09991, 2023.
[36] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[37] Jiaxing Zhang, Ruyi Gan, Junjie Wang, Yuxiang Zhang, Lin Zhang, Ping Yang, Xinyu Gao, Ziwei Wu, Xiaoqun Dong, Junqing He, et al. Fengshenbang 1.0: Being the foundation of Chinese cognitive intelligence. arXiv preprint arXiv:2209.02970, 2022.
[38] Zhengbang Zhu, Hanye Zhao, Haoran He, Yichao Zhong, Shenyu Zhang, Yong Yu, and Weinan Zhang. Diffusion models for reinforcement learning: A survey. arXiv preprint arXiv:2311.01223, 2023.

A. Example Images Generated by iDesigner

Figure 5 presents example images generated by the proposed iDesigner, illustrating the effectiveness of our approach in generating high-quality images.

Figure 5. Examples of image comparisons at different resolutions. The images on top are generated by a human designer; the images below are generated by iDesigner.
B. CLIP Results on General Datasets

Tables 3 and 4 present the performance of our CLIP model on datasets with English and Chinese captions, respectively. A CLIP model endowed with robust bilingual comprehension can significantly enhance the ability of iDesigner to understand user-input prompts and subsequently generate images that accurately conform to the given prompts.

Table 3. Zero-shot image-text retrieval results on the Flickr30K [36] and MSCOCO [19] datasets.

Flickr30K
| Model       | Image → Text R@1 | R@5  | R@10 | Text → Image R@1 | R@5  | R@10 |
|-------------|------------------|------|------|------------------|------|------|
| CLIP [22]   | 85.1             | 97.3 | 99.2 | 65.0             | 87.1 | 92.2 |
| AltCLIP [8] | 86.0             | 98.0 | 99.1 | 72.5             | 91.6 | 95.4 |
| our-CLIP    | 88.4             | 98.8 | 99.9 | 75.7             | 93.8 | 96.9 |

MSCOCO
| Model       | Image → Text R@1 | R@5  | R@10 | Text → Image R@1 | R@5  | R@10 |
|-------------|------------------|------|------|------------------|------|------|
| CLIP [22]   | 56.4             | 79.5 | 86.5 | 36.5             | 61.1 | 71.1 |
| AltCLIP [8] | 58.6             | 80.6 | 87.8 | 42.9             | 68.0 | 77.4 |
| our-CLIP    | 61.2             | 84.8 | 90.3 | 49.2             | 70.3 | 79.6 |

Table 4. Zero-shot image-text retrieval results on the Flickr30K-CN [36] and MSCOCO-CN [18] datasets.

Flickr30K-CN
| Model       | Image → Text R@1 | R@5  | R@10 | Text → Image R@1 | R@5  | R@10 |
|-------------|------------------|------|------|------------------|------|------|
| CLIP [22]   | 2.3              | 8.1  | 12.6 | 0                | 2.4  | 4.0  |
| AltCLIP [8] | 69.8             | 89.9 | 94.7 | 84.8             | 97.4 | 98.8 |
| our-CLIP    | 73.2             | 90.3 | 96.5 | 88.1             | 98.2 | 99.1 |

MSCOCO-CN
| Model       | Image → Text R@1 | R@5  | R@10 | Text → Image R@1 | R@5  | R@10 |
|-------------|------------------|------|------|------------------|------|------|
| CLIP [22]   | 0.6              | 4.1  | 7.1  | 1.8              | 6.7  | 11.9 |
| AltCLIP [8] | 63.9             | 87.2 | 93.9 | 62.8             | 88.8 | 95.5 |
| our-CLIP    | 66.0             | 91.1 | 96.6 | 69.7             | 91.3 | 96.8 |
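For completeness, the recall@K numbers in Tables 3 and 4 can be computed with a routine along the following lines; this is a generic sketch that assumes precomputed, L2-normalized CLIP embeddings and a single ground-truth match per query, rather than the authors' evaluation code.

```python
import torch

def retrieval_recall_at_k(query_emb, gallery_emb, gt_index, ks=(1, 5, 10)):
    """Zero-shot retrieval recall@K.

    query_emb:   (N, d) normalized embeddings of the queries (e.g., captions)
    gallery_emb: (M, d) normalized embeddings of the gallery (e.g., images)
    gt_index:    (N,) tensor with the index of the ground-truth gallery item per query
    """
    sims = query_emb @ gallery_emb.t()                  # cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)       # (N, M) ranked gallery indices
    results = {}
    for k in ks:
        hits = (ranks[:, :k] == gt_index.unsqueeze(1)).any(dim=1)
        results[f"R@{k}"] = hits.float().mean().item() * 100
    return results
```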