Prompt Highlighter: Interactive Control for Multi-Modal LLMs

Yuechen Zhang1   Shengju Qian1   Bohao Peng1   Shu Liu2   Jiaya Jia1,2
1 The Chinese University of Hong Kong   2 SmartMore

arXiv:2312.04302v1 [cs.CV] 7 Dec 2023
https://julianjuaner.github.io/projects/PromptHighlighter/

[Figure 1: side-by-side examples of a language-only prompt (summarizing A Midsummer Night's Dream with controlled compactness on Vicuna) and a vision-language prompt (describing an image of a soccer player with LLaVA), comparing normal inference, prompt engineering, and inference with highlighted regions.]

Figure 1. Prompt Highlighter facilitates token-level user interactions for customized generation, compatible with both LLMs and VLMs. Compared with vanilla inference and prompt engineering, the context-highlighted inference provided by our method offers controllable generations and produces customized results. Outputs correlated with the highlighted parts are underlined.

Abstract

This study targets a critical aspect of multi-modal LLMs' (LLMs & VLMs) inference: explicit controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation, yet bring less explainability and heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats could improve outputs, designing specific and precise prompts per task can be challenging and ineffective. To tackle this issue, we introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation.
Motivated by classifier-free diffusion guidance, we form regular and unconditional context pairs based on highlighted tokens, demonstrating that autoregressive generation in these models can be guided in a classifier-free way. Notably, we find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs. Our approach is compatible with current LLMs and VLMs, achieving impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Without tuning on LLaVA-v1.5, our method secured 69.5 in the MMBench test and 1552.5 in MME-perception. Code is available at: https://github.com/dvlab-research/Prompt-Highlighter/.

1. Introduction

Large Language Models (LLMs) have driven significant progress in a multitude of natural language processing tasks [1–9]. Further advancements have been achieved by extending these models to handle vision-language tasks [10–15] through visual-language alignment and instruction tuning. These efforts have led to the development of Vision-Language Models (VLMs), which can generate text based on multi-modal inputs.

Due to its autoregressive nature, the typical generation process in LLMs and VLMs (multi-modal LLMs) is primarily conditioned on input contexts. Prompt engineering [16–19] has emerged as a common interaction mechanism between humans and language models, where diverse formats and contents of prompts are employed to steer the generation towards desired outcomes. However, prompt engineering often relies on empirical intuition and requires careful design of the context, making it less accessible for non-experts. As illustrated in the left part of Fig. 1, even meticulously crafted prompts that convey the concept of 'compactness' clearly can lead to unpredictable outputs that fail to meet the requirements.

Instead of manipulating prompt-level contexts (i.e., prompt engineering) to control LMs' generation process, we propose a novel inference approach, Prompt Highlighter, that enables token-level user interactions for personalized generation. Our method allows users to interact with multi-modal LLMs in a manner analogous to applying a highlighter tool to the input context in a text editor, emphasizing desired parts by highlighting them. This highlighting mechanism is achieved by constructing a regular and unconditional input context pair with different textual embeddings for the highlighted tokens. Subsequently, we adjust the model's focus on the highlighted components by employing classifier-free guidance [20–22] on the predicted token probabilities.

Moreover, by probing cross-token attention maps, we discover a robust correlation between attention scores and the semantic significance of tokens. This suggests that, in the autoregressive generation process of language models, the semantic relationship between tokens can be represented to a certain extent by their attention scores. Building on this insight, we introduce an attention activation strategy that adjusts the attention weights associated with a highlighted part. Specifically, Prompt Highlighter employs an adjusted attention mask to reweight the corresponding attention scores, enabling a more focused generation on highlighted parts. As illustrated in Fig. 1, compared to vanilla inference, our highlighted inference can guide the generation process to produce controllable results that align more closely with user needs.

Prompt Highlighter is compatible with mainstream transformer-based multi-modal LLMs. This compatibility encompasses VLMs that use precise patch-wise visual token mapping, such as LLaVA [10, 23, 24], as well as methods that employ implicit query-based visual token mapping, like those based on Q-Former [11, 13–15].
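To make the interaction concrete, the following is a minimal sketch of such a context-pair decoding loop, assuming a HuggingFace-style causal LM that accepts `inputs_embeds` and returns `.logits`. The function, the guidance scale `gamma`, and the choice of simply attenuating highlighted embeddings in the unconditional branch (`uncond_scale`) are illustrative assumptions rather than the released implementation; the attention reweighting described above is omitted here and sketched later.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def highlighted_generate(model, embed_fn, input_ids, highlight_mask,
                         gamma=1.5, max_new_tokens=64, uncond_scale=0.0):
    """Greedy decoding guided by a token-level highlight mask.

    input_ids:      (1, N) prompt tokens.
    highlight_mask: (1, N), 1 for highlighted tokens, 0 otherwise.
    The unconditional branch sees the same prompt, but the embeddings of the
    highlighted tokens are attenuated (an illustrative choice), so the
    guidance term pushes predictions toward what the highlighted part implies.
    """
    embeds = embed_fn(input_ids)                      # (1, N, d) regular context
    uncond = embeds.clone()
    uncond[highlight_mask.bool()] *= uncond_scale     # perturb highlighted embeddings only

    generated = []
    for _ in range(max_new_tokens):
        logits_c = model(inputs_embeds=embeds).logits[:, -1]   # conditional branch
        logits_u = model(inputs_embeds=uncond).logits[:, -1]   # unconditional branch

        # Classifier-free guidance on next-token log-probabilities (LLM-CFG style).
        logp_c = F.log_softmax(logits_c, dim=-1)
        logp_u = F.log_softmax(logits_u, dim=-1)
        guided = logp_u + gamma * (logp_c - logp_u)

        next_id = guided.argmax(dim=-1, keepdim=True)           # (1, 1)
        generated.append(next_id)

        # Append the new token to both branches and continue decoding.
        next_embed = embed_fn(next_id)
        embeds = torch.cat([embeds, next_embed], dim=1)
        uncond = torch.cat([uncond, next_embed], dim=1)
    return torch.cat(generated, dim=1)
```

In practice, `embed_fn` can be `model.get_input_embeddings()`; setting `gamma = 1` recovers vanilla greedy decoding, while larger values push the continuation toward the highlighted context.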
This novel interaction paradigm with highlighted sections during the generation process goes beyond what prompt engineering can offer. We further demonstrate the effectiveness of Prompt Highlighter by evaluating it on comprehensive multi-modal benchmarks. We verify that directly highlighting the full image context in VLMs can significantly improve the quality of generated image captions [25] and question-answering results. Specifically, our method can effectively mitigate the model's propensity to hallucinate by guiding its focus toward reliable contexts, thereby enhancing overall performance. Notably, without additional training, our method improves the performance of the baseline LLaVA-v1.5, securing 2nd place on both the MMBench [26] and MME-perception [27] leaderboards.

Our contributions can be summarized as follows: (1) We pioneer the exploration of fine-grained human-model interactions in multi-modal LLMs, proposing a plug-and-play pipeline that enables token-level user interactions for controllable generation. (2) We conduct extensive experiments on comprehensive benchmarks, demonstrating that our method significantly enhances overall performance.

2. Related Works

2.1. Multi-Modal LLMs

Recent Large Language Models (LLMs) [1, 7–9, 28–30] play a significant role in natural language processing tasks, particularly in language generation and question answering. Building upon these pre-trained language models, Vision-Language Models (VLMs) [10, 11, 13–15, 31] further introduce the alignment between vision and language modalities by leveraging extensive training on image-caption pairs or image-question conversations. There are two prevalent methods for aligning the vision and language modalities. The first, exemplified by LLaVA [10], directly maps image patches to tokens using a projector, establishing a one-to-one correspondence. The second, represented by models like BLIP-2 [13, 32], employs a Query Transformer (Q-Former) after extracting image features to establish a non-uniform patch-token mapping. These methods use learnable queries to obtain compressed image features, yielding visual tokens rich in semantic information.

2.2. Interactions with Multi-Modal LLMs

Prompt engineering and interactions. Given the autoregressive property of LLMs, users aim to control the generation results by modifying the input contexts. This largely determines the test-time interactions with LLMs, primarily executed through prompt engineering. Representative methods such as CoT [17] introduce demonstrations in the context to enhance reasoning ability. Other multi-branch designs like ToT and GoT [16, 18, 19, 33, 34] have been proposed for rich and reliable context generation and self-checking. Aside from prompt engineering, human-model interactions have not been extensively explored in VLMs. Methods like Kosmos-2 [31], LLaVA-Interactive [35], and LISA [36] enable grounded perception tasks such as detection, segmentation, and image editing through interaction with the language model. These task-oriented interactions require additional data collection and task-specific tuning. In contrast, Prompt Highlighter is plug-and-play for general text generation in pre-trained models.
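The interactions above all operate on whole prompts, whereas Prompt Highlighter starts from a token-level highlight mask m (Fig. 2). As a hypothetical illustration of that first step, the snippet below converts a user-highlighted character span of a text prompt into such a mask, assuming a HuggingFace fast tokenizer with offset mapping; the helper name, span format, and tokenizer checkpoint are illustrative and not taken from the paper. For image regions in VLMs, the analogous step would mark the visual tokens whose patches fall inside the highlighted region.

```python
import torch
from transformers import AutoTokenizer

def build_highlight_mask(tokenizer, prompt, highlight_spans):
    """Convert user-highlighted character spans into a token-level 0/1 mask.

    highlight_spans: list of (start, end) character offsets into `prompt`.
    Returns (input_ids, mask) with mask[i] = 1 if token i overlaps any span.
    """
    enc = tokenizer(prompt, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0]            # (N, 2) char offsets per token
    mask = torch.zeros(offsets.size(0), dtype=torch.long)
    for tok_idx, (tok_start, tok_end) in enumerate(offsets.tolist()):
        for span_start, span_end in highlight_spans:
            # Special tokens carry (0, 0) offsets; skip them.
            if tok_end > tok_start and tok_start < span_end and tok_end > span_start:
                mask[tok_idx] = 1
    return enc["input_ids"][0], mask

# Example: highlight the phrase "red and green jersey" in the prompt.
tok = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5", use_fast=True)
prompt = "Please describe the man wearing the red and green jersey."
span = (prompt.index("red"), prompt.index("jersey") + len("jersey"))
ids, m = build_highlight_mask(tok, prompt, [span])
```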
[Figure 2 depicts the pipeline: a multi-modal input with a highlight mask m; token-wise embeddings forming a normal context x and an unconditional context x̄; a frozen language model; a highlighted attention mask for self-attention activation; and CFG re-weighting of the logit predictions to update the next token embedding.]

Figure 2. An abstract pipeline of Prompt Highlighter. Users can control the focus of generation by marking out specific image regions or text spans. Then a token-level mask m is created to guide the language model's inference.

Classifier-free guidance and controllable generation. Classifier-Free Guidance (CFG) [20] enables control over the generation process of Diffusion Models without a conventional classifier. Specifically, CFG's step-wise sampling allows users to employ a negative prompt within the unconditional branch, effectively guiding the generation away from harmful distributions. This approach has been extended to language models by LLM-CFG [21], allowing controllable text generation and improved performance. However, LLM-CFG still requires a pair-wise prompt design and does not support partial token-level reweighting within the context, which is vital for controlling VLMs' generation. Besides, methods in Diffusion Models [37, 38] achieve fine-grained control over image generation using text prompts by emphasizing areas within cross-attention maps. Despite these advancements, fine-grained control over autoregressive generation in LLMs and VLMs is still challenging. Prompt Highlighter is proposed to tackle this issue.

in which $\epsilon_t$ is the noise prediction conditioned on the previous output $x_{t+1}$ and the text condition $c$. LLM-CFG [21] extended this property to autoregressive language models. Given a sequence of $N$ tokens $\mathbf{x} = \{x_1, \ldots, x_N\}$, the likelihood of predicting the entire sequence can be expressed as
$$P_\Theta(\mathbf{x}) = \prod_i^N P_\Theta(x_i \mid x_{j<i}).$$

Then, the attention probability $p_i$ is calculated as
$$p_i = \frac{\exp(h_i)}{\sum_{j=1}^{N} \exp(h_j)} = \frac{\beta^{m_i} \cdot \exp(k_i)}{\sum_{j=1}^{N} \beta^{m_j} \cdot \exp(k_j)}. \tag{8}$$
This mechanism defines the activation scaling factor as $\beta$. For the unconditional branch, the attention score is deactivated by using a scaled negative mask in the inference of $\tilde{P}_\Theta(\bar{x}_i \mid \bar{s}_{j<i})$.

… $G_y$ in all 500 cases. Given this property, the attention activation can capture a higher attention probability $p$, as defined in Eq. (8), from the given contexts. This attention contribution, plotted in Fig. 9, is denoted as $\sum_{m_i=1} p_i \,/\, \sum_{j} p_j$.
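As a concrete reading of Eq. (8), the sketch below adds log β to the pre-softmax attention scores at highlighted key positions, which is algebraically identical to the β^{m_i} reweighting above, and flips the sign for the unconditional branch as one simple interpretation of the "scaled negative mask"; the function names are illustrative and the causal mask is omitted for brevity.

```python
import math
import torch

def highlighted_attention_bias(highlight_mask, beta=1.5, uncond=False):
    """Additive pre-softmax bias realizing Eq. (8).

    highlight_mask: (N,) tensor, 1 at highlighted key positions, 0 elsewhere.
    Adding log(beta) to a key's score multiplies its softmax weight by beta:
    p_i = beta**m_i * exp(k_i) / sum_j beta**m_j * exp(k_j).
    For the unconditional branch the sign is flipped, suppressing attention
    to the highlighted keys instead of activating it.
    """
    bias = highlight_mask.float() * math.log(beta)
    return -bias if uncond else bias

def attention_with_highlight(q, k, v, highlight_mask, beta=1.5, uncond=False):
    """Single-layer scaled dot-product attention with highlight reweighting."""
    # q, k, v: (num_heads, N, d_head); highlight_mask: (N,). Causal mask omitted.
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5           # (heads, N, N)
    scores = scores + highlighted_attention_bias(highlight_mask, beta, uncond)
    probs = scores.softmax(dim=-1)                             # reweighted p_i
    return probs @ v
```

In a real model this bias can be folded into the additive attention mask passed to each self-attention layer, so the reweighting stays training-free and touches no weights.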