Wings: Learning Multimodal LLMs without Text-only Forgetting
Authors: Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Year: 2024
Source:
https://arxiv.org/abs/2406.03496
TLDR:
The paper introduces WINGS, a novel multimodal large language model (MLLM) designed to mitigate the issue of text-only instruction forgetting that occurs in MLLMs when fine-tuned with multimodal data. WINGS achieves this by analyzing attention shifts within the model and implementing parallel visual and textual learners that compensate for these shifts, effectively acting as "wings" to balance the focus on both modalities. The model utilizes a Low-Rank Residual Attention (LoRRA) architecture to maintain efficiency and has been shown to outperform comparable MLLMs on both text-only and visual question-answering tasks, as well as on a newly developed Interleaved Image-Text (IIT) benchmark.
WINGS is a new multimodal large language model that addresses the problem of text-only instruction forgetting in MLLMs by introducing parallel visual and textual learners to balance attention between modalities, resulting in improved performance on both text-only and visual tasks.
Abstract
Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the text-only instructions, which do not include images and can be addressed within the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to the attention shifts from pre-image to post-image text. From that, we construct extra modules that act as the boosted learner to compensate for the attention shift. The complementary visual and textual learners, like "wings" on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design the Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.
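The abstract names Low-Rank Residual Attention (LoRRA) as the mechanism that keeps the learners efficient but gives no implementation details. The sketch below is a minimal, hypothetical PyTorch reading of that idea: attention projections factored through a small rank and added as a residual to the main attention output. The class name, the default rank, and the single-head formulation are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class LoRRALearner(nn.Module):
    """Hypothetical sketch of a Low-Rank Residual Attention (LoRRA) learner.

    Query/key/value projections are factored through a small rank r, so the
    extra parameters stay cheap next to the frozen main attention; the caller
    adds the output as a residual on top of the main attention's output.
    """

    def __init__(self, d_model: int, rank: int = 16):
        super().__init__()

        def low_rank():
            return nn.Sequential(
                nn.Linear(d_model, rank, bias=False),   # down-project to rank r
                nn.Linear(rank, d_model, bias=False),   # up-project back to d_model
            )

        self.q_proj, self.k_proj, self.v_proj = low_rank(), low_rank(), low_rank()
        self.scale = d_model ** -0.5

    def forward(self, hidden: torch.Tensor, modal_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from the layer's hidden states; keys/values come from the
        # modality this learner serves (visual features or textual features).
        q = self.q_proj(hidden)                  # (B, T, d)
        k = self.k_proj(modal_feats)             # (B, S, d)
        v = self.v_proj(modal_feats)             # (B, S, d)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                          # residual addition is done by the caller
```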
Method
To identify the root cause of text-only instruction forgetting, the authors analyzed the attention mechanisms of multimodal large language models (MLLMs) and observed that the forgetting is tied to attention shifting away from the text toward image content. To address this, they developed an architecture in which visual and textual learners operate in parallel with each layer's attention block, compensating for the attention shift and attached like wings on either side of the model. The visual learners focus on image features, while the textual learners reinforce text comprehension. Their outputs are blended through attention-based routing, and efficiency is preserved by implementing the learners with Low-Rank Residual Attention (LoRRA). The authors also constructed a new Interleaved Image-Text (IIT) benchmark to evaluate performance in mixed-modality scenarios.
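As a reading aid, here is a hedged sketch of how the routing described above might combine the main attention with the two learners. The router design, the use of per-token attention shares as the routing signal, and all names are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class WingsBlockSketch(nn.Module):
    """Illustrative block: frozen self-attention plus parallel 'wing' learners.

    Assumes the main attention returns both its output and its attention
    weights, and that a tiny router turns each token's attention share on
    visual vs. textual positions into soft weights over the two learners.
    """

    def __init__(self, main_attn: nn.Module, visual_learner: nn.Module,
                 textual_learner: nn.Module):
        super().__init__()
        self.main_attn = main_attn            # frozen LLM self-attention
        self.visual_learner = visual_learner  # e.g. a LoRRA module over image features
        self.textual_learner = textual_learner
        self.router = nn.Linear(2, 2)         # attention shares -> learner weights

    def forward(self, hidden, visual_feats, text_feats, visual_mask):
        # main_out: (B, T, d); attn: (B, T, T); visual_mask: (B, T), 1 on image tokens.
        main_out, attn = self.main_attn(hidden)

        vis_share = (attn * visual_mask[:, None, :]).sum(-1, keepdim=True)  # (B, T, 1)
        txt_share = 1.0 - vis_share
        gates = torch.softmax(self.router(torch.cat([vis_share, txt_share], dim=-1)), dim=-1)

        wing_out = (gates[..., :1] * self.visual_learner(hidden, visual_feats)
                    + gates[..., 1:] * self.textual_learner(hidden, text_feats))
        return main_out + wing_out            # learners act as a residual correction
```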
Main Finding
The authors found that text-only instruction forgetting in multimodal large language models (MLLMs) is closely related to the attention shift that occurs when an image is inserted into a text sequence: as MLLMs are fine-tuned on multimodal data, attention drifts excessively toward visual tokens, degrading performance on text-only tasks. This observation motivated the WINGS design, which adds visual and textual learners to rebalance attention across modalities and prevent forgetting of text-only instructions. They also observed that a well-trained MLLM with strong text-only performance shows a positive correlation between the attention-weight proportions on text tokens before and after the inserted image; a more similar focus on both parts of the text indicates less disruption to the MLLM's essential attention.
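To make the correlation claim concrete, the toy function below estimates how a layer splits its attention mass between the text before and after an inserted image. It is a rough diagnostic under assumed tensor shapes and token positions, not the paper's actual metric.

```python
import torch

def pre_post_image_attention_share(attn: torch.Tensor,
                                   image_start: int, image_end: int):
    """Share of attention mass on text tokens before vs. after an image span.

    attn: (heads, query_len, key_len) attention weights from one layer, with
    queries taken from the post-image text. The image occupies key positions
    [image_start, image_end). Returns (pre_share, post_share) averaged over
    heads and queries, ignoring mass placed on the image tokens themselves.
    """
    pre = attn[..., :image_start].sum(dim=-1)     # mass on pre-image text
    post = attn[..., image_end:].sum(dim=-1)      # mass on post-image text
    total = pre + post + 1e-8
    return (pre / total).mean().item(), (post / total).mean().item()

# Example with random weights: a well-behaved MLLM would keep pre_share and
# post_share correlated across layers rather than collapsing onto one side.
attn = torch.softmax(torch.randn(8, 32, 128), dim=-1)
print(pre_post_image_attention_share(attn, image_start=16, image_end=80))
```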
Conclusion
The conclusion of the paper is that the WINGS model, with its innovative architecture featuring parallel visual and textual learners, effectively mitigates the text-only instruction forgetting problem commonly encountered in multimodal large language models (MLLMs). The WINGS model not only retains high performance on text-only tasks but also excels in multimodal comprehension and dialogue, as demonstrated by its superior performance on both text-only and visual question-answering tasks, as well as on the newly constructed Interleaved Image-Text (IIT) benchmark. This makes WINGS a versatile and robust solution for applications requiring multimodal understanding and text-only proficiency.
Keywords
WINGS, multimodal large language models (MLLMs), text-only instruction forgetting, attention shift, visual learners, textual learners, Low-Rank Residual Attention (LoRRA), Interleaved Image-Text (IIT) benchmark, attention-based routing, model alignment, efficiency.