Parrot: Multilingual Visual Instruction Tuning

Authors: Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
TLDR:
The paper presents PARROT, a method that enhances the multilingual capabilities of Multimodal Large Language Models (MLLMs) by aligning visual tokens with language-specific inputs using textual guidance and a Mixture-of-Experts (MoE) module. It addresses the issue of "multilingual erosion" caused by English-centric datasets in Supervised Fine-Tuning (SFT). PARROT demonstrates state-of-the-art performance on the Massive Multilingual Multimodal Benchmark (MMMB) and MMBench, particularly excelling in languages like Turkish and Arabic. The authors plan to release the source code and training dataset to the public.

The paper introduces PARROT, a novel method that improves the multilingual capabilities of Multimodal Large Language Models (MLLMs) by aligning visual tokens with language-specific inputs using textual guidance and a Mixture-of-Experts (MoE) module, addressing the challenge of "multilingual erosion" and demonstrating superior performance on new multilingual benchmarks.

Abstract

The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities; as a result, MLLMs' inherent ability to respond in multiple languages progressively deteriorates as training proceeds. We empirically find that imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This stems from the failure to align the vision encoder and the LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot conditions the visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to promote the alignment of multilingual tokens. Specifically, to enhance the alignment of non-English visual tokens, we compute cross-attention between the initial visual features and the textual embeddings; the result is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, given the current lack of benchmarks for evaluating multilingual capabilities in the field, we collect and release the Massive Multilingual Multimodal Benchmark (MMMB), which covers 6 languages, 15 categories, and 12,000 questions. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available.
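The alignment step described in the abstract can be written compactly. The following is a hedged formalization assuming standard scaled dot-product cross-attention and soft expert weighting; the symbols (H_v, H_t, W_Q, W_K, W_V, W_r, E_i) are introduced here for illustration and are not taken from the paper.

```latex
% Hedged formalization; H_v: initial visual tokens, H_t: text embeddings of the prompt,
% E_i: the i-th language expert, W_*: illustrative projection matrices.
\[
\begin{aligned}
A    &= \operatorname{softmax}\!\left(\tfrac{(H_v W_Q)(H_t W_K)^{\top}}{\sqrt{d}}\right) H_t W_V
        && \text{(cross-attention of visual tokens over the text)}\\
g    &= \operatorname{softmax}\!\big(W_r \operatorname{pool}(A)\big)
        && \text{(MoE router weights over language experts)}\\
H_v' &= \textstyle\sum_i g_i \, E_i(H_v)
        && \text{(language-specific visual tokens)}
\end{aligned}
\]
```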

Method

The authors propose PARROT, which leverages textual guidance to drive the alignment of visual tokens at the language level. A Mixture-of-Experts (MoE) module selects the most relevant language experts based on cross-attention computed between the visual features and the textual embeddings; the selected experts then convert English-biased visual tokens into language-specific embeddings, thereby enhancing the multilingual capabilities of MLLMs.
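To make the routing step concrete, here is a minimal PyTorch sketch of a textual-guided MoE in the spirit of this mechanism. The class name, layer sizes, mean pooling, residual combination, and soft (rather than top-k) expert weighting are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedMoE(nn.Module):
    """Sketch of language-level visual token alignment (illustrative, not the paper's code).

    Visual tokens attend to the text embeddings (cross-attention); the pooled result
    drives an MoE router that weights language experts, each of which maps the
    original visual tokens to language-specific visual tokens.
    """

    def __init__(self, dim: int = 1024, num_experts: int = 6, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.router = nn.Linear(dim, num_experts)  # scores each language expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, visual_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, D), text_embeds: (B, Nt, D)
        # 1. Cross-attention: visual tokens query the language-specific text.
        attn_out, _ = self.cross_attn(visual_tokens, text_embeds, text_embeds)

        # 2. Pool and route: decide which language experts are relevant.
        gate = F.softmax(self.router(attn_out.mean(dim=1)), dim=-1)  # (B, E)

        # 3. Each expert converts the original visual tokens; combine by gate weight.
        expert_out = torch.stack([e(visual_tokens) for e in self.experts], dim=1)  # (B, E, Nv, D)
        language_specific = (gate[:, :, None, None] * expert_out).sum(dim=1)

        # Residual keeps the original visual information alongside the language-specific part.
        return visual_tokens + language_specific


# Toy usage: batch of 2, 196 visual tokens, 32 text tokens, hidden size 1024.
if __name__ == "__main__":
    moe = TextGuidedMoE()
    v = torch.randn(2, 196, 1024)
    t = torch.randn(2, 32, 1024)
    print(moe(v, t).shape)  # torch.Size([2, 196, 1024])
```

In a multilingual setting the number of experts would plausibly match the supported languages (six in MMMB), and a router could equally select a single top-1 expert per sample instead of the soft mixture used here.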

Main Finding

The authors discovered that existing Multimodal Large Language Models (MLLMs) suffer from a phenomenon they termed "multilingual erosion," where the models' performance in non-English languages deteriorates due to the use of English-centric datasets during Supervised Fine-Tuning (SFT). To address this, they developed PARROT, a method that uses textual guidance to align visual tokens with language-specific inputs, employing a Mixture-of-Experts (MoE) module to convert English-biased visual tokens into language-specific embeddings. They also created a new benchmark, the Massive Multilingual Multimodal Benchmark (MMMB), to fairly assess the multilingual capabilities of MLLMs. Their findings showed that PARROT achieves state-of-the-art performance on both the MMMB and MMBench benchmarks, significantly outperforming existing models in languages such as Turkish and Arabic.

Conclusion

The authors conclude that PARROT successfully addresses the challenge of enhancing the multilingual capabilities of Multimodal Large Language Models (MLLMs). PARROT utilizes textual guidance to align visual tokens with language-specific inputs through a Mixture-of-Experts (MoE) module, effectively mitigating the issue of "multilingual erosion." Its effectiveness is validated by state-of-the-art performance on the newly introduced Massive Multilingual Multimodal Benchmark (MMMB) and on MMBench, particularly in languages such as Turkish and Arabic. The authors also commit to releasing the source code and training dataset publicly, underscoring their contribution to advancing MLLMs and to promoting equitable access to these technologies across languages and cultures.

Keywords

Multimodal Large Language Models (MLLMs), Visual Instruction Tuning, Multilingual Capabilities, Supervised Fine-Tuning (SFT), Multilingual Erosion, PARROT, Mixture-of-Experts (MoE), Massive Multilingual Multimodal Benchmark (MMMB), MMBench
