While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they are mostly limited to input-side multimodal understanding and cannot produce content in multiple modalities. As humans always perceive the world and communicate through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill this gap, we present NExT-GPT, an end-to-end, general-purpose any-to-any MM-LLM system. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned with only a small number of parameters (1%) in certain projection layers, which not only keeps training costs low but also facilitates convenient expansion to more modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality MosIT dataset, with which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way toward more human-like AI research in the community.
Figure 1: By connecting an LLM with multimodal adaptors and diffusion decoders, NExT-GPT achieves universal multimodal understanding and any-to-any modality input and output.
In Figure 2 we further illustrate the inference procedure of NExT-GPT. Given user inputs in any combination of modalities, the corresponding modality encoders and projectors transform them into feature representations, which are passed to the LLM (text inputs are fed into the LLM directly). The LLM then decides what content to generate, i.e., textual tokens and modality signal tokens. If the LLM identifies content of a certain modality (other than language) to be produced, it outputs a special type of token indicating the activation of that modality; otherwise, the absence of such tokens means that modality remains deactivated. Technically, we design '<IMGi>' (i=0,...,4) as image signal tokens, '<AUDi>' (i=0,...,8) as audio signal tokens, and '<VIDi>' (i=0,...,24) as video signal tokens. After the LLM, the text responses are output to the user, while the representations of the signal tokens of the activated modalities are passed to the corresponding diffusion decoders for content generation.
Figure 2: NExT-GPT inference process. Grey colors denote the deactivation of the modules.
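To make the routing concrete, below is a minimal, hypothetical Python sketch (not the released NExT-GPT code) of how the decoded output could be split into user-facing text and per-modality signal-token representations that are then handed to the matching diffusion decoder. The route_outputs function, the placeholder hidden states, and the stand-in decoder callables are all assumptions for illustration.

# A minimal routing sketch: given the token strings generated by the LLM and
# their hidden states, collect the signal-token representations per modality
# and hand them to the matching diffusion decoder (hypothetical stand-ins here).

from collections import defaultdict

# Signal-token vocabularies as described above: 5 image, 9 audio, 25 video tokens.
SIGNAL_TOKENS = {
    "image": [f"<IMG{i}>" for i in range(5)],
    "audio": [f"<AUD{i}>" for i in range(9)],
    "video": [f"<VID{i}>" for i in range(25)],
}
TOKEN_TO_MODALITY = {t: m for m, toks in SIGNAL_TOKENS.items() for t in toks}


def route_outputs(generated_tokens, hidden_states, decoders):
    """Split the LLM output into user-facing text and per-modality signal features.

    generated_tokens: list[str] of decoded tokens from the LLM.
    hidden_states:    list of per-token representations (same length).
    decoders:         dict mapping modality name -> diffusion decoder callable.
    """
    text_tokens = []
    signal_feats = defaultdict(list)
    for tok, h in zip(generated_tokens, hidden_states):
        modality = TOKEN_TO_MODALITY.get(tok)
        if modality is None:
            text_tokens.append(tok)           # ordinary text goes back to the user
        else:
            signal_feats[modality].append(h)  # signal tokens activate a decoder

    outputs = {"text": " ".join(text_tokens)}
    for modality, feats in signal_feats.items():
        # The output projection + diffusion decoder consume the signal features.
        outputs[modality] = decoders[modality](feats)
    return outputs


# Toy usage: only the image decoder is activated, since only <IMG*> tokens appear.
decoders = {m: (lambda feats, m=m: f"{m} generated from {len(feats)} signal tokens")
            for m in SIGNAL_TOKENS}
tokens = ["Here", "is", "a", "sunset", "<IMG0>", "<IMG1>", "<IMG2>", "<IMG3>", "<IMG4>"]
states = [[0.0]] * len(tokens)                # placeholder hidden states
print(route_outputs(tokens, states, decoders))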
We design the system as three loosely coupled tiers, so that only the two projection layers, at the encoding side and the decoding side, need to be updated.
Figure 3: Illustration of the lightweight multimodal alignment learning of encoding and decoding.
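The following is a minimal PyTorch sketch of this lightweight setup, assuming stand-in linear modules for the multimodal encoder, the LLM, and the decoding-side condition space; NextGPTSketch and the dimensions are hypothetical. It only illustrates the idea that the encoder and LLM stay frozen while the two projection layers remain trainable.

# Lightweight alignment sketch: freeze the (stand-in) encoder and LLM, train only
# the input and output projection layers.

import torch
import torch.nn as nn

class NextGPTSketch(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, cond_dim=768):
        super().__init__()
        self.modal_encoder = nn.Linear(enc_dim, enc_dim)  # stand-in for the multimodal encoder
        self.input_proj = nn.Linear(enc_dim, llm_dim)     # trainable input projection
        self.llm = nn.Linear(llm_dim, llm_dim)            # stand-in for the frozen LLM
        self.output_proj = nn.Linear(llm_dim, cond_dim)   # trainable output projection
        # Freeze everything except the two projection layers.
        for module in (self.modal_encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, modal_feats):
        h = self.input_proj(self.modal_encoder(modal_feats))
        h = self.llm(h)
        return self.output_proj(h)

model = NextGPTSketch()
trainable = [p for p in model.parameters() if p.requires_grad]
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {sum(p.numel() for p in trainable)} / {total}")
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # the optimizer only sees the projections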
Further instruction tuning (IT) is necessary to enhance the capabilities and controllability of the LLM. To facilitate the development of any-to-any MM-LLMs, we propose a novel Modality-switching Instruction Tuning (MosIT). As illustrated in Figure 4, when an IT dialogue sample is fed into the system, the LLM reconstructs and generates the textual content of the input, representing the multimodal content with multimodal signal tokens. The optimization is imposed between the LLM's outputs and the gold annotations. In addition to tuning the LLM, we also fine-tune the decoding end of NExT-GPT: we align the modality signal token representations encoded by the output projection with the gold multimodal caption representations encoded by the diffusion condition encoder. This comprehensive tuning process brings the system closer to the goal of faithful and effective interaction with users.
Figure 4: Illustration of modality-switching instruction tuning.
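As a rough illustration of this training objective (not the official implementation), the sketch below combines a token-level cross-entropy over the gold response, covering both text and signal tokens, with an alignment term between the output-projected signal-token states and the condition encoder's embedding of the gold caption. All tensors, shapes, and the choice of an MSE alignment loss are assumptions for illustration.

# MosIT objective sketch: LM cross-entropy on the gold response plus an alignment
# loss at the decoding side. Random tensors stand in for real model outputs.

import torch
import torch.nn.functional as F

vocab_size, seq_len, llm_dim, cond_dim = 32000, 16, 4096, 768

# (1) Language-modeling loss over gold response tokens (text + modality signal tokens).
logits = torch.randn(seq_len, vocab_size)             # LLM output logits (stand-in)
gold_ids = torch.randint(0, vocab_size, (seq_len,))   # gold token ids from the IT sample
lm_loss = F.cross_entropy(logits, gold_ids)

# (2) Decoding-side alignment: output projection of the signal-token hidden states
# vs. the (frozen) condition encoder's representation of the gold caption.
signal_hidden = torch.randn(5, llm_dim)               # e.g. 5 image signal tokens
output_proj = torch.nn.Linear(llm_dim, cond_dim)      # trainable output projection
caption_cond = torch.randn(5, cond_dim)               # gold caption encoded by the condition encoder
align_loss = F.mse_loss(output_proj(signal_hidden), caption_cond)

loss = lm_loss + align_loss                           # joint objective used during MosIT
loss.backward()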
None of the existing IT datasets meet the requirements of our any-to-any MM-LLM scenario, so we construct a high-quality MosIT dataset. The data encompasses a wide range of multimodal inputs and outputs, offering the complexity and variability necessary to train MM-LLMs that can handle diverse user interactions and deliver the desired responses accurately.
@inproceedings{wu24next,
title={NExT-GPT: Any-to-Any Multimodal LLM},
author={Wu, Shengqiong and Fei, Hao and Qu, Leigang and Ji, Wei and Chua, Tat-Seng},
booktitle={Proceedings of the International Conference on Machine Learning},
pages={53366--53397},
year={2024}
}