NExT-GPT:

Any-to-Any Multimodal LLM

NExT++ Research Center, National University of Singapore
ICML 2024, Oral

Video Presentation

Abstract

While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they mostly fall prey to the limitation of input-side multimodal understanding only, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill this gap, we present NExT-GPT, an end-to-end, general-purpose any-to-any MM-LLM system. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned with only a small number of parameters (1%) in certain projection layers, which not only enables low-cost training but also facilitates convenient expansion to more modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for it, with which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.


Technical Description


• Architecture


Figure 1: By connecting an LLM with multimodal adaptors and diffusion decoders, NExT-GPT achieves universal multimodal understanding and any-to-any modality input and output.


  • Multimodal Encoding Stage. We leverage existing well-established models to encode inputs of various modalities. Here we adopt ImageBind, a unified high-performance encoder across six modalities. A linear projection layer then maps the different input representations into language-like representations that are comprehensible to the LLM.
  • LLM Understanding and Reasoning Stage. An LLM serves as the core agent of NExT-GPT; technically, we employ Vicuna. The LLM takes the representations from the different modalities as input and carries out semantic understanding and reasoning over them. It outputs 1) textual responses directly, and 2) signal tokens for each modality that instruct the decoding layers whether to generate multimodal content, and if so, what content to produce.
  • Multimodal Generation Stage. Upon receiving the multimodal signal tokens with specific instructions from the LLM (if any), the Transformer-based output projection layers map the signal-token representations into ones understandable to the downstream multimodal decoders. Technically, we employ current off-the-shelf latent-conditioned diffusion models for the different modalities, i.e., Stable Diffusion (SD) for image synthesis, Zeroscope for video synthesis, and AudioLDM for audio synthesis.
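The three stages above amount to a simple forward pass: frozen encoder features are projected into the LLM's token space, and the LLM's signal-token states are projected into the diffusion models' condition space. The following is a minimal sketch of those two trainable projections; the dimensions and random weights are illustrative placeholders, not the actual ImageBind/Vicuna/SD sizes.

```python
import numpy as np

# Hypothetical dimensions -- chosen for illustration only.
D_ENC, D_LLM, D_COND = 1024, 4096, 768

rng = np.random.default_rng(0)

# Frozen encoder output for one image (stand-in for ImageBind features).
image_feat = rng.standard_normal(D_ENC)

# Trainable input projection: maps encoder space -> LLM token space.
W_in = rng.standard_normal((D_LLM, D_ENC)) * 0.01
llm_token = W_in @ image_feat           # a "language-like" token, shape (4096,)

# Trainable output projection: maps an LLM signal-token state -> diffusion
# condition space (e.g., the text-encoder space of Stable Diffusion).
W_out = rng.standard_normal((D_COND, D_LLM)) * 0.01
condition = W_out @ llm_token           # decoder-understandable, shape (768,)

print(llm_token.shape, condition.shape)  # (4096,) (768,)
```

Only `W_in` and `W_out` (plus the LLM's LoRA-style additions during instruction tuning) would be updated; the encoder, LLM backbone, and diffusion decoders stay frozen, which is what keeps the tunable parameter count around 1%.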



• System Inference

In Figure 2 we further illustrate the inference procedure of NExT-GPT. Given user inputs in any combination of modalities, the corresponding modal encoders and projectors transform them into feature representations that are passed to the LLM (text inputs are fed into the LLM directly). The LLM then decides what content to generate, i.e., textual tokens and modality signal tokens. If the LLM determines that content in a certain modality (other than language) should be produced, it outputs a special type of token indicating the activation of that modality; the absence of such a token means that modality stays deactivated. Technically, we design '<IMGi>' (i=0,...,4) as image signal tokens, '<AUDi>' (i=0,...,8) as audio signal tokens, and '<VIDi>' (i=0,...,24) as video signal tokens. After the LLM, the text responses are output to the user, while the representations of the signal tokens of the activated modalities are passed to the corresponding diffusion decoders for content generation.
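The activation logic can be made concrete with a small parser: scan the LLM's output string for the signal-token scheme above and activate only the decoders whose tokens appear. This is an illustrative sketch of the routing decision, not the project's actual decoding code (which operates on token ids and hidden states rather than strings).

```python
import re

# Signal-token scheme from the paper: '<IMGi>' i=0..4, '<AUDi>' i=0..8,
# '<VIDi>' i=0..24.
PATTERNS = {
    "image": re.compile(r"<IMG\d+>"),
    "audio": re.compile(r"<AUD\d+>"),
    "video": re.compile(r"<VID\d+>"),
}

def active_modalities(llm_output: str) -> set:
    """Return the modalities whose diffusion decoders should be activated."""
    return {name for name, pat in PATTERNS.items() if pat.search(llm_output)}

out = "Here is a calm scene for you. <IMG0><IMG1> <AUD0>"
print(sorted(active_modalities(out)))  # ['audio', 'image']
print(active_modalities("Just a text answer."))  # set()
```

A plain-text response activates nothing, so the system degrades gracefully to an ordinary text chatbot when no multimodal output is requested.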



Figure 2: NExT-GPT inference process. Grey colors denote the deactivation of the modules.


• Lightweight Multimodal Alignment Learning

The system is designed with three loosely coupled tiers, and we only need to update the two projection layers on the encoding side and the decoding side.

  • Encoding-side LLM-centric Multimodal Alignment. We align the input multimodal features with the text feature space, i.e., representations that are understandable to the core LLM.
  • Decoding-side Instruction-following Alignment. We minimize the distance between the LLM's modal signal-token representations (after each Transformer-based projection layer) and the conditional text representations of the diffusion models. Since only the textual condition encoders are used (with the diffusion backbones frozen), learning is based purely on caption texts, i.e., without any visual or audio inputs.
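The decoding-side objective above can be sketched as a simple distance between the projected signal-token representations and the frozen condition-encoder's caption representations. The mean-squared-error form and the shapes here are assumptions for illustration; the key point is that the diffusion backbone never runs during alignment, only its text-condition encoder.

```python
import numpy as np

def alignment_loss(signal_reps, caption_reps):
    """Mean-squared distance between projected LLM signal-token
    representations and the (frozen) condition-encoder caption features."""
    diff = signal_reps - caption_reps
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
sig = rng.standard_normal((4, 768))   # 4 projected image signal tokens
cap = rng.standard_normal((4, 768))   # caption features from frozen encoder
print(alignment_loss(sig, cap) > 0)   # True
print(alignment_loss(cap, cap))       # 0.0
```

Because only caption text is needed on the target side, this stage trains cheaply on image/video/audio captioning corpora without ever decoding any pixels or waveforms.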


Figure 3: Illustration of the lightweight multimodal alignment learning of encoding and decoding.


• Modality-switching Instruction Tuning (MosIT)

Further instruction tuning (IT) is necessary to enhance the capabilities and controllability of the LLM. To facilitate the development of any-to-any MM-LLMs, we propose a novel Modality-switching Instruction Tuning (MosIT). As illustrated in Figure 4, when an IT dialogue sample is fed into the system, the LLM reconstructs and generates the textual content of the input (representing the multimodal content with multimodal signal tokens). The optimization is imposed between the gold annotations and the LLM's outputs. In addition to tuning the LLM, we also fine-tune the decoding end of NExT-GPT, aligning the modal signal-token representations encoded by the output projection with the gold multimodal caption representations encoded by the diffusion condition encoders. Thereby, the comprehensive tuning process brings the system closer to the goal of faithful and effective interaction with users.
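Under a plausible reading of the paragraph above, the MosIT objective combines a next-token cross-entropy on the LLM's text-and-signal-token outputs with the decoding-side alignment term. The sketch below assumes that additive combination and an arbitrary weighting `w`; the vocabulary size and shapes are toy placeholders.

```python
import numpy as np

def cross_entropy(logits, target_ids):
    """Mean negative log-likelihood of the gold token ids (text + signal tokens)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target_ids)), target_ids].mean())

def mosit_loss(logits, target_ids, signal_reps, caption_reps, w=1.0):
    """Hypothetical combined objective: LLM token loss + decoding-side alignment."""
    align = float(np.mean((signal_reps - caption_reps) ** 2))
    return cross_entropy(logits, target_ids) + w * align

rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 32000))        # 6 output positions, toy vocab
targets = rng.integers(0, 32000, size=6)        # gold annotation token ids
sig = rng.standard_normal((4, 768))             # projected signal tokens
cap = rng.standard_normal((4, 768))             # gold caption condition features
print(mosit_loss(logits, targets, sig, cap) > 0)  # True
```

The two terms update different parameters: the cross-entropy drives the LLM (and input projections), while the alignment term drives the output projections, matching the loosely coupled tier design.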



Figure 4: Illustration of modality-switching instruction tuning.


MosIT Data

All existing IT datasets fail to meet the requirements of our any-to-any MM-LLM scenario. We thus construct a high-quality MosIT dataset. The data encompasses a wide range of multimodal inputs and outputs, offering the necessary complexity and variability to facilitate the training of MM-LLMs that can handle diverse user interactions and deliver the desired responses accurately.


Demonstrations



• Example-1: Text → Text+Image+Audio

avatar-user

I've recently started practicing mindfulness and meditation as part of my daily routine, but sometimes it's challenging. Can you help me with some techniques?

avatar-bot

Of course! Mindfulness and meditation can be a wonderful way to cultivate a sense of calm and focus. One technique you might find helpful is called "mindful breathing." It involves taking slow, deep breaths and focusing your attention on the sensation of the breath entering and leaving your body. Another technique is "body scan meditation," where you consciously scan your entire body from head to toe, paying attention to any sensations or discomfort. You can also try guided meditations that provide a structured practice. Would you like me to find a video that demonstrates mindful breathing or body scan meditation?

avatar-user

I wish to listen to a music for the meditation first.

avatar-bot

I guess you may want some musics that sound soft.

avatar-user

Base on the audio content, visualize it in an image.

avatar-bot
avatar-user

Can you infer what time is in the picture?

avatar-bot

Based on the image, it is likely that the time is either early morning or late afternoon. The sunlight is either low or setting, creating a serene atmosphere.


• Example-2: Text+Image → Text+Image+Video

avatar-user

Show me another image in similar contents.

avatar-bot

avatar-user

Is there any suitable place to see this scene?

avatar-bot

Yes, there is a suitable place to see this scene. You can see the beautiful pink cherry blossoms in full bloom on a tree next to a river. The image is taken from a distance, allowing you to appreciate the vibrant colors and the serene atmosphere.

avatar-user

Use a sound to describe the beautiful scene.

avatar-bot

Sure!

avatar-user

It would be great to see a video that show the stunning scene.

avatar-bot

Certainly! Here is the video scene.


• Example-3: Text+Video → Text+Image

avatar-user

Is the guy in the video iron man?

avatar-bot

Yes, the guy in the video is iron man.

avatar-user

Could you show me a comic-style picture of iron man?

avatar-bot

Certainly! Here is a comic-style picture of Iron Man:

avatar-user

I need a more abstract one.

avatar-bot

Sure, here's an abstract image of Iron Man:


• Example-4: Text+Video → Text+Audio

avatar-user

What is the man doing in the video?

avatar-bot

The man in the video is celebrating his birthday by lighting a candle on a cake and making a wish.

avatar-user

What is the man wearing?

avatar-bot

The man is wearing a white shirt.

avatar-user

How about synthesizing a song to celebrate his birthday please?

avatar-bot

It would be a great idea to synthesize a song to celebrate the man's birthday. Here's a sample of a song that could be used:

  • Verse 1: Happy birthday, dear friend, You're shining bright, With a smile so wide. You're the light in our lives
  • Chorus: Celebrate, celebrate, With laughter and joy, With love and happiness, Bringing happiness to those around
  • Verse 2: You're the one who brings us together, With your warmth and your grace, You're the one who makes us feel, A part of this special place.
  • Chorus: Celebrate, celebrate, With laughter and joy, With love and happiness, Bringing happiness to those around
  • Bridge: So let's raise a glass, To a life filled with love and grace. With laughter and joy, May your birthday be filled with peace
  • Chorus: Celebr

avatar-user

I guess you made me wrong. I mean, show the song in an audio.

avatar-bot

Sorry for the mistake! Here's an audio of the song:


• Example-5: Text+Audio → Text+Image+Video

avatar-user

Describe what can be heard from the sound:

avatar-bot

The sound of a growling lion is a deep, rumbling noise that starts low in the chest and gradually builds in intensity. It is often associated with a sense of power and strength, as well as a sense of danger or warning. The growl of a lion can be heard from a distance and is often used by the animal to communicate with other members of its pride.

avatar-user

Displaying an image to help visualize it would be fantastic.

avatar-bot

Sure, here is an image of a lion growling:


avatar-user

That looks great! How about showing a video that explains how the lions catch its prey?

avatar-bot

• Example-6: Text → Text+Image+Video+Audio

avatar-user

I've recently taken up painting as a hobby, and I'm looking for some inspiration. Can you help me find some interesting ideas?

avatar-bot

Of course! Painting is a wonderful creative outlet. What kind of subjects are you interested in exploring? Landscapes, portraits, abstract art, or something else?

avatar-user

Abstract art

avatar-bot

Abstract art sounds great. It allows you to express your emotions and ideas without any specific subject matter. You can experiment with different techniques like brushstrokes, color mixing, and layering. Here's an image of an abstract painting that might inspire you:


avatar-user

And also water painting. Maybe I could start by watching some teaching videos.

avatar-bot

That's good idea, please check the tutorial video:

avatar-user

Fantastic! Can you describe the video?

avatar-bot

Of course! The video is a tutorial on watercolor painting. It covers the basics of mixing colors, creating textures, and layering techniques. The instructor demonstrates different brushstrokes and techniques to achieve various effects.

avatar-user

That's good. So I want to see some water painting pictures.

avatar-bot

Here are some watercolor paintings:

avatar-user

That's so amazing painting. I wish I could do that. Can you show me some tips about water painting?

avatar-bot

Sure, here are a few tips for water painting:

1. Use a wet brush to create soft, blended strokes.
2. Experiment with different watercolor techniques such as wet-on-wet, wet-on-dry, and wet-of-wet.
3. Pay attention to the transparency of the watercolor and use thin layers to create depth.
4. Use a variety of brushes with different shapes and sizes to achieve different effects.
5. Explore the use of negative space to create a sense of depth and dimension.
6. Don't be afraid to make mistakes and embrace the unpredictability of watercolor.

avatar-user

If I were painting, I am used to play some musics. I think that is helpful for my creation.

avatar-bot

Totally agreed! Music helps a lot for art creation, especially some inspiring music:


• Example-7: Text → Text+Image

avatar-user

I want a tattoo of a sunflower on my arm, but I am not sure what style would look best on me.

avatar-bot

A sunflower tattoo can be done in various styles, such as realistic, abstract, or even minimalist. It all depends on your personal taste and the design you want to go with. You can also consider the color scheme and placement of the tattoo to make it look more natural and harmonious with your body. Would you like me to provide you with some examples of sunflower tattoo designs?

avatar-user

Yes please!

avatar-bot

Here are a few examples of sunflower tattoo designs:

1. A realistic sunflower tattoo with vibrant petals and leaves, surrounded by a detailed background.
2. An abstract sunflower tattoo, with a minimalist design that focuses on the shape and color of the flower.
3. A simple sunflower tattoo with a single sunflower petal, representing the beauty and simplicity of the flower.

avatar-user

Good tips, can you show me the picture the design?

avatar-bot

Sure, here are the images of the sunflower tattoo designs:


• Example-8: Text+Video → Text+Image+Audio

avatar-user

What is the unusual part of this video?

avatar-bot

The unusual part of this video is that a dog is riding a skateboard. It is not common to see a dog on a skateboard, as dogs are typically not able to ride skateboards or bicycles. This scene is unique and interesting because it showcases a dog's ability to adapt and participate in human-made activities, demonstrating its intelligence and adaptability.

avatar-user

That is pretty cool. How about adding some music to make this scene look more funny?

avatar-bot

Certainly! Here's a synthesized music track that could enhance the humor and lightheartedness of the scene:


avatar-user

Do you have any other similar unusual scenes? Show them.

avatar-bot

I have a set of such kinds of scenes:

avatar-user

I had an idea for one scene - what if a pig is flying in the sky?

avatar-bot

It is not a common sight to see a pig or any other animal, let alone a pig, flying in the air. This scene is likely meant to be humorous or whimsical, as it is an unconventional and lighthearted depiction of a pig enjoying its time in the sky.


• Example-9: Text → Text+Image+Video+Audio

avatar-user

I'm planning a trip to New Orleans, and I'd love some recommendations for fun things to do and places to visit.

avatar-bot

New Orleans is a vibrant city with a rich history and culture. Some popular attractions include the French Quarter, the National WWII Museum, and the St. Louis Cathedral. You can also explore the historic Garden District and visit the iconic Bourbon Street for some nightlife. Additionally, don't miss the opportunity to try some delicious Cajun cuisine!

avatar-user

What kind of cajun cuisine? can you show me some pictures?

avatar-bot

Certainly! Cajun cuisine is known for its hearty and flavorful dishes. Here are some pictures to give you a taste of what you can expect:


avatar-user

I've heard that New Orleans is famous for jazz music. Show me a piece of jazz music please.

avatar-bot

You are right. Here is a jazz music:

avatar-user

Fantastic! I often see videos in which a bunch of people dance to jazz music, can you show me some?

avatar-bot

Of course!


• Example-10: Text → Text+Image

avatar-user

Now I want to start a pet shop, but I need a logo for my shop. Do you have any ideas?

avatar-bot

Certainly! I can help you create a logo for your pet shop. To do this I would need to know more about the specific theme or style you'd like to have for your pet shop. Factors such as color scheme, animal types, and overall atmosphere could play a role in designing the logo.

avatar-user

It is a turtle pet shop, so the logo could be a turtle.

avatar-bot

I see, then, how about a logo like this?

avatar-user

It's cute, but I'm aiming for a logo that's more distinctive and has a cool style.

avatar-bot

Understand. How about this one?

avatar-user

It's nice. Can you explain this logo? and why is it designed like this?

avatar-bot

The use of a green turtle, which is a distinctive and easily recognizable symbol, adds a sense of uniqueness and character to the logo. Additionally, the circular shape of the logo gives it a modern and clean appearance, making it stand out from other pet shop logos. The combination of these elements gives the logo a cool style that is likely intended to appeal to potential customers and convey a sense of fun, creativity, and innovation.


• Example-11: Text+Image → Text+Image

avatar-user

Where is the sculpture located?

avatar-bot

It is the Statue of Liberty. The sculpture is located on the grounds of the Liberty Island in New York City.

avatar-user

Can you show me a closer look for the sculpture.

avatar-bot

Certainly!


• Example-12: Text+Image → Text+Audio

avatar-user

How many cars are in this picture?

avatar-bot

There are two cars shown in this picture.

avatar-user

Tell me what kind of car are they?

avatar-bot

The two cars in the picture are SUVs.

avatar-user

Can you show me the sound of starting the engines of them.

avatar-bot

Sure, here's the sound of the engine of the SUV:


Related Links

You may refer to related work that serves as foundations for our framework and code repository, such as Vicuna, ImageBind, Stable Diffusion, AudioLDM, and Zeroscope. We also partially draw inspirations from CoDi, VPGTrans, Video-LLaMA, PandaGPT, GILL, and MiniGPT-4.

BibTeX

@inproceedings{wu24next,
  title={NExT-GPT: Any-to-Any Multimodal LLM},
  author={Wu, Shengqiong and Fei, Hao and Qu, Leigang and Ji, Wei and Chua, Tat-Seng},
  booktitle={Proceedings of the International Conference on Machine Learning},
  pages = {53366--53397},
  year={2024}
}