
Overview of LLaVA: Large Language and Vision Assistant


LLaVA is an open-source project that aims to build large multimodal models with capabilities approaching GPT-4 in both vision and language understanding. The key idea is to use language-only GPT-4 to generate visual instruction tuning data, and use this data to teach the multimodal model to follow instructions across vision and language.
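
The data-generation step is purely textual: GPT-4 never sees pixels, only captions and bounding-box coordinates describing an image, and is asked to write a conversation as if it could see that image. Below is a minimal sketch of the idea using the OpenAI Python client; the prompt wording and the make_instruction_sample helper are illustrative, not the project's actual prompts.

    # Sketch: generate visual instruction-following data with a language-only model.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def make_instruction_sample(captions, boxes):
        """captions: list of caption strings; boxes: list of 'label: [x1, y1, x2, y2]' strings."""
        context = "Captions:\n" + "\n".join(captions) + "\n\nBoxes:\n" + "\n".join(boxes)
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You are given textual descriptions of an image. "
                               "Write a multi-turn conversation between a person asking "
                               "about the image and an assistant answering as if it "
                               "could see the image.",
                },
                {"role": "user", "content": context},
            ],
        )
        return response.choices[0].message.content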

Dataset

The core dataset for LLaVA is LLaVA Visual Instruct 150K, a collection of 158K multimodal instruction-following examples generated by prompting language-only GPT-4 with textual descriptions (captions and bounding boxes) of COCO images. It consists of three subsets: 58K multi-turn conversations, 23K detailed image descriptions, and 77K complex reasoning examples.
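
Each record pairs one image with a GPT-4-written conversation. A rough sketch of the JSON shape is shown below; the field names follow the released files, but the specific id, file name, and dialogue text are invented for illustration.

    # Approximate shape of one LLaVA Visual Instruct 150K record (illustrative values).
    sample = {
        "id": "000000215677",            # COCO image id (made up here)
        "image": "000000215677.jpg",     # image file the conversation refers to
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is the person in the photo doing?"},
            {"from": "gpt", "value": "The person is riding a bicycle along a path by the beach."},
        ],
    }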

Model

LLaVA connects a visual encoder (CLIP ViT-L/14) with a large language model (Vicuna). It is trained in two stages:

  1. Feature alignment pretraining: align visual features with the LLM's text embedding space using a filtered subset of CC3M image-caption pairs; only the projection layer is trained in this stage.
     # Stage 1: feature alignment pretraining.
     # A hedged sketch against the Hugging Face transformers LLaVA integration;
     # it assumes a transformers-format checkpoint and a `dataloader` that
     # yields preprocessed image-caption batches containing `labels`.
     import torch
     from transformers import LlavaForConditionalGeneration

     # Filtered CC3M image-caption subset used to build `dataloader`.
     cc3m_subset_path = "/path/to/cc3m/subset"

     model = LlavaForConditionalGeneration.from_pretrained(
         "liuhaotian/llava-llama-2-13b-v0",  # vision side: CLIP ViT-L/14 at 336px, v1 prompt template
         device_map="auto",                  # spread layers across available GPUs
     )

     # Freeze the CLIP vision tower and the language model;
     # only the multimodal projector is trained in this stage.
     model.vision_tower.requires_grad_(False)
     model.language_model.requires_grad_(False)
     model.multi_modal_projector.requires_grad_(True)

     optimizer = torch.optim.AdamW(
         (p for p in model.parameters() if p.requires_grad), lr=1e-3
     )

     num_epochs = 1
     model.train()
     for epoch in range(num_epochs):
         for batch in dataloader:
             # forward pass
             loss = model(**batch).loss

             # backward pass
             loss.backward()

             # update the projection parameters
             optimizer.step()
             optimizer.zero_grad()
    
  2. Visual instruction tuning: fine-tune the projection layer and the LLM end-to-end on LLaVA Visual Instruct 150K, keeping the vision encoder frozen.
     # Stage 2: visual instruction tuning.
     # As above, a hedged sketch: the projection layer and the language model
     # are updated together on the 158K instruction-following examples, while
     # the vision encoder stays frozen.
     import torch
     from transformers import LlavaForConditionalGeneration

     # LLaVA Visual Instruct 150K, used to build `dataloader`.
     instruct_data_path = "/path/to/llava/instruct/data"

     model = LlavaForConditionalGeneration.from_pretrained(
         "liuhaotian/llava-llama-2-13b-v0",
         device_map="auto",
     )

     # Unfreeze everything, then re-freeze the CLIP vision tower.
     model.requires_grad_(True)
     model.vision_tower.requires_grad_(False)

     optimizer = torch.optim.AdamW(
         (p for p in model.parameters() if p.requires_grad), lr=2e-5
     )

     num_epochs = 3
     model.train()
     for epoch in range(num_epochs):
         for batch in dataloader:
             # forward pass
             loss = model(**batch).loss

             # backward pass
             loss.backward()

             # update projector and LLM parameters
             optimizer.step()
             optimizer.zero_grad()
    

This two-stage recipe lets LLaVA follow multimodal instructions and demonstrate strong vision-language capabilities.
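
The glue between the two pretrained components is deliberately lightweight: image patch features from the CLIP encoder are projected into the language model's token embedding space and consumed as ordinary sequence elements. A minimal sketch of that connector follows; the class name and the dimensions 1024/5120/576 are assumptions for a ViT-L/14 encoder and a 13B LLM, not LLaVA's exact code.

    import torch
    import torch.nn as nn

    class VisionLanguageConnector(nn.Module):
        """Project CLIP patch features into the LLM embedding space (sketch)."""
        def __init__(self, clip_dim=1024, llm_dim=5120):
            super().__init__()
            # The original LLaVA uses a single linear layer here;
            # LLaVA-1.5 swaps it for a small two-layer MLP.
            self.proj = nn.Linear(clip_dim, llm_dim)

        def forward(self, patch_features):       # (batch, num_patches, clip_dim)
            return self.proj(patch_features)     # (batch, num_patches, llm_dim)

    # Projected patch embeddings are concatenated with the text token embeddings
    # and passed to the LLM as one sequence.
    connector = VisionLanguageConnector()
    visual_tokens = connector(torch.randn(1, 576, 1024))
    print(visual_tokens.shape)  # torch.Size([1, 576, 5120])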

Code and Usage

The LLaVA codebase provides implementations for training, evaluation, and serving the models, including a Gradio web demo and a command-line interface.

It also includes utilities such as 4-bit/8-bit quantization and multi-GPU inference.
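
For example, serving a checkpoint through the Hugging Face transformers integration looks roughly like this; the llava-hf/llava-1.5-7b-hf checkpoint, the 4-bit setting, and the prompt are illustrative choices, and bitsandbytes and accelerate must be installed.

    from PIL import Image
    from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit quantization
        device_map="auto",                                          # spread across available GPUs
    )

    image = Image.open("example.jpg")
    prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200)
    print(processor.decode(output[0], skip_special_tokens=True))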

Key links:

  Paper: https://arxiv.org/abs/2304.08485
  Code: https://github.com/haotian-liu/LLaVA
  Project page: https://llava-vl.github.io
  Dataset: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K

Conclusion

In summary, LLaVA shows how visual instruction-tuning data generated by language-only GPT-4 can be used to train large multimodal models with strong vision-language abilities. The openly released dataset, codebase, and models enable further research toward systems that approach the capabilities of multimodal GPT-4.
