👀How to train and evaluate VideoChat-Flash?🦜

1. Prepare Training Data

Our data has been collected and used across different projects and by different people. For data that has already been uploaded, we point you to the corresponding locations; please collect the relevant data fragments and integrate them in your own environment. We use a data format similar to LLaVA-NeXT's, and you can customize your own training data in this format.

In data, we list the data used in each training stage, along with the corresponding annotation locations. All data annotations and some of the videos are available at OpenGVLab/VideoChat-Flash-Training-Data, and all video source URLs are listed in the annotation files.
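
For illustration, one annotation entry in this format might look like the sketch below. The field names follow common LLaVA conventions and are our assumption; consult the released annotation files for the exact schema.

# A minimal sketch of one annotation entry in the LLaVA-NeXT-style format.
# Field names and paths are illustrative placeholders only.
sample = {
    "id": "example_0",
    "video": "your_video_folder/example_clip.mp4",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is happening in this video?"},
        {"from": "gpt", "value": "A person is assembling a wooden bookshelf."},
    ],
}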

2. Training

Stage | Num. frames | ViT | Connector | LLM | CKPT
stage1 | 4 | ❄️ | 🔥 | ❄️ | all projector weights
stage2 | 4-8 | 🔥 | 🔥 | 🔥 | UMT-Qwen2_7B, UMT-Qwen2_5_1M_7B, UMT-HD-Qwen2_5_2B, InternVideo2-Qwen2_5_7B
stage3 | 64-512 | 🔥 | 🔥 | 🔥 | UMT-Qwen2_7B, UMT-HD-Qwen2_5-2B, UMT-Qwen2_5_1M_7B, InternVideo2-Qwen2_5_7B
stage4 | 64-512 | 🔥 | 🔥 | ❄️ | UMT-HD-Qwen2-7B

Training time with 32 A100 GPUs:

  • stage1: under one hour
  • stage2: about 2 days
  • stage3: about 2~3 days
  • stage4: about 2~3 days

We recommend starting from stage3 with our provided stage2 model to save training cost, and you can use 1/4 of the stage3 data for ablations (as we do)! You can also skip stage4 if you don't need absolute SoTA performance!

We use Slurm to train models on multiple machines. If you only have one machine or don't use Slurm, please refer to LLaVA-NeXT to modify the scripts.

Install

git clone https://github.com/OpenGVLab/VideoChat-Flash
cd llava-train_videochat
pip install -e .

Stage-1: Video-Language Alignment

Please download the pretrained video encoders from Hugging Face first. Then modify ckpt_path in build_vit of llava/model/multimodal_encoder/umt_encoder.py or llava/model/multimodal_encoder/internvideo2_encoder.py.
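
Before launching stage 1, you may want to sanity-check that the checkpoint path you set in build_vit loads correctly. The snippet below is a minimal sketch, not part of the repo; the path is a placeholder for the file you downloaded from Hugging Face.

# Optional sanity check that the encoder checkpoint pointed to by ckpt_path loads.
import torch

ckpt_path = "/path/to/downloaded/video_encoder.pth"  # placeholder; use your real path
state_dict = torch.load(ckpt_path, map_location="cpu")
print(f"Loaded checkpoint with {len(state_dict)} top-level entries from {ckpt_path}")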

bash scripts/train/stage1-init_connector/stage1_umt_tome16_res224_qwen7b.sh

Stage-2: Short Video Pre-training

bash scripts/train/stage2-visual_pretraining/stage2_umt_tome16_res224_qwen_7b.sh

Stage-3: Joint Short & Long Video Instruction Tuning

bash scripts/train/stage3-video_sft/stage3_umt_tome16_res224_qwen_7b.sh

Stage-4: Efficient High-Resolution Post-finetuning

Please set vision_tower="umt-hd-large" in Your_stage3_checkpoint_path/config.json first!
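
If you prefer to script this edit, a minimal sketch is shown below (the path is the same placeholder as above; replace it with your actual stage-3 checkpoint directory).

# Set vision_tower in the stage-3 checkpoint's config.json.
import json

cfg_path = "Your_stage3_checkpoint_path/config.json"  # placeholder path
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["vision_tower"] = "umt-hd-large"
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)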

bash scripts/train/stage4_highres_postft/stage4_umt_tome16_res448_qwen_7b.sh

Evaluation

Overwrite your checkpoint directory with the configuration and Python files from OpenGVLab/VideoChat-Flash, and then you can use the lmms-eval_videochat we provide for evaluation.
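
A minimal sketch of this overwrite step is shown below. The Hugging Face repo id is an assumption for illustration; use the released VideoChat-Flash model repo that matches your checkpoint.

# Copy the released config and Python files over a local checkpoint directory.
import shutil
from huggingface_hub import snapshot_download

hf_dir = snapshot_download(
    "OpenGVLab/VideoChat-Flash-Qwen2-7B_res448",  # assumed repo id; adjust to your model
    allow_patterns=["*.py", "*.json"],
)
shutil.copytree(hf_dir, "Your_checkpoint_path", dirs_exist_ok=True)  # placeholder path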