👀How to train and evaluate VideoChat-Flash?🦜

1. Prepare Training Data

Our data has been collected and used across different projects and by different people. For data that has already been uploaded, we point you to the corresponding locations; please collect the relevant data fragments and integrate them in your own environment. We use a data format similar to LLaVA-NeXT's, and you can customize your own training data in this format.

In data, we list the data used in each training stage, along with the corresponding annotation locations. All data annotations and some of the videos are available at OpenGVLab/VideoChat-Flash-Training-Data, and all video source URLs are listed in the annotation files.
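
For illustration, one annotation entry in this format might look like the sketch below. The field names follow common LLaVA conventions and are our assumption; consult the released annotation files for the exact schema.

# A minimal sketch of one annotation entry in the LLaVA-NeXT-style format.
# Field names and paths are illustrative placeholders only.
sample = {
    "id": "example_0",
    "video": "your_video_folder/example_clip.mp4",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is happening in this video?"},
        {"from": "gpt", "value": "A person is assembling a wooden bookshelf."},
    ],
}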

2. Training

Stage | Num. frames | ViT | Connector | LLM | CKPT
stage1 | 4 | ❄️ | 🔥 | ❄️ | all projector weights
stage2 | 4-8 | 🔥 | 🔥 | 🔥 | UMT-Qwen2_7B, UMT-Qwen2_5_1M_7B, UMT-HD-Qwen2_5_2B, InternVideo2-Qwen2_5_7B
stage3 | 64-512 | 🔥 | 🔥 | 🔥 | UMT-Qwen2_7B, UMT-HD-Qwen2_5-2B, UMT-Qwen2_5_1M_7B, InternVideo2-Qwen2_5_7B
stage4 | 64-512 | 🔥 | 🔥 | ❄️ | UMT-HD-Qwen2-7B

Training time with 32 A100 GPUs:

  • stage1: under one hour
  • stage2: about 2 days
  • stage3: about 2~3 days
  • stage4: about 2~3 days

We recommend starting from stage3 with our provided stage2 model to save training cost, and you can use 1/4 of the stage3 data for ablations (as we do)! You can also skip stage4 if you don't need absolute SoTA performance!

We use Slurm to train models on multiple machines. If you only have one machine or don't use Slurm, please refer to LLaVA-NeXT to modify the scripts.

Install

git clone https://github.com/OpenGVLab/VideoChat-Flash
cd llava-train_videochat
pip install -e .

Stage-1: Video-Language Alignment

Please download the pretrained video encoders from Hugging Face first. Then modify ckpt_path in build_vit of llava/model/multimodal_encoder/umt_encoder.py or llava/model/multimodal_encoder/internvideo2_encoder.py.
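
Before launching stage 1, you may want to sanity-check that the checkpoint path you set in build_vit loads correctly. The snippet below is a minimal sketch, not part of the repo; the path is a placeholder for the file you downloaded from Hugging Face.

# Optional sanity check that the encoder checkpoint pointed to by ckpt_path loads.
import torch

ckpt_path = "/path/to/downloaded/video_encoder.pth"  # placeholder; use your real path
state_dict = torch.load(ckpt_path, map_location="cpu")
print(f"Loaded checkpoint with {len(state_dict)} top-level entries from {ckpt_path}")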

bash scripts/train/stage1-init_connector/stage1_umt_tome16_res224_qwen7b.sh

Stage-2: Short Video Pre-training

bash scripts/train/stage2-visual_pretraining/stage2_umt_tome16_res224_qwen_7b.sh

Stage-3: Joint Short & Long Video Instruction Tuning

bash scripts/train/stage3-video_sft/stage3_umt_tome16_res224_qwen_7b.sh

Stage-4: Efficient High-Resolution Post-finetuning

Please set vision_tower="umt-hd-large" in Your_stage3_checkpoint_path/config.json first!
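
If you prefer to script this edit, a minimal sketch is shown below (the path is the same placeholder as above; replace it with your actual stage-3 checkpoint directory).

# Set vision_tower in the stage-3 checkpoint's config.json.
import json

cfg_path = "Your_stage3_checkpoint_path/config.json"  # placeholder path
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["vision_tower"] = "umt-hd-large"
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)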

bash scripts/train/stage4_highres_postft/stage4_umt_tome16_res448_qwen_7b.sh

Evaluation

Overwrite your checkpoint directory with the configuration and Python files from OpenGVLab/VideoChat-Flash, and then you can use the lmms-eval_videochat we provide for evaluation.
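
A minimal sketch of this overwrite step is shown below. The Hugging Face repo id is an assumption for illustration; use the released VideoChat-Flash model repo that matches your checkpoint.

# Copy the released config and Python files over a local checkpoint directory.
import shutil
from huggingface_hub import snapshot_download

hf_dir = snapshot_download(
    "OpenGVLab/VideoChat-Flash-Qwen2-7B_res448",  # assumed repo id; adjust to your model
    allow_patterns=["*.py", "*.json"],
)
shutil.copytree(hf_dir, "Your_checkpoint_path", dirs_exist_ok=True)  # placeholder path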