Project page: https://sqwu.top/Any2Cap/
We present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple the interpretation of the various conditions from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as regions, motion, and camera poses) into dense, structured captions that give backbone video generators better guidance.
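To make the decoupling concrete, below is a minimal Python sketch of the two-stage idea. All names, classes, and signatures here are hypothetical placeholders for illustration only, not the released Any2Caption API: stage 1 stands in for the MLLM that folds heterogeneous conditions into one structured caption, and stage 2 for any caption-conditioned video backbone.

```python
# Conceptual sketch of the decoupled pipeline described above.
# Everything here is a hypothetical placeholder, not the real API.

from dataclasses import dataclass, field


@dataclass
class Conditions:
    """Arbitrary user conditions; any subset may be provided."""
    text: str | None = None
    images: list[str] = field(default_factory=list)   # reference image paths
    video: str | None = None                          # reference video path
    regions: list[tuple[int, int, int, int]] = field(default_factory=list)
    motion: str | None = None                         # e.g. a trajectory spec
    camera_pose: str | None = None                    # e.g. a pose sequence


def interpret_conditions(cond: Conditions) -> str:
    """Stage 1 (hypothetical): an MLLM turns heterogeneous conditions
    into a single dense, structured caption."""
    parts = []
    if cond.text:
        parts.append(f"scene: {cond.text}")
    if cond.images:
        parts.append(f"subjects from {len(cond.images)} reference image(s)")
    if cond.motion:
        parts.append(f"motion: {cond.motion}")
    if cond.camera_pose:
        parts.append(f"camera: {cond.camera_pose}")
    # A real MLLM would also ground regions, identities, etc. in the caption.
    return "; ".join(parts)


def generate_video(structured_caption: str) -> bytes:
    """Stage 2 (hypothetical): any caption-conditioned video generator."""
    raise NotImplementedError("plug in your video backbone here")


caption = interpret_conditions(
    Conditions(text="a corgi surfing at sunset", camera_pose="slow orbit left")
)
print(caption)  # the dense caption that guides the backbone generator
```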
Stay Tuned.
If you find Any2Caption useful and use it in your project, please kindly cite:
@article{wu2025Any2Caption,
  title   = {Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation},
  author  = {Shengqiong Wu and Weicai Ye and Jiahao Wang and Quande Liu and Xintao Wang and Pengfei Wan and Di Zhang and Kun Gai and Shuicheng Yan and Hao Fei and Tat-Seng Chua},
  journal = {arXiv preprint},
  year    = {2025}
}