Multimodal Interaction & World Model
The Doubao Multimodal Interaction and World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities, and to advancing the exploration and development of multimodal assistant products.

Research topics

Foundations and applications of multimodal understanding models
Develop integrated models that understand audio-visual and linguistic inputs; strengthen fundamental image and video understanding, including text, layout, grounding, and spatial relations, as well as multimodal reasoning. Improve the efficiency of model training and inference, enable long-term memory of users, and optimize model performance across devices for a better experience.
Multimodal
Foundation

Multimodal agent and inference
Build advanced capabilities for multimodal models, including multimodal RAG, visual CoT, and agent capabilities, and develop general multimodal agents for GUIs and games in the virtual world (see the illustrative sketch after this card).
Multimodal
Foundation
Agent
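
As a concrete illustration of the GUI-agent direction above, the following is a minimal, hypothetical sketch of an observe-plan-act loop, assuming a vision-language model that maps a screenshot and a goal to the next UI action. The names VLMClient, take_screenshot, and execute_action are illustrative placeholders, not actual Seed/Doubao interfaces.

```python
# Minimal, hypothetical sketch of a multimodal GUI-agent loop.
# All names (VLMClient, take_screenshot, execute_action) are illustrative
# placeholders, not actual Seed/Doubao interfaces.
from dataclasses import dataclass


@dataclass
class Action:
    kind: str           # e.g. "click", "type", "scroll", "stop"
    argument: str = ""  # e.g. a target description or text to type


class VLMClient:
    """Placeholder for a vision-language model that plans GUI actions."""

    def plan(self, screenshot: bytes, goal: str, history: list[Action]) -> Action:
        # A real system would send the screenshot, goal, and action history
        # to a multimodal model and parse its structured reply into an Action.
        raise NotImplementedError


def take_screenshot() -> bytes:
    """Placeholder: capture the current GUI state as an image."""
    raise NotImplementedError


def execute_action(action: Action) -> None:
    """Placeholder: dispatch the chosen action to the GUI environment."""
    raise NotImplementedError


def run_gui_agent(model: VLMClient, goal: str, max_steps: int = 20) -> None:
    """Observe-plan-act loop: screenshot -> model decision -> GUI action."""
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = model.plan(screenshot, goal, history)
        if action.kind == "stop":  # the model decides the goal is reached
            break
        execute_action(action)
        history.append(action)
```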

Unified models for generation and understanding
Explore unified representation and training methods for continuous and discrete signals, and develop models that interleave generation and understanding.
Multimodal
World Model

World Model
Employ pre-training and simulation technologies to model diverse environments in the virtual and physical worlds, providing foundational capabilities for multimodal interactive exploration.
Multimodal
World Model

Technical applications

Seed-VLM
Seed-VLM is an advanced visual assistant designed for Doubao's scenarios. It delivers dependable performance through post-training and enhances the user experience with comprehensive features, including visual chain-of-thought (Visual CoT) reasoning; a simplified illustration follows this card.
Vision-Language Model
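
To make the Visual CoT idea above concrete, here is a minimal, hypothetical prompting sketch in which the model is asked to write down intermediate visual observations before committing to a final answer. The vlm_generate function and the prompt template are assumptions for illustration only, not Seed-VLM's actual API or prompt format.

```python
# Minimal, hypothetical sketch of visual chain-of-thought (Visual CoT) prompting:
# the model first lists intermediate visual observations (regions, objects,
# relations), then gives a final answer.
# `vlm_generate` is an illustrative placeholder, not Seed-VLM's actual API.

VISUAL_COT_TEMPLATE = (
    "You are a visual assistant.\n"
    "Question: {question}\n"
    "First, list the image regions and details relevant to the question "
    "as numbered observations.\n"
    "Then give the final answer on a line starting with 'Answer:'."
)


def vlm_generate(image: bytes, prompt: str) -> str:
    """Placeholder: send an image and a prompt to a vision-language model."""
    raise NotImplementedError


def answer_with_visual_cot(image: bytes, question: str) -> tuple[str, str]:
    """Return (reasoning trace, final answer) from a Visual CoT style prompt."""
    prompt = VISUAL_COT_TEMPLATE.format(question=question)
    completion = vlm_generate(image, prompt)
    # Separate the intermediate observations from the final answer.
    reasoning, _, answer = completion.partition("Answer:")
    return reasoning.strip(), answer.strip()
```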