Abstract
Hand motion prediction from both first- and third-person perspectives is vital for enhancing user experience in AR/VR and for ensuring safe remote robotic arm control. Previous works typically focus on predicting hand motion trajectories or full-body human motion, while direct hand motion prediction remains largely unexplored, despite the additional challenges posed by the compact hand skeleton. To address this, we propose a prompt-based Future Driven Diffusion Model (PromptFDDM) that predicts hand motion under guidance and prompts. Specifically, we develop a Spatial-Temporal Extractor Network (STEN) that predicts hand motion under guidance, a Ground Truth Extractor Network (GTEN) that extracts guidance from ground-truth future data, and a Reference Data Generator Network (RDGN) that substitutes the unavailable future data with generated reference data; GTEN and RDGN together guide STEN. Additionally, interactive prompts generated from observed motions further enhance model performance. Experimental results on the FPHA and HO3D datasets demonstrate that the proposed PromptFDDM achieves state-of-the-art performance in both first- and third-person perspectives.
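To make the guidance scheme concrete, the following is a minimal, purely illustrative sketch of the data flow the abstract describes, not the paper's actual implementation (PromptFDDM uses a diffusion model; here each sub-network is a toy two-layer MLP over flattened joint sequences, and all dimensions, names, and weights are assumptions): at training time GTEN compresses the real future into a guidance code for STEN, while at inference time RDGN generates reference data to stand in for the unavailable future.

```python
import numpy as np

rng = np.random.default_rng(0)

T_OBS, T_FUT, J, D = 10, 25, 21, 3  # observed/future frames, 21 hand joints, 3D coords

def mlp(x, w1, b1, w2, b2):
    """Tiny two-layer perceptron used as a stand-in for each sub-network."""
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2

def init(in_dim, hid, out_dim):
    """Random weights for one stand-in sub-network."""
    return (rng.normal(0, 0.1, (in_dim, hid)), np.zeros(hid),
            rng.normal(0, 0.1, (hid, out_dim)), np.zeros(out_dim))

obs = rng.normal(size=(T_OBS, J, D)).reshape(-1)        # observed hand motion
gt_future = rng.normal(size=(T_FUT, J, D)).reshape(-1)  # ground-truth future (training only)

GUIDE_DIM = 64
gten = init(gt_future.size, 128, GUIDE_DIM)             # GTEN: encodes a future sequence
rdgn = init(obs.size, 128, gt_future.size)              # RDGN: generates reference "future"
sten = init(obs.size + GUIDE_DIM, 256, gt_future.size)  # STEN: predicts, conditioned on a code

# Training-time guidance: GTEN extracts a guidance code from the real future.
g_train = mlp(gt_future, *gten)
pred_train = mlp(np.concatenate([obs, g_train]), *sten)

# Inference-time guidance: RDGN substitutes generated reference data for the
# unavailable future, and GTEN encodes it the same way.
ref_future = mlp(obs, *rdgn)
g_test = mlp(ref_future, *gten)
pred_test = mlp(np.concatenate([obs, g_test]), *sten)

print(pred_test.reshape(T_FUT, J, D).shape)  # predicted future hand motion: (25, 21, 3)
```

The key design point this sketch captures is that STEN consumes the same kind of guidance code in both regimes, so training with ground-truth guidance transfers to inference with generated reference data.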
Acknowledgment
This work is funded in part by the National Natural Science Foundation of China (Grant No. 62372480), in part by ARC-Discovery (DP 220100800), in part by CCF-Tencent Rhino-Bird Open Research Fund (No. CCF-Tencent RAGR20230118).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tang, B., Zhang, K., Luo, W., Liu, W., Li, H. (2025). Prompting Future Driven Diffusion Model for Hand Motion Prediction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_10
DOI: https://doi.org/10.1007/978-3-031-72667-5_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72666-8
Online ISBN: 978-3-031-72667-5