
Prompting Future Driven Diffusion Model for Hand Motion Prediction

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15065)


Abstract

Hand motion prediction from both first- and third-person perspectives is vital for enhancing user experience in AR/VR and for ensuring safe remote control of robotic arms. Previous work typically predicts hand motion trajectories or full-body motion; direct hand motion prediction remains largely unexplored, despite the additional challenges posed by the hand's compact skeleton. To address this, we propose a prompt-based Future Driven Diffusion Model (PromptFDDM) that predicts hand motion with guidance and prompts. Specifically, we develop a Spatial-Temporal Extractor Network (STEN) that predicts hand motion under guidance, together with a Ground Truth Extractor Network (GTEN) and a Reference Data Generator Network (RDGN): GTEN extracts ground-truth data, and RDGN substitutes generated reference data for the unavailable future data, both serving to guide STEN. Interactive prompts generated from the observed motion further enhance model performance. Experimental results on the FPHA and HO3D datasets demonstrate that PromptFDDM achieves state-of-the-art performance in both first- and third-person perspectives.
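To make the abstract's guidance scheme concrete, the following is a minimal toy sketch (not the authors' implementation) of the core idea: a reverse-diffusion loop that denoises a future hand-joint sequence while being pulled toward reference data generated from the observed motion. All shapes, the encoder, the reference generator, and the update rule are hypothetical stand-ins for the STEN/GTEN/RDGN components named in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
J, T_OBS, T_FUT = 21, 10, 15       # hand joints, observed frames, future frames

def extract_condition(observed):
    # Stand-in for the observed-motion encoder: summarize the past frames.
    return observed.mean(axis=0)                      # (J, 3)

def reference_generator(cond):
    # Stand-in for RDGN: propose a reference future from the condition.
    # Here: naively repeat the mean pose over all future frames.
    return np.repeat(cond[None], T_FUT, axis=0)       # (T_FUT, J, 3)

def denoise_step(x_t, t, ref, alpha=0.9):
    # One toy reverse-diffusion step: nudge the noisy sample toward the
    # reference future, with a step size that grows as t decreases.
    return x_t + (1 - alpha ** t) * 0.1 * (ref - x_t)

observed = rng.normal(size=(T_OBS, J, 3))             # observed hand motion
cond = extract_condition(observed)
ref = reference_generator(cond)

x = rng.normal(size=(T_FUT, J, 3))                    # start from pure noise
for t in range(50, 0, -1):
    x = denoise_step(x, t, ref)

print(x.shape)  # (15, 21, 3): a denoised future hand-motion sequence
```

In the actual model the hand-crafted encoder, reference generator, and update rule would all be learned networks, and the guidance would come from GTEN (ground truth, when available) or RDGN (generated reference data); this sketch only illustrates the data flow.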




Acknowledgment

This work is funded in part by the National Natural Science Foundation of China (Grant No. 62372480), in part by ARC-Discovery (DP 220100800), in part by CCF-Tencent Rhino-Bird Open Research Fund (No. CCF-Tencent RAGR20230118).

Author information


Correspondence to Kaihao Zhang or Wenhan Luo.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tang, B., Zhang, K., Luo, W., Liu, W., Li, H. (2025). Prompting Future Driven Diffusion Model for Hand Motion Prediction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_10


  • DOI: https://doi.org/10.1007/978-3-031-72667-5_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72666-8

  • Online ISBN: 978-3-031-72667-5

  • eBook Packages: Computer Science, Computer Science (R0)
