Read the Chinese version
This repository collects papers and resources on long-context LLMs, covering architecture, infrastructure, training, and evaluation. For a clear taxonomy and deeper insights into the methodology, please refer to our survey: Thus Spake Long-Context Large Language Model. The survey gives a global picture of the lifecycle of long-context LLMs from four perspectives: architecture, infrastructure, training, and evaluation. It covers length extrapolation, cache optimization, memory management, architecture innovation, training infrastructure, inference infrastructure, long-context pre-training, long-context post-training, long-context MLLMs (mainly long VideoLLMs), and long-context evaluation, showcasing the full spectrum of long-context technologies. The survey concludes with 10 open questions currently facing long-context LLMs.
We appreciate any suggestions from peers for improving this paper list or survey, and we are committed to updating the repository regularly.
If you would like your paper included, or would like to suggest modifications to this survey and repository, please feel free to raise an issue or send an email to xrliu24@m.fudan.edu.cn. We sincerely appreciate your collaboration!
If you find our survey useful for your research, please consider citing the following paper:
@misc{liu2025spakelongcontextlargelanguage,
  title={Thus Spake Long-Context Large Language Model},
  author={Xiaoran Liu and Ruixiao Li and Mianqiu Huang and Zhigeng Liu and Yuerong Song and Qipeng Guo and Siyang He and Qiqi Wang and Linlin Li and Qun Liu and Yaqian Zhou and Xuanjing Huang and Xipeng Qiu},
  year={2025},
  eprint={2502.17129},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.17129},
}
- [2025.03.12] 🎉🎉🎉 We collect the papers and blogs mentioned in the survey.
- [2025.02.27] 🎉🎉🎉 We release an introductory video about our survey on Bilibili.
- [2025.02.26] 🎉🎉🎉 We release the presentation slides on GitHub.
- [2025.02.25] 🎉🎉🎉 Our paper is ranked the #1 paper of the day on Hugging Face.
- [2025.02.24] 🎉🎉🎉 We release the first version of the paper on arXiv and Hugging Face.
- [2025.01.29] 🎉🎉🎉 We release the first version of the paper on GitHub.
- Thus Spake Long-Context Large Language Model
- Advancing transformer architecture in long-context large language models: A comprehensive survey. Yunpeng Huang, Jingwei Xu, Junyu Lai, Zixu Jiang, Taolue Chen, Zenan Li, Yuan Yao, Xiaoxing Ma, Lijuan Yang, Hao Chen, others. arXiv preprint arXiv:2311.12351, 2023
- Length extrapolation of transformers: A survey from the perspective of position encoding. Liang Zhao, Xiaocheng Feng, Xiachong Feng, Bin Qin, Ting Liu. arXiv preprint arXiv:2312.17044, 2023
- The What, Why, and How of Context Length Extension Techniques in Large Language Models--A Detailed Survey. Saurav Pawar, SM Tonmoy, SM Zaman, Vinija Jain, Aman Chadha, Amitava Das. arXiv preprint arXiv:2401.07872, 2024
- MOSS: An Open Conversational Large Language Model. Tianxiang Sun, Xiaotian Zhang, Zhengfu He, Peng Li, Qinyuan Cheng, Xiangyang Liu, Hang Yan, Yunfan Shao, Qiong Tang, Shiduo Zhang, Xingjian Zhao, Ke Chen, Yining Zheng, Zhejian Zhou, Ruixiao Li, Jun Zhan, Yunhua Zhou, Linyang Li, Xiaogui Yang, Lingling Wu, Zhangyue Yin, Xuanjing Huang, Yu-Gang Jiang, Xipeng Qiu. Machine Intelligence Research, 2024
- Language models are few-shot learners. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, others. Advances in neural information processing systems, 33, 1877--1901, 2020
- GPT-4 Technical Report. OpenAI. arXiv preprint arXiv:2303.08774, 2023
- Introducing Claude. Anthropic. 2023
- Model Card and Evaluations for Claude Models. Anthropic. 2024
- Introducing the next generation of Claude. Anthropic. 2024
- Gemini: a family of highly capable multimodal models. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, others. arXiv preprint arXiv:2312.11805, 2023
- Gemma: Open models based on gemini research and technology. Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi{\`e}re, Mihir Sanjay Kale, Juliette Love, others. arXiv preprint arXiv:2403.08295, 2024
- Gemma 2: Improving open language models at a practical size. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L{\'e}onard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram{\'e}, others. arXiv preprint arXiv:2408.00118, 2024
- LLaMA 3.3 - 70B Instruct. Meta AI. 2024
- tiktoken: A fast BPE tokeniser for use with OpenAI's models. OpenAI. 2023
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, others. arXiv preprint arXiv:2403.05530, 2024
- Introducing Gemini 2.0: our new AI model for the agentic era. Demis Hassabis, Koray Kavukcuoglu. 2024
- DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power! DeepSeek. 2024
- The falcon series of open language models. Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, M{\'e}rouane Debbah, {\'E}tienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, others. arXiv preprint arXiv:2311.16867, 2023
- Baichuan-7B: An Open-Source Large-Scale Pre-trained Model. Baichuan Intelligent Technology. 2023
- Falcon2-11B Technical Report. Quentin Malartic, Nilabhra Roy Chowdhury, Ruxandra Cojocaru, Mugariya Farooq, Giulia Campesan, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Maksim Velikanov, Basma El Amel Boussaha, others. arXiv preprint arXiv:2407.14885, 2024
- The Falcon 3 family of Open Models. TII Team. 2024
- How long can open-source llms truly promise on context length?. Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, Hao Zhang. 2023
- Llama: Open and efficient foundation language models. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth{\'e}e Lacroix, Baptiste Rozi{\`e}re, Naman Goyal, Eric Hambro, Faisal Azhar, others. arXiv preprint arXiv:2302.13971, 2023
- Llama 2: Open foundation and fine-tuned chat models. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, others. arXiv preprint arXiv:2307.09288, 2023
- Code llama: Open foundation models for code. Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, others. arXiv preprint arXiv:2308.12950, 2023
- Effective Long-Context Scaling of Foundation Models. Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, others. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 4643--4663, 2024
- Introducing meta llama 3: The most capable openly available llm to date. Meta AI. 2024
- The llama 3 herd of models. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, others. arXiv preprint arXiv:2407.21783, 2024
- Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Meta AI. 2024
- Minicpm: Unveiling the potential of small language models with scalable training strategies. Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, others. arXiv preprint arXiv:2404.06395, 2024
- World model on million-length video and language with ringattention. Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel. arXiv e-prints, 2024
- MoonshotAI Kimi. MoonshotAI. 2023
- Kimi k1.5: Scaling reinforcement learning with llms. Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, others. arXiv preprint arXiv:2501.12599, 2025
- Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, others. arXiv preprint arXiv:2501.12948, 2025
- Step-1V. Step Team. 2024
- Step-2. Step Team. 2024
- ChatGPT: Advanced Language Model by OpenAI. OpenAI. 2022
- MiniMax-01: Scaling Foundation Models with Lightning Attention. MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu. 2025
- InternLM2 Technical Report. Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, Jia Yu, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, Dahua Lin. 2024
- abab7-preview. MiniMax. 2024
- Apple intelligence foundation language models. Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, others. arXiv preprint arXiv:2407.21075, 2024
- Phi-3 technical report: A highly capable language model locally on your phone. Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, others. arXiv preprint arXiv:2404.14219, 2024
- QwQ: Reflect Deeply on the Boundaries of the Unknown. Qwen Team. 2024
- Phi-4 Technical Report. Marah Abdin, Jyoti Aneja, Harkirat Behl, S{\'e}bastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, others. arXiv preprint arXiv:2412.08905, 2024
- Yi: Open foundation models by 01.AI. Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, others. arXiv preprint arXiv:2403.04652, 2024
- Baichuan 2: Open large-scale language models. Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, others. arXiv preprint arXiv:2309.10305, 2023
- Pythia: A suite for analyzing large language models across training and scaling. Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, others. International Conference on Machine Learning, 2397--2430, 2023
- Self-attention with relative position representations. Peter Shaw, Jakob Uszkoreit, Ashish Vaswani. arXiv preprint arXiv:1803.02155, 2018
- Exploring the limits of transfer learning with a unified text-to-text transformer. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu. Journal of machine learning research, 21(140), 1--67, 2020
- TENER: adapting transformer encoder for named entity recognition. Hang Yan, Bocao Deng, Xiaonan Li, Xipeng Qiu. arXiv preprint arXiv:1911.04474, 2019
- Transformer-xl: Attentive language models beyond a fixed-length context. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, Ruslan Salakhutdinov. arXiv preprint arXiv:1901.02860, 2019
- Roformer: Enhanced transformer with rotary position embedding. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, Yunfeng Liu. Neurocomputing, 568, 127063, Elsevier, 2024
- Palm: Scaling language modeling with pathways. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, others. Journal of Machine Learning Research, 24(240), 1--113, 2023
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. Ofir Press, Noah Smith, Mike Lewis. International Conference on Learning Representations, 2022
- A length-extrapolatable transformer. Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei. arXiv preprint arXiv:2212.10554, 2022
- Retentive network: A successor to transformer for large language models. Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei. arXiv preprint arXiv:2307.08621, 2023
- Llm maybe longlm: Self-extend llm context window without tuning. Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, Xia Hu. arXiv preprint arXiv:2401.01325, 2024
- Longnet: Scaling transformers to 1,000,000,000 tokens. Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei. arXiv preprint arXiv:2307.02486, 2023
- Infllm: Training-free long-context extrapolation for llms with an efficient context memory. Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
- LongHeads: Multi-Head Attention is Secretly a Long Context Processor. Yi Lu, Xin Zhou, Wei He, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang. arXiv preprint arXiv:2402.10685, 2024
- ReAttention: Training-Free Infinite Context with Finite Attention Scope. Xiaoran Liu, Ruixiao Li, Qipeng Guo, Zhigeng Liu, Yuerong Song, Kai Lv, Hang Yan, Linlin Li, Qun Liu, Xipeng Qiu. arXiv preprint arXiv:2407.15176, 2024
- Training-free long-context scaling of large language models. Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong. arXiv preprint arXiv:2402.17463, 2024
- Why Does the Effective Context Length of LLMs Fall Short?. Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong. arXiv preprint arXiv:2410.18745, 2024
- Parallel context windows for large language models. Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham. arXiv preprint arXiv:2212.10947, 2022
- Scaling transformer to 1m tokens and beyond with rmt. Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail S Burtsev. arXiv preprint arXiv:2304.11062, 2023
- XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference. Shengnan Wang, Youhui Bai, Lin Zhang, Pingyi Zhou, Shixiong Zhao, Gong Zhang, Sen Wang, Renhai Chen, Hua Xu, Hongwei Sun. arXiv preprint arXiv:2405.17755, 2024
- LLMxMapReduce: Simplified Long-Sequence Processing using Large Language Models. Zihan Zhou, Chong Li, Xinyi Chen, Shuo Wang, Yu Chao, Zhili Li, Haoyu Wang, Rongqiao An, Qi Shi, Zhixing Tan, others. arXiv preprint arXiv:2410.09342, 2024
- LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration. Jun Zhao, Can Zu, Hao Xu, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi Zhang, Xuanjing Huang. arXiv preprint arXiv:2402.11550, 2024
- Scaling laws of rope-based extrapolation. Xiaoran Liu, Hang Yan, Chenxin An, Xipeng Qiu, Dahua Lin. The Twelfth International Conference on Learning Representations, 2024
- Extending context window of large language models via positional interpolation. Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian. arXiv preprint arXiv:2306.15595, 2023
- YaRN: Efficient Context Window Extension of Large Language Models. Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole. The Twelfth International Conference on Learning Representations, 2024
- Giraffe: Adventures in expanding context lengths in llms. Arka Pal, Deep Karkhanis, Manley Roberts, Samuel Dooley, Arvind Sundararajan, Siddartha Naidu. arXiv preprint arXiv:2308.10882, 2023
- Longrope: Extending llm context window beyond 2 million tokens. Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, Mao Yang. arXiv preprint arXiv:2402.13753, 2024
- Extending Context Window of Large Language Models from a Distributional Perspective. Yingsheng Wu, Yuxuan Gu, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin. arXiv preprint arXiv:2410.01490, 2024
- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia. The Twelfth International Conference on Learning Representations, 2024
- Zebra: Extending context window with layerwise grouped local-global attention. Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, Dong Yu. arXiv preprint arXiv:2312.08618, 2023
- Random-access infinite context length for transformers. Amirkeivan Mohtashami, Martin Jaggi. Advances in Neural Information Processing Systems, 36, 54567--54585, 2023
- An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding. Tong Wu, Yanpeng Zhao, Zilong Zheng. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
- LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models. Zhiyuan Hu, Yuliang Liu, Jinman Zhao, Suyuchen Wang, Yan Wang, Wei Shen, Qing Gu, Anh Tuan Luu, See-Kiong Ng, Zhiwei Jiang, others. arXiv preprint arXiv:2409.00509, 2024
- Focused transformer: Contrastive training for context scaling. Szymon Tworkowski, Konrad Staniszewski, Miko{\l}aj Pacek, Yuhuai Wu, Henryk Michalewski, Piotr Mi{\l}o{\'s}. Advances in Neural Information Processing Systems, 36, 2024
- Growlength: Accelerating llms pretraining by progressively growing training length. Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Chia-Yuan Chang, Xia Hu. arXiv preprint arXiv:2310.00576, 2023
- E2-LLM: Efficient and Extreme Length Extension of Large Language Models. Jiaheng Liu, Zhiqi Bai, Yuanxing Zhang, Chenchen Zhang, Yu Zhang, Ge Zhang, Jiakai Wang, Haoran Que, Yukang Chen, Wenbo Su, others. arXiv preprint arXiv:2401.06951, 2024
- FocusLLM: Scaling LLM's Context by Parallel Decoding. Zhenyu Li, Yike Zhang, Tengyu Pan, Yutao Sun, Zhichao Duan, Junjie Fang, Rong Han, Zixuan Wang, Jianyong Wang. arXiv preprint arXiv:2408.11745, 2024
- Pose: Efficient context window extension of llms via positional skip-wise training. Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li. arXiv preprint arXiv:2309.10400, 2023
- Randomized positional encodings boost length generalization of transformers. Anian Ruoss, Gr{\'e}goire Del{\'e}tang, Tim Genewein, Jordi Grau-Moya, R{\'o}bert Csord{\'a}s, Mehdi Bennani, Shane Legg, Joel Veness. arXiv preprint arXiv:2305.16843, 2023
- CD-Pos: Long Context Generalization in LLMs Through Continuous and Discrete Position Synthesis. Zhiyuan Hu, Yuliang Liu, Jinman Zhao, Suyuchen Wang, Wei Shen, Chao Yin, Bryan Hooi, others. First Workshop on Long-Context Foundation Models@ ICML 2024, 2024
- Transformer language models without positional encodings still learn positional information. Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, Omer Levy. arXiv preprint arXiv:2203.16634, 2022
- Latent positional information is in the self-attention variance of transformer language models without positional embeddings. Ta-Chung Chi, Ting-Han Fan, Li-Wei Chen, Alexander I Rudnicky, Peter J Ramadge. arXiv preprint arXiv:2305.13571, 2023
- The impact of positional encoding on length generalization in transformers. Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, Siva Reddy. Advances in Neural Information Processing Systems, 36, 2024
- Length Generalization of Causal Transformers without Position Encoding. Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, Xiaoling Wang. arXiv preprint arXiv:2404.12224, 2024
- Kerple: Kernelized relative positional embedding for length extrapolation. Ta-Chung Chi, Ting-Han Fan, Peter J Ramadge, Alexander Rudnicky. Advances in Neural Information Processing Systems, 35, 8386--8399, 2022
- Functional Interpolation for Relative Positions improves Long Context Transformers. Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, Srinadh Bhojanapalli. The Twelfth International Conference on Learning Representations, 2024
- Dape: Data-adaptive positional encoding for length extrapolation. Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, others. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation. Chuanyang Zheng, Yihang Gao, Han Shi, Jing Xiong, Jiankai Sun, Jingyao Li, Minbin Huang, Xiaozhe Ren, Michael Ng, Xin Jiang, others. arXiv preprint arXiv:2410.04798, 2024
- CLEX: Continuous Length Extrapolation for Large Language Models. Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing. The Twelfth International Conference on Learning Representations, 2024
- Contextual Position Encoding: Learning to Count What's Important. Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar. arXiv preprint arXiv:2405.18719, 2024
- Extending LLMs' Context Window with 100 Samples. Yikai Zhang, Junlong Li, Pengfei Liu. arXiv preprint arXiv:2401.07004, 2024
- Exploring Context Window of Large Language Models via Decomposed Positional Vectors. Zican Dong, Junyi Li, Xin Men, Wayne Xin Zhao, Bingbing Wang, Zhen Tian, Weipeng Chen, Ji-Rong Wen. arXiv preprint arXiv:2405.18009, 2024
- HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position. Kechi Zhang, Ge Li, Huangzhao Zhang, Zhi Jin. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 13615--13627, 2024
- NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. bloc97. 2023
- Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning. bloc97. 2023
- Base of RoPE Bounds Context Length. Xin Men, Mingyu Xu, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, Weipeng Chen. arXiv preprint arXiv:2405.14591, 2024
- Can Perplexity Reflect Large Language Model's Ability in Long Text Understanding? Yutong Hu, Quzhe Huang, Mingxu Tao, Chen Zhang, Yansong Feng. The Second Tiny Papers Track at ICLR 2024, 2024
- What is Wrong with Perplexity for Long-context Language Modeling?. Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang. arXiv preprint arXiv:2410.23771, 2024
- Chatglm: A family of large language models from glm-130b to glm-4 all tools. Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, others. arXiv preprint arXiv:2406.12793, 2024
- Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption. Shi Luohe, Hongyi Zhang, Yao Yao, Zuchao Li, others. First Conference on Language Modeling, 2024
- Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis. Yao Fu. arXiv preprint arXiv:2405.08944, 2024
- Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava. Advances in Neural Information Processing Systems, 36, 2024
- Mistral 7B. Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, others. arXiv preprint arXiv:2310.06825, 2023
- Qwen technical report. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, others. arXiv preprint arXiv:2309.16609, 2023
- Efficient Streaming Language Models with Attention Sinks. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. The Twelfth International Conference on Learning Representations, 2024
- LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models. Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 3991--4008, 2024
- H2o: Heavy-hitter oracle for efficient generative inference of large language models. Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R{\'e}, Clark Barrett, others. Advances in Neural Information Processing Systems, 36, 34661--34710, 2023
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen. arXiv preprint arXiv:2402.09398, 2024
- Model tells you what to discard: Adaptive kv cache compression for llms. Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao. arXiv preprint arXiv:2310.01801, 2023
- Transformers are multi-state rnns. Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz. arXiv preprint arXiv:2401.06104, 2024
- Snapkv: Llm knows what you are looking for before generation. Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen. arXiv preprint arXiv:2404.14469, 2024
- Razorattention: Efficient kv cache compression through retrieval heads. Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang. arXiv preprint arXiv:2407.15891, 2024
- Duoattention: Efficient long-context llm inference with retrieval and streaming heads. Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han. arXiv preprint arXiv:2410.10819, 2024
- KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head. Isaac Rehg. arXiv preprint arXiv:2410.00161, 2024
- Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, others. arXiv preprint arXiv:2406.02069, 2024
- PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference. Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao. arXiv preprint arXiv:2405.12532, 2024
- SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction. Xuan Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin. arXiv preprint arXiv:2410.13846, 2024
- SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation. Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou. arXiv preprint arXiv:2412.13649, 2024
- Dynamic context pruning for efficient and interpretable autoregressive transformers. Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann. Advances in Neural Information Processing Systems, 36, 2024
- Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads. Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu. arXiv preprint arXiv:2410.01805, 2024
- SirLLM: Streaming infinite retentive LLM. Yao Yao, Zuchao Li, Hai Zhao. arXiv preprint arXiv:2405.12528, 2024
- A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression. Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini. arXiv preprint arXiv:2406.11430, 2024
- On the efficacy of eviction policy for key-value constrained generative language model inference. Siyu Ren, Kenny Q Zhu. arXiv preprint arXiv:2402.06262, 2024
- Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters. Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe. arXiv preprint arXiv:2406.12335, 2024
- Infinipot: Infinite context processing on memory-constrained llms. Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang. arXiv preprint arXiv:2410.01518, 2024
- Context compression for auto-regressive transformers with sentinel tokens. Siyu Ren, Qi Jia, Kenny Q Zhu. arXiv preprint arXiv:2310.08152, 2023
- Long Context Compression with Activation Beacon. Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou. arXiv preprint arXiv:2401.03462, 2024
- Anchor-based large language models. Jianhui Pang, Fanghua Ye, Derek Fai Wong, Xin He, Wanshun Chen, Longyue Wang. arXiv preprint arXiv:2402.07616, 2024
- Dynamic memory compression: Retrofitting llms for accelerated inference. Piotr Nawrot, Adrian {\L}a{\'n}cucki, Marcin Chochowski, David Tarjan, Edoardo M Ponti. arXiv preprint arXiv:2403.09636, 2024
- Model tells you where to merge: Adaptive kv cache merging for llms on long-context tasks. Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang. arXiv preprint arXiv:2407.08454, 2024
- CaM: Cache Merging for Memory-efficient LLMs Inference. Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji. Forty-first International Conference on Machine Learning, 2024
- Long-context language modeling with parallel context encoding. Howard Yen, Tianyu Gao, Danqi Chen. arXiv preprint arXiv:2402.16617, 2024
- You only cache once: Decoder-decoder architectures for language models. Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei. arXiv preprint arXiv:2405.05254, 2024
- Goldfinch: High performance rwkv/transformer hybrid with linear pre-fill and extreme kv-cache compression. Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah. arXiv preprint arXiv:2407.12077, 2024
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention. William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly. arXiv preprint arXiv:2405.12981, 2024
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models. Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang. arXiv preprint arXiv:2405.14366, 2024
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models. Haoyi Wu, Kewei Tu. arXiv preprint arXiv:2405.10637, 2024
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing. Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen. arXiv preprint arXiv:2410.18517, 2024
- SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation. Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He. arXiv preprint arXiv:2410.03960, 2024
- MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding. Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji. arXiv preprint arXiv:2406.09297, 2024
- Lossless KV Cache Compression to 2%. Zhen Yang, JN Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang. arXiv preprint arXiv:2410.15252, 2024
- Beyond kv caching: Shared attention for efficient llms. Bingli Liao, Danilo Vasconcellos Vargas. arXiv preprint arXiv:2407.12866, 2024
- Gqa: Training generalized multi-query transformer models from multi-head checkpoints. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr{\'o}n, Sumit Sanghai. arXiv preprint arXiv:2305.13245, 2023
- Fast transformer decoding: One write-head is all you need. Noam Shazeer. arXiv preprint arXiv:1911.02150, 2019
- Head-wise Shareable Attention for Large Language Models. Zouying Cao, Yifei Yang, Hai Zhao. arXiv preprint arXiv:2402.11819, 2024
- DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion. Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun. arXiv preprint arXiv:2406.06567, 2024
- Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, others. arXiv preprint arXiv:2405.04434, 2024
- Effectively Compress KV Heads for LLM. Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu. arXiv preprint arXiv:2406.07056, 2024
- Neurocache: Efficient Vector Retrieval for Long-range Language Modeling. Ali Safaya, Deniz Yuret. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 870--883, 2024
- Palu: Compressing KV-Cache with Low-Rank Projection. Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S Abdelfattah, Kai-Chiang Wu. arXiv preprint arXiv:2407.21118, 2024
- Eigen Attention: Attention in Low-Rank Space for KV Cache Compression. Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy. arXiv preprint arXiv:2408.05646, 2024
- MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection. Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng. arXiv preprint arXiv:2410.14731, 2024
- LORC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy. Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen. arXiv preprint arXiv:2410.03111, 2024
- Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention. Xingtai Lv, Ning Ding, Kaiyan Zhang, Ermo Hua, Ganqu Cui, Bowen Zhou. arXiv preprint arXiv:2411.02063, 2024
- Think: Thinner key cache by query-driven pruning. Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo. arXiv preprint arXiv:2407.21018, 2024
- Kvquant: Towards 10 million context length llm inference with kv cache quantization. Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami. arXiv preprint arXiv:2401.18079, 2024
- Kivi: A tuning-free asymmetric 2bit quantization for kv cache. Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu. arXiv preprint arXiv:2402.02750, 2024
- No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization. June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee. arXiv preprint arXiv:2402.18096, 2024
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models. Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin. arXiv preprint arXiv:2405.06219, 2024
- ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification. Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang. arXiv preprint arXiv:2405.14256, 2024
- Pqcache: Product quantization-based kvcache for long context llm inference. Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui. arXiv preprint arXiv:2407.12820, 2024
- Gear: An efficient kv cache compression recipefor near-lossless generative inference of llm. Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao. arXiv preprint arXiv:2403.05527, 2024
- QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead. Amir Zandieh, Majid Daliri, Insu Han. arXiv preprint arXiv:2406.03482, 2024
- AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations. Qian Tao, Wenyuan Yu, Jingren Zhou. arXiv preprint arXiv:2410.13212, 2024
- Memorizing transformers. Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, Christian Szegedy. arXiv preprint arXiv:2203.08913, 2022
- Titans: Learning to memorize at test time. Ali Behrouz, Peilin Zhong, Vahab Mirrokni. arXiv preprint arXiv:2501.00663, 2024
- Memlong: Memory-augmented retrieval for long text modeling. Weijie Liu, Zecheng Tang, Juntao Li, Kehai Chen, Min Zhang. arXiv preprint arXiv:2408.16967, 2024
- Adapting language models to compress contexts. Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen. arXiv preprint arXiv:2305.14788, 2023
- LLoCO: Learning Long Contexts Offline. Sijun Tan, Xiuyu Li, Shishir Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E Gonzalez, Raluca Ada Popa. arXiv preprint arXiv:2404.07979, 2024
- E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning. Zihan Liao, Jun Wang, Hang Yu, Lingxiao Wei, Jianguo Li, Wei Zhang. arXiv preprint arXiv:2409.06679, 2024
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs. Wenhao Li, Mingbao Lin, Yunshan Zhong, Shuicheng Yan, Rongrong Ji. arXiv preprint arXiv:2406.18173, 2024
- Efficient memory management for large language model serving with pagedattention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, Ion Stoica. Proceedings of the 29th Symposium on Operating Systems Principles, 611--626, 2023
- Prompt cache: Modular attention reuse for low-latency inference. In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong. Proceedings of Machine Learning and Systems, 6, 325--338, 2024
- Sglang: Efficient execution of structured language model programs. Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, others. arXiv preprint arXiv:2312.07104, 2024
- Recurrent memory transformer. Aydar Bulatov, Yury Kuratov, Mikhail Burtsev. Advances in Neural Information Processing Systems, 35, 11079--11091, 2022
- CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory. Zexue He, Leonid Karlinsky, Donghyun Kim, Julian McAuley, Dmitry Krotov, Rogerio Feris. arXiv preprint arXiv:2402.13449, 2024
- Memory3: Language modeling with explicit memory. Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, others. arXiv preprint arXiv:2407.01178, 2024
- Unimem: Towards a unified view of long-context large language models. Junjie Fang, Likai Tang, Hongzhe Bi, Yujia Qin, Si Sun, Zhenyu Li, Haolun Li, Yongjian Li, Xin Cong, Yankai Lin, others. arXiv preprint arXiv:2402.03009, 2024
- Memoryllm: Towards self-updatable large language models. Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, others. arXiv preprint arXiv:2402.04624, 2024
- Walking down the memory maze: Beyond context limit through interactive reading. Howard Chen, Ramakanth Pasunuru, Jason Weston, Asli Celikyilmaz. arXiv preprint arXiv:2310.05029, 2023
- Retrieval-augmented generation for knowledge-intensive nlp tasks. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K{\"u}ttler, Mike Lewis, Wen-tau Yih, Tim Rockt{\"a}schel, others. Advances in Neural Information Processing Systems, 33, 9459--9474, 2020
- Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding. Weizhi Fei, Xueyan Niu, Guoqing Xie, Yanhua Zhang, Bo Bai, Lei Deng, Wei Han. arXiv preprint arXiv:2406.12331, 2024
- BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models. Kun Luo, Zheng Liu, Shitao Xiao, Kang Liu. arXiv preprint arXiv:2402.11573, 2024
- LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering. Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang. arXiv preprint arXiv:2410.18050, 2024
- Longrag: Enhancing retrieval-augmented generation with long-context llms. Ziyan Jiang, Xueguang Ma, Wenhu Chen. arXiv preprint arXiv:2406.15319, 2024
- Long Context vs. RAG for LLMs: An Evaluation and Revisits. Xinze Li, Yixin Cao, Yubo Ma, Aixin Sun. arXiv preprint arXiv:2501.01880, 2024
- Retrieval augmented generation or long-context llms? a comprehensive study and hybrid approach. Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 881--893, 2024
- Introducing RAG2. ContextualAI. 2024
- You Only Use Reactive Attention Slice For Long Context Retrieval. Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao. arXiv preprint arXiv:2409.13695, 2024
- Retrieval meets long context large language models. Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, Bryan Catanzaro. arXiv preprint arXiv:2310.03025, 2023
- Memorybank: Enhancing large language models with long-term memory. Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, Yanlin Wang. Proceedings of the AAAI Conference on Artificial Intelligence, 19724--19731, 2024
- Recurrentgpt: Interactive generation of (arbitrarily) long text. Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, Mrinmaya Sachan. arXiv preprint arXiv:2305.13304, 2023
- Explicit Memory Learning with Expectation Maximization. Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, Xuan-Jing Huang. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 16618--16635, 2024
- Memgpt: Towards llms as operating systems. Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, Joseph E Gonzalez. arXiv preprint arXiv:2310.08560, 2023
- Compact: Compressing retrieved documents actively for question answering. Chanwoong Yoon, Taewhoo Lee, Hyeon Hwang, Minbyul Jeong, Jaewoo Kang. arXiv preprint arXiv:2407.09014, 2024
- From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression. Eunseong Choi, Sunkyung Lee, Minjin Choi, June Park, Jongwuk Lee. arXiv preprint arXiv:2410.04139, 2024
- Perception Compressor: A training-free prompt compression method in long context scenarios. Jiwei Tang, Jin Xu, Tingwei Lu, Zhicheng Zhang, Yiming Zhao, Lin Hai, Hai-Tao Zheng. arXiv preprint arXiv:2409.19272, 2024
- Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability. Tsz Ting Chung, Leyang Cui, Lemao Liu, Xinting Huang, Shuming Shi, Dit-Yan Yeung. arXiv preprint arXiv:2410.11786, 2024
- Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu. arXiv preprint arXiv:2310.06839, 2023
- Extending context window of large language models via semantic compression. Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, Wei Han. arXiv preprint arXiv:2312.09571, 2023
- In-context autoencoder for context compression in a large language model. Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, Furu Wei. arXiv preprint arXiv:2307.06945, 2023
- Efficient Attention: Attention with Linear Complexities. Shen Zhuoran, Zhang Mingyuan, Zhao Haiyu, Yi Shuai, Li Hongsheng. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 3530--3538, 2021
- MoBA: Mixture of Block Attention for Long-Context LLMs. Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu. 2025
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng. arXiv preprint arXiv:2502.11089, 2025
- Transformers are rnns: Fast autoregressive transformers with linear attention. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, Fran{\c{c}}ois Fleuret. International conference on machine learning, 5156--5165, 2020
- Linformer: Self-attention with linear complexity. Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, Hao Ma. arXiv preprint arXiv:2006.04768, 2020
- Rethinking Attention with Performers. Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, others. International Conference on Learning Representations, 2020
- Reformer: The Efficient Transformer. Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya. International Conference on Learning Representations, 2019
- Generating long sequences with sparse transformers. Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever. arXiv preprint arXiv:1904.10509, 2019
- ABC: Attention with Bounded-memory Control. Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu, Lingpeng Kong, Roy Schwartz, Noah A Smith. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022
- Gated Linear Attention Transformers with Hardware-Efficient Training. Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim. Forty-first International Conference on Machine Learning, 2024
- Gated Slot Attention for Efficient Linear-Time Sequence Modeling. Yu Zhang, Songlin Yang, Rui-Jie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, others. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
- Differential transformer. Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei. arXiv preprint arXiv:2410.05258, 2024
- SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization. Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang. arXiv preprint arXiv:2405.11582, 2024
- Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention. Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong. arXiv preprint arXiv:2405.17381, 2024
- Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong. arXiv preprint arXiv:2401.04658, 2024
- SparQ Attention: Bandwidth-Efficient LLM Inference. Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr. arXiv preprint arXiv:2312.04985, 2023
- SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention. Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Huanqi Cao, Xiao Chuanfu, Xingcheng Zhang, others. arXiv preprint arXiv:2406.15486, 2024
- Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, others. arXiv preprint arXiv:2407.02490, 2024
- Post-Training Sparse Attention with Double Sparsity. Shuo Yang, Ying Sheng, Joseph E Gonzalez, Ion Stoica, Lianmin Zheng. arXiv preprint arXiv:2408.07092, 2024
- Retrievalattention: Accelerating long-context llm inference via vector retrieval. Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, others. arXiv preprint arXiv:2409.10516, 2024
- Loki: Low-Rank Keys for Efficient Sparse Attention. Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele. arXiv preprint arXiv:2406.02542, 2024
- Squeezed Attention: Accelerating Long Context Length LLM Inference. Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W Mahoney, Kurt Keutzer, Amir Gholami. arXiv preprint arXiv:2411.09688, 2024
- Selective Attention: Enhancing Transformer through Principled Context Control. Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak. arXiv preprint arXiv:2411.12892, 2024
- Star attention: Efficient llm inference over long sequences. Shantanu Acharya, Fei Jia, Boris Ginsburg. arXiv preprint arXiv:2411.17116, 2024
- Longformer: The long-document transformer. Iz Beltagy, Matthew E Peters, Arman Cohan. arXiv preprint arXiv:2004.05150, 2020
- Big bidirectional insertion representations for documents. Lala Li, William Chan. arXiv preprint arXiv:1910.13034, 2019
- ETC: Encoding long and structured inputs in transformers. Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, Li Yang. arXiv preprint arXiv:2004.08483, 2020
- Do Efficient Transformers Really Save Computation?. Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang. arXiv preprint arXiv:2402.13934, 2024
- Gated slot attention for efficient linear-time sequence modeling. Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, others. arXiv preprint arXiv:2409.07146, 2024
- Fourier transformer: Fast long range modeling by removing sequence redundancy with fft operator. Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen Leng, Zhouhan Lin. arXiv preprint arXiv:2305.15099, 2023
- Magicpig: Lsh sampling for efficient llm generation. Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, others. arXiv preprint arXiv:2410.16179, 2024
- Long short-term memory. Alex Graves. Supervised sequence labelling with recurrent neural networks, 37--45, Springer, 2012
- xLSTM: Extended Long Short-Term Memory. Maximilian Beck, Korbinian P{\"o}ppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, G{\"u}nter Klambauer, Johannes Brandstetter, Sepp Hochreiter. arXiv preprint arXiv:2405.04517, 2024
- xLSTM-Mixer: Multivariate Time Series Forecasting by Mixing via Scalar Memories. Maurice Kraus, Felix Divo, Devendra Singh Dhami, Kristian Kersting. arXiv preprint arXiv:2410.16928, 2024
- Hierarchically gated recurrent neural network for sequence modeling. Zhen Qin, Songlin Yang, Yiran Zhong. Advances in Neural Information Processing Systems, 36, 2024
- Hgrn2: Gated linear rnns with state expansion. Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong. arXiv preprint arXiv:2404.07904, 2024
- Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, Wang-chun Woo. Advances in neural information processing systems, 28, 2015
- Predrnn: A recurrent neural network for spatiotemporal predictive learning. Yunbo Wang, Haixu Wu, Jianjin Zhang, Zhifeng Gao, Jianmin Wang, S Yu Philip, Mingsheng Long. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 2208--2225, IEEE, 2022
- Predrnn++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, S Yu Philip. International conference on machine learning, 5123--5132, 2018
- Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. Yunbo Wang, Jianjin Zhang, Hongyu Zhu, Mingsheng Long, Jianmin Wang, Philip S Yu. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9154--9162, 2019
- Self-attention convlstm for spatiotemporal prediction. Zhihui Lin, Maomao Li, Zhuobin Zheng, Yangyang Cheng, Chun Yuan. Proceedings of the AAAI conference on artificial intelligence, 11531--11538, 2020
- Were RNNs All We Needed?. Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, Hossein Hajimirsadegh. arXiv preprint arXiv:2410.01201, 2024
- Rwkv: Reinventing rnns for the transformer era. Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, others. arXiv preprint arXiv:2305.13048, 2023
- Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence. Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, others. arXiv preprint arXiv:2404.05892, 2024
- HiPPO: Recurrent Memory with Optimal Polynomial Projections. Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher R{\'e}. Advances in Neural Information Processing Systems, 33, 1474--1487, 2020
- Combining recurrent, convolutional, and continuous-time models with linear state space layers. Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Christopher R{\'e}. Advances in neural information processing systems, 34, 572--585, 2021
- Efficiently modeling long sequences with structured state spaces. Albert Gu, Karan Goel, Christopher R{\'e}. arXiv preprint arXiv:2111.00396, 2021
- On the parameterization and initialization of diagonal state space models. Albert Gu, Karan Goel, Ankit Gupta, Christopher R{\'e}. Advances in Neural Information Processing Systems, 35, 35971--35983, 2022
- Hungry hungry hippos: Towards language modeling with state space models. Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, Christopher R{\'e}. arXiv preprint arXiv:2212.14052, 2022
- Mamba: Linear-time sequence modeling with selective state spaces. Albert Gu, Tri Dao. arXiv preprint arXiv:2312.00752, 2023
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. Tri Dao, Albert Gu. Forty-first International Conference on Machine Learning, 2024
- DeciMamba: Exploring the Length Extrapolation Potential of Mamba. Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, Raja Giryes. arXiv preprint arXiv:2406.14528, 2024
- ReMamba: Equip Mamba with Effective Long-Sequence Modeling. Danlong Yuan, Jiahao Liu, Bei Li, Huishuai Zhang, Jingang Wang, Xunliang Cai, Dongyan Zhao. arXiv preprint arXiv:2408.15496, 2024
- Stuffed mamba: State collapse and state capacity of rnn-based long-context modeling. Yingfa Chen, Xinrong Zhang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun. arXiv preprint arXiv:2410.07145, 2024
- SMR: State Memory Replay for Long Sequence Modeling. Biqing Qi, Junqi Gao, Kaiyan Zhang, Dong Li, Jianxing Liu, Ligang Wu, Bowen Zhou. arXiv preprint arXiv:2405.17534, 2024
- Mamba-ptq: Outlier channels in recurrent large language models. Alessandro Pierro, Steven Abreu. arXiv preprint arXiv:2407.12397, 2024
- The mamba in the llama: Distilling and accelerating hybrid models. Junxiong Wang, Daniele Paliotta, Avner May, Alexander M Rush, Tri Dao. arXiv preprint arXiv:2408.15237, 2024
- Falcon mamba: The first competitive attention-free 7b language model. Jingwei Zuo, Maksim Velikanov, Dhia Eddine Rhaiem, Ilyas Chahed, Younes Belkada, Guillaume Kunsch, Hakim Hacid. arXiv preprint arXiv:2410.05355, 2024
- Jamba: A hybrid transformer-mamba language model. Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, others. arXiv preprint arXiv:2403.19887, 2024
- Jamba-1.5: Hybrid transformer-mamba models at scale. Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, others. arXiv preprint arXiv:2408.12570, 2024
- RecurFormer: Not All Transformer Heads Need Self-Attention. Ruiqing Yan, Linghan Zheng, Xingbo Du, Han Zou, Yufeng Guo, Jianfei Yang. arXiv preprint arXiv:2410.12850, 2024
- Hymba: A Hybrid-head Architecture for Small Language Models. Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, others. arXiv preprint arXiv:2411.13676, 2024
- Attamba: Attending To Multi-Token States. Yash Akhauri, Safeen Huda, Mohamed S Abdelfattah. arXiv preprint arXiv:2411.17685, 2024
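
The linear-recurrence and state-space entries above (RWKV, S4, Mamba, and their hybrids) share one core computation: a per-channel gated recurrence whose state size is independent of sequence length. Below is a minimal sketch of that recurrence; all names, shapes, and the decay parameterization are illustrative and do not reproduce any specific model.

```python
import torch

# Minimal diagonal linear recurrence (state-space step). A reference
# recurrence under assumed shapes, not Mamba's selective scan or RWKV's
# token-shift machinery.
def ssm_scan(x, log_a, b, c):
    """x: (seq_len, d_in); log_a: (d_state,) decays; b: (d_in, d_state); c: (d_state, d_in)."""
    h = torch.zeros(log_a.shape[0])
    a = torch.exp(log_a)              # per-channel decay in (0, 1]
    ys = []
    for x_t in x:                     # recurrent form: O(1) state per step
        h = a * h + x_t @ b           # h_t = A h_{t-1} + B x_t
        ys.append(h @ c)              # y_t = C h_t
    return torch.stack(ys)

x = torch.randn(16, 4)
y = ssm_scan(x, log_a=-torch.rand(8), b=torch.randn(4, 8), c=torch.randn(8, 4))
print(y.shape)  # torch.Size([16, 4])
```
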
- Neural ordinary differential equations. Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, David K Duvenaud. Advances in neural information processing systems, 31, 2018
- Liquid time-constant networks. Ramin Hasani, Mathias Lechner, Alexander Amini, Daniela Rus, Radu Grosu. Proceedings of the AAAI Conference on Artificial Intelligence, 7657--7666, 2021
- Closed-form continuous-time neural networks. Ramin Hasani, Mathias Lechner, Alexander Amini, Lucas Liebenwein, Aaron Ray, Max Tschaikowski, Gerald Teschl, Daniela Rus. Nature Machine Intelligence, 4(11), 992--1003, Nature Publishing Group UK London, 2022
- MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling. Xin Xu, Zhouchen Lin. ECAI 2024, 1027--1034, IOS Press, 2024
- MCSD: An Efficient Language Model with Diverse Fusion. Hua Yang, Duohai Li, Shiman Li. arXiv preprint arXiv:2406.12230, 2024
- **AI and memory wall.** Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W Mahoney, Kurt Keutzer. IEEE Micro, IEEE, 2024
- Qwen2 Technical Report. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, Zhihao Fan. arXiv preprint arXiv:2407.10671, 2024
- Qwen2.5 Technical Report. Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu. arXiv preprint arXiv:2412.15115, 2024
- Pytorch distributed: Experiences on accelerating data parallel training. Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, others. arXiv preprint arXiv:2006.15704, 2020
- SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile. Ruisi Zhang, Tianyu Liu, Will Feng, Andrew Gu, Sanket Purandare, Wanchao Liang, Francisco Massa. ArXiv, abs/2411.00284, 2024
- CO2: Efficient distributed training with full communication-computation overlap. Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong. arXiv preprint arXiv:2401.16265, 2024
- PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training. Daiyaan Arfeen, Zhen Zhang, Xinwei Fu, Gregory R. Ganger, Yida Wang. arXiv preprint arXiv:2410.07192, 2024
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models. Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Xiaodong Song, Ion Stoica. International Conference on Machine Learning, 2021
- Zero Bubble (Almost) Pipeline Parallelism. Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin. International Conference on Learning Representations, 2024
- Towards Low-bit Communication for Tensor Parallel LLM Inference. Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush. arXiv preprint arXiv:2411.07942, 2024
- Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training. Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Xinrong Zhang, Zhiyuan Liu, Chuan Shi, Maosong Sun. arXiv preprint arXiv:2406.03488, 2024
- Sequence parallelism: Long sequence training from system perspective. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You. arXiv preprint arXiv:2105.13120, 2021
- A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models. Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, others. arXiv preprint arXiv:2410.07265, 2024
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro. ArXiv, abs/1909.08053, 2019
- **Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.** Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei A. Zaharia. SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14, 2021
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. Zhengda Bian, Hongxin Liu, Boxiang Wang, Haichen Huang, Yongbin Li, Chuan-Qing Wang, Fan Cui, Yang You. Proceedings of the 52nd International Conference on Parallel Processing, 2021
- Ring attention with blockwise transformers for near-infinite context. Hao Liu, Matei Zaharia, Pieter Abbeel. arXiv preprint arXiv:2310.01889, 2023
- Striped Attention: Faster Ring Attention for Causal Transformers. William Brandon, Aniruddha Nrusimha, Kevin Qian, Zack Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley. arXiv preprint arXiv:2311.09431, 2023
- Fp8-lm: Training fp8 large language models. Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, others. arXiv preprint arXiv:2310.18313, 2023
- USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. Jiarui Fang, Shangchun Zhao. arXiv preprint arXiv:2405.07719, 2024
- Loongtrain: Efficient training of long-sequence llms with head-context parallelism. Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, others. arXiv preprint arXiv:2406.18485, 2024
- **Distflashattn: Distributed memory-efficient attention for long-context llms training.** Dacheng Li, Rulin Shao, Anze Xie, Eric P Xing, Xuezhe Ma, Ion Stoica, Joseph E Gonzalez, Hao Zhang. First Conference on Language Modeling, 2024
- Internevo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding. Qiaoling Chen, Diandian Gu, Guoteng Wang, Xun Chen, YingTong Xiong, Ting Huang, Qinghao Hu, Xin Jin, Yonggang Wen, Tianwei Zhang, others. arXiv preprint arXiv:2401.09149, 2024
- Flashattention: Fast and memory-efficient exact attention with io-awareness. Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, Christopher Ré. Advances in Neural Information Processing Systems, 35, 16344--16359, 2022
- **FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.** Tri Dao. The Twelfth International Conference on Learning Representations, 2024
- Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. arXiv preprint arXiv:2407.08608, 2024
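
The FlashAttention line of work above avoids materializing the full attention matrix by streaming key/value blocks through fast memory while maintaining a running softmax. The toy PyTorch loop below shows that online-softmax accumulation in plain tensor code; it is a readability sketch, not the fused kernel.

```python
import torch

def blockwise_attention(q, k, v, block_size=64):
    """Single-head attention over key/value blocks with a running softmax.
    q, k, v: (seq_len, head_dim). Educational reference only."""
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"))
    row_sum = torch.zeros(seq_len, 1)
    for start in range(0, seq_len, block_size):
        k_blk, v_blk = k[start:start + block_size], v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                       # (seq_len, block)
        new_max = torch.maximum(row_max, scores.max(-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)            # rescale old stats
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4)
```
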
- Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He. arXiv preprint arXiv:2309.14509, 2023
- Linear Attention Sequence Parallelism. Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong. arXiv preprint arXiv:2404.02882, 2024
- A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs. Siddharth Singh, Prajwal Singhania, Aditya K Ranjan, Zack Sating, Abhinav Bhatele. arXiv preprint arXiv:2305.13525, 2024
- Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator. Kazuki Fujii, Kohei Watanabe, Rio Yokota. arXiv preprint arXiv:2411.06465, 2024
- Efficient training of large language models on distributed infrastructures: A survey. Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, others. arXiv preprint arXiv:2407.20018, 2024
- TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training. Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos. arXiv preprint arXiv:2410.06511, 2024
- Distributed training of large language models on AWS Trainium. Xinwei Fu, Zhen Zhang, Haozheng Fan, Guangtai Huang, Mohammad El-Shabani, Randy Huang, Rahul Solanki, Fei Wu, Ron Diamant, Yida Wang. Proceedings of the 2024 ACM Symposium on Cloud Computing, 961--976, 2024
- Training deep nets with sublinear memory cost. Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin. arXiv preprint arXiv:1604.06174, 2016
- Optimizing Large Model Training through Overlapped Activation Recomputation. Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, Kexin Huang, Xuan Zhan, Weijian Chen, Yi Zheng, Zhefeng Wang, others. arXiv preprint arXiv:2406.08756, 2024
- Reducing activation recomputation in large transformer models. Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro. Proceedings of Machine Learning and Systems, 5, 341--353, 2023
- Dynamic Tensor Rematerialization. Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock. ArXiv, abs/2006.09616, 2020
- MegTaiChi: dynamic tensor-based memory management optimization for DNN training. Zhongzhe Hu, Junmin Xiao, Zheye Deng, Mingyi Li, Kewei Zhang, Xiaoyang Zhang, Ke Meng, Ninghui Sun, Guangming Tan. Proceedings of the 36th ACM International Conference on Supercomputing, 2022
- Coop: Memory is not a Commodity. Jianhao Zhang, Shihan Ma, Peihong Liu, Jinhui Yuan. Advances in Neural Information Processing Systems, 36, 2024
- Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs. Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui. arXiv preprint arXiv:2407.12117, 2024
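
The activation-recomputation entries above all trade extra forward compute for activation memory: only block inputs are kept, and everything inside a block is recomputed during the backward pass. A minimal sketch with PyTorch's built-in `torch.utils.checkpoint` (the toy MLP blocks and sizes are placeholders):

```python
import torch
from torch.utils.checkpoint import checkpoint

def mlp_block(dim: int) -> torch.nn.Sequential:
    return torch.nn.Sequential(torch.nn.Linear(dim, 4 * dim),
                               torch.nn.GELU(),
                               torch.nn.Linear(4 * dim, dim))

blocks = torch.nn.ModuleList(mlp_block(512) for _ in range(6))
x = torch.randn(4, 512, requires_grad=True)

h = x
for block in blocks:
    # Activations inside `block` are not stored; they are recomputed from the
    # saved block input when backward reaches this point.
    h = checkpoint(block, h, use_reentrant=False)
h.sum().backward()
print(x.grad.shape)  # torch.Size([4, 512])
```
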
- Zero: Memory optimizations toward training trillion parameter models. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 1--16, 2020
- Zero++: Extremely efficient collective communication for giant model training. Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He. arXiv preprint arXiv:2306.10209, 2023
- Pytorch fsdp: experiences on scaling fully sharded data parallel. Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, others. arXiv preprint arXiv:2304.11277, 2023
- MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud. Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul M. Chilimbi, Mu Li, Xin Jin. Proc. VLDB Endow., 16, 37-50, 2022
- RTP: Rethinking Tensor Parallelism with Memory Deduplication. Cheng Luo, Tianle Zhong, Geoffrey Fox. arXiv preprint arXiv:2311.01635, 2023
- Lins: Reducing Communication Overhead of ZeRO for Efficient LLM Training. Qiaoling Chen, Qinghao Hu, Guoteng Wang, Yingtong Xiong, Ting Huang, Xun Chen, Yang Gao, Hang Yan, Yonggang Wen, Tianwei Zhang, others. 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), 1--10, 2024
- Rethinking memory and communication cost for efficient large language model training. Chan Wu, Hanxiao Zhang, Lin Ju, Jinjing Huang, Youshao Xiao, Zhaoxin Huan, Siyuan Li, Fanzhuang Meng, Lei Liang, Xiaolu Zhang, others. arXiv preprint arXiv:2310.06003, 2023
- ROAM: memory-efficient large DNN training via optimized operator ordering and memory layout. Huiyao Shu, Ang Wang, Ziji Shi, Hanyu Zhao, Yong Li, Lu Lu. arXiv preprint arXiv:2310.19295, 2023
- GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching. Cong Guo, Rui Zhang, Jiale Xu, Jingwen Leng, Zihan Liu, Ziyu Huang, Minyi Guo, Hao Wu, Shouren Zhao, Junping Zhao, others. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 450--466, 2024
- Training large neural networks with constant memory using a new execution algorithm. Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, Sujeeth Bharadwaj. arXiv preprint arXiv:2002.05645, 2020
- ZeRO-Offload: Democratizing billion-scale model training. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. 2021 USENIX Annual Technical Conference (USENIX ATC 21), 551--564, 2021
- Stronghold: fast and affordable billion-scale deep learning model training. Xiaoyang Sun, Wei Wang, Shenghao Qiu, Renyu Yang, Songfang Huang, Jie Xu, Zheng Wang. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1--17, 2022
- Harmony: Overcoming the hurdles of gpu memory capacity to train massive dnn models on commodity servers. Youjie Li, Amar Phanishayee, Derek Murray, Jakub Tarnawski, Nam Sung Kim. arXiv preprint arXiv:2202.01306, 2022
- Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. Proceedings of the international conference for high performance computing, networking, storage and analysis, 1--14, 2021
- Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU. Changyue Liao, Mo Sun, Zihan Yang, Kaiqi Chen, Binhang Yuan, Fei Wu, Zeke Wang. arXiv preprint arXiv:2403.06504, 2024
- Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer. Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, Dhabaleswar K Panda. arXiv preprint arXiv:2408.16978, 2024
- Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System. Hongsun Jang, Jaeyong Song, Jaewon Jung, Jaeyoung Park, Youngsok Kim, Jinho Lee. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 345--360, 2024
- Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM. Haiyue Ma, Jian Liu, Ronny Krashinsky. ArXiv, abs/2410.07531, 2024
- Triton: an intermediate language and compiler for tiled neural network computations. Philippe Tillet, Hsiang-Tsung Kung, David Cox. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 10--19, 2019
- MLIR: A compiler infrastructure for the end of Moore's law. Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, Oleksandr Zinenko. arXiv preprint arXiv:2002.11054, 2020
- Flex Attention: A Programming Model for Generating Optimized Attention Kernels. Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, Horace He. arXiv preprint arXiv:2412.05496, 2024
- ThunderKittens: Simple, Fast, and Adorable AI Kernels. Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, Christopher Ré. ArXiv, abs/2410.20399, 2024
- TVM: An automated end-to-end optimizing compiler for deep learning. Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, others. 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 578--594, 2018
- PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, C. K. Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Shunting Zhang, Michael Suo, Phil Tillet, Xu Zhao, Eikan Wang, Keren Zhou, Richard Zou, Xiaodong Wang, Ajit Mathews, William Wen, Gregory Chanan, Peng Wu, Soumith Chintala. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 929–947, Association for Computing Machinery, 2024
- A Multi-Level Superoptimizer for Tensor Programs. Mengdi Wu, Xinhao Cheng, Oded Padon, Zhihao Jia. arXiv preprint arXiv:2405.05751, 2024
- Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution. Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, Cheng Li. arXiv preprint arXiv:2411.15871, 2024
- A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters. Chunyu Xue, Weihao Cui, Han Zhao, Quan Chen, Shulai Zhang, Pengyu Yang, Jing Yang, Shaobo Li, Minyi Guo. arXiv preprint arXiv:2403.16125, 2024
- Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters. Qinghao Hu, Zhisheng Ye, Meng Zhang, Qiaoling Chen, Peng Sun, Yonggang Wen, Tianwei Zhang. 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), 757--777, 2023
- Characterization of large language model development in the datacenter. Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, others. 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 709--729, 2024
- Silod: A co-design of caching and scheduling for deep learning clusters. Hanyu Zhao, Zhenhua Han, Zhi Yang, Quanlu Zhang, Mingxia Li, Fan Yang, Qianxi Zhang, Binyang Li, Yuqing Yang, Lili Qiu, others. Proceedings of the Eighteenth European Conference on Computer Systems, 883--898, 2023
- Mixed precision training. Sharan Narang, Gregory Diamos, Erich Elsen, Paulius Micikevicius, Jonah Alben, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, others. International Conference on Learning Representations, 2017
- A study of BFLOAT16 for deep learning training. Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, others. arXiv preprint arXiv:1905.12322, 2019
- Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi Viji Srinivasan, Xiaodong Cui, Wei Zhang, Kailash Gopalakrishnan. Advances in neural information processing systems, 32, 2019
- Training transformers with 4-bit integers. Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu. Advances in Neural Information Processing Systems, 36, 49146--49168, 2023
- Jetfire: Efficient and accurate transformer pretraining with int8 data flow and per-block quantization. Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu. arXiv preprint arXiv:2403.12422, 2024
- Bitnet: Scaling 1-bit transformers for large language models. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei. arXiv preprint arXiv:2310.11453, 2023
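
The precision papers above progressively push training arithmetic from fp32 to bf16/fp16, fp8, and even int8/int4 or 1-bit weights. The most common baseline, mixed-precision training with a dynamic loss scaler, looks roughly like the following sketch (tiny model, data, and hyperparameters are placeholders; fp8/int quantized recipes need custom kernels not shown here):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.bfloat16
model = torch.nn.Linear(512, 512).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # no-op on CPU

for step in range(3):
    x = torch.randn(8, 512, device=device)
    with torch.autocast(device_type=device, dtype=dtype):
        loss = model(x).pow(2).mean()     # matmuls run in reduced precision
    scaler.scale(loss).backward()         # scale loss to avoid fp16 underflow
    scaler.step(opt)                      # unscale grads, fp32 master update
    scaler.update()
    opt.zero_grad(set_to_none=True)
```
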
- Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried. arXiv preprint arXiv:2401.13649, 2024
- Llm inference unveiled: Survey and roofline model insights. Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, others. arXiv preprint arXiv:2402.16363, 2024
- LightLLM: A Versatile Large Language Model for Predictive Light Sensing. Jiawei Hu, Hong Jia, Mahbub Hassan, Lina Yao, Brano Kusy, Wen Hu. arXiv preprint arXiv:2411.15211, 2024
- Llm inference serving: Survey of recent advances and opportunities. Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari. arXiv preprint arXiv:2407.12391, 2024
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar. arXiv preprint arXiv:2405.04437, 2024
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving. Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, others. arXiv preprint arXiv:2407.15309, 2024
- Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference. Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, others. arXiv preprint arXiv:2401.08671, 2024
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, Ramachandran Ramjee. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 117--134, 2024
- Memorize step by step: Efficient long-context prefilling with incremental memory and decremental chunk. Zhiyuan Zeng, Qipeng Guo, Xiaoran Liu, Zhangyue Yin, Wentao Shu, Mianqiu Huang, Bo Wang, Yunhua Zhou, Linlin Li, Qun Liu, others. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 21021--21034, 2024
- Flexgen: High-throughput generative inference of large language models with a single gpu. Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang. International Conference on Machine Learning, 31094--31116, 2023
- FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines. Jiaao He, Jidong Zhai. arXiv preprint arXiv:2403.11421, 2024
- NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference. Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu. arXiv preprint arXiv:2411.01142, 2024
- AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving. Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo. arXiv preprint arXiv:2403.19708, 2024
- InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 155--172, 2024
- Deja vu: Contextual sparsity for efficient llms at inference time. Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, others. International Conference on Machine Learning, 22137--22176, 2023
- Llm in a flash: Efficient large language model inference with limited memory. Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar. arXiv preprint arXiv:2312.11514, 2023
- Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, others. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1--15, 2022
- Accelerate: Training and inference at scale made simple, efficient and adaptable. Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, Benjamin Bossan. 2022
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, others. arXiv preprint arXiv:2501.01005, 2025
- Hydragen: High-Throughput LLM Inference with Shared Prefixes. Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y Fu, Christopher Ré, Azalia Mirhoseini. arXiv preprint arXiv:2402.05099, 2024
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable. Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu. arXiv preprint arXiv:2405.19888, 2024
- Flashdecoding++: Faster large language model inference on gpus. Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang. arXiv preprint arXiv:2311.01282, 2023
- Balanced Data Placement for GEMV Acceleration with Processing-In-Memory. Mohamed Assem Ibrahim, Mahzabeen Islam, Shaizeen Aga. arXiv preprint arXiv:2403.20297, 2024
- Orca: A Distributed Serving System for Transformer-Based Generative Models. Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 521--538, USENIX Association, 2022
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models. Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie. arXiv preprint arXiv:2410.15332, 2024
- CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion. Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang. arXiv preprint arXiv:2405.16444, 2024
- Context Parallelism for Scalable Million-Token Inference. Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, others. arXiv preprint arXiv:2411.01783, 2024
- Llumnix: Dynamic Scheduling for Large Language Model Serving. Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin. arXiv preprint arXiv:2406.03243, 2024
- Splitwise: Efficient generative llm inference using phase splitting. Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 118--132, 2024
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 193--210, 2024
- Mooncake: A kvcache-centric disaggregated architecture for llm serving. Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu. arXiv preprint arXiv:2407.00079, 2024
- Inference without interference: Disaggregate llm inference for mixed downstream workloads. Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, others. arXiv preprint arXiv:2401.11181, 2024
- P/d-serve: Serving disaggregated large language model at scale. Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, others. arXiv preprint arXiv:2408.08147, 2024
- Memserve: Context caching for disaggregated llm serving with elastic memory pool. Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, others. arXiv preprint arXiv:2406.17565, 2024
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving. Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang. arXiv preprint arXiv:2407.00023, 2024
- Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache. Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, others. arXiv preprint arXiv:2401.02669, 2024
- Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 640--654, 2024
- Powerinfer: Fast large language model serving with a consumer-grade gpu. Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 590--606, 2024
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone. Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen. arXiv preprint arXiv:2406.06282, 2024
- **LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM.** LMDeploy Contributors. 2023
- On-device language models: A comprehensive review. Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, Ziyuan Ling. arXiv preprint arXiv:2409.00088, 2024
- Generalizing an LLM from 8k to 1M Context using Qwen-Agent. Qwen Team. 2024
- A First Look at LLM-powered Smartphones. Liangxuan Wu, Yanjie Zhao, Chao Wang, Tianming Liu, Haoyu Wang. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops, 208--217, 2024
- Llm as a system service on mobile devices. Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu. arXiv preprint arXiv:2403.11805, 2024
- Bluelm-v-3b: Algorithm and system co-design for multimodal large language models on mobile devices. Xudong Lu, Yinghao Chen, Cheng Chen, Hui Tan, Boheng Chen, Yina Xie, Rui Hu, Guanxin Tan, Renshou Wu, Yan Hu, others. arXiv preprint arXiv:2411.10640, 2024
- Minicpm-v: A gpt-4v level mllm on your phone. Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, others. arXiv preprint arXiv:2408.01800, 2024
- RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices. Wonkyo Choe, Yangfeng Ji, Felix Lin. arXiv preprint arXiv:2412.10856, 2024
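
Several of the KV-cache serving entries above manage the cache in fixed-size blocks or pages rather than one contiguous tensor per request, so memory can be allocated on demand and reclaimed without fragmentation. The sketch below shows the basic block-table bookkeeping behind that idea in isolation; block size, pool size, and class names are invented for illustration, and real systems (including those that rely on virtual memory instead of explicit tables) differ substantially.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical KV block (illustrative)

@dataclass
class KVCachePool:
    num_blocks: int
    free_blocks: list = field(default_factory=list)   # unused physical blocks
    block_tables: dict = field(default_factory=dict)  # seq_id -> [block ids]
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: int):
        """Reserve room for one more token, grabbing a block when the last is full."""
        used = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if used % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = used + 1

    def release(self, seq_id: int):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

pool = KVCachePool(num_blocks=8)
for _ in range(40):                  # 40 tokens -> ceil(40 / 16) = 3 blocks
    pool.append_token(seq_id=0)
print(len(pool.block_tables[0]))     # 3
pool.release(0)
```
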
- Data Engineering for Scaling Language Models to 128K Context. Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng. Forty-first International Conference on Machine Learning, 2024
- Longwanjuan: Towards systematic measurement for long text quality. Kai Lv, Xiaoran Liu, Qipeng Guo, Hang Yan, Conghui He, Xipeng Qiu, Dahua Lin. arXiv preprint arXiv:2402.13583, 2024
- Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models. Longze Chen, Ziqiang Liu, Wanwei He, Yunshui Li, Run Luo, Min Yang. arXiv preprint arXiv:2405.17915, 2024
- How to train long-context language models (effectively). Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen. arXiv preprint arXiv:2410.02660, 2024
- In-Context Pretraining: Language Modeling Beyond Document Boundaries. Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A Smith, Luke Zettlemoyer, Wen-tau Yih, Mike Lewis. The Twelfth International Conference on Learning Representations, 2024
- Structured packing in llm training improves long context utilization. Konrad Staniszewski, Szymon Tworkowski, Sebastian Jaszczur, Yu Zhao, Henryk Michalewski, Łukasz Kuciński, Piotr Miłoś. arXiv preprint arXiv:2312.17296, 2023
- Analysing The Impact of Sequence Composition on Language Model Pre-Training. Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu, Piotr Miłoś, Yuxiang Wu, Pasquale Minervini. arXiv preprint arXiv:2402.13991, 2024
- Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model. Chaochen Gao, Xing Wu, Qi Fu, Songlin Hu. arXiv preprint arXiv:2405.19846, 2024
- DataSculpt: Crafting Data Landscapes for Long-Context LLMs through Multi-Objective Partitioning. Keer Lu, Xiaonan Nie, Zheng Liang, Da Pan, Shusen Zhang, Keshi Zhao, Weipeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, others. arXiv preprint arXiv:2409.00997, 2024
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models. Liang Zhao, Tianwen Wei, Liang Zeng, Cheng Cheng, Liu Yang, Peng Cheng, Lijie Wang, Chenxia Li, Xuejie Wu, Bo Zhu, others. arXiv preprint arXiv:2406.00605, 2024
- Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models. Junfeng Tian, Da Zheng, Yang Cheng, Rui Wang, Colin Zhang, Debing Zhang. arXiv preprint arXiv:2409.04774, 2024
- Training With “Paraphrasing the Original Text” Teaches LLM to Better Retrieve in Long-context Tasks. Yijiong Yu, Yongfeng Huang, Zhixiao Qi, Zhe Zhou. arXiv preprint arXiv:2312.11193, 2023
- LongForm: Effective Instruction Tuning with Reverse Instructions. Abdullatif Köksal, Timo Schick, Anna Korhonen, Hinrich Schütze. arXiv preprint arXiv:2304.08460, 2023
- Longalign: A recipe for long context alignment of large language models. Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li. arXiv preprint arXiv:2401.18058, 2024
- MDCure: A Scalable Pipeline for Multi-Document Instruction-Following. Gabrielle Kaili-May Liu, Bowen Shi, Avi Caciularu, Idan Szpektor, Arman Cohan. arXiv e-prints, arXiv--2410, 2024
- Make Your LLM Fully Utilize the Context. Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou. arXiv preprint arXiv:2404.16811, 2024
- What are the essential factors in crafting effective long context multi-hop instruction datasets? insights and best practices. Zhi Chen, Qiguang Chen, Libo Qin, Qipeng Guo, Haijun Lv, Yicheng Zou, Wanxiang Che, Hang Yan, Kai Chen, Dahua Lin. arXiv preprint arXiv:2409.01893, 2024
- Longcite: Enabling llms to generate fine-grained citations in long-context qa. Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, others. arXiv preprint arXiv:2409.02897, 2024
- Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement. Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun. arXiv preprint arXiv:2410.15633, 2024
- Making long-context language models better multi-hop reasoners. Yanyang Li, Shuo Liang, Michael R Lyu, Liwei Wang. arXiv preprint arXiv:2408.03246, 2024
- Long context alignment with short instructions and synthesized positions. Wenhao Wu, Yizhong Wang, Yao Fu, Xiang Yue, Dawei Zhu, Sujian Li. arXiv preprint arXiv:2405.03939, 2024
- Long Context Understanding using Self-Generated Synthetic Data. Jerry Li, Subhro Das, Aude Oliva, Dmitry Krotov, Leonid Karlinsky, Rogerio Feris. First Workshop on Long-Context Foundation Models @ ICML 2024, 2024
- LOGO--Long cOntext aliGnment via efficient preference Optimization. Zecheng Tang, Zechen Sun, Juntao Li, Qiaoming Zhu, Min Zhang. arXiv preprint arXiv:2410.18533, 2024
- LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization. Anonymous. Submitted to The Thirteenth International Conference on Learning Representations, 2024
- LongReward: Improving Long-context Large Language Models with AI Feedback. Jiajie Zhang, Zhongni Hou, Xin Lv, Shulin Cao, Zhenyu Hou, Yilin Niu, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li. arXiv preprint arXiv:2410.21252, 2024
- Suri: Multi-constraint instruction following for long-form text generation. Chau Minh Pham, Simeng Sun, Mohit Iyyer. arXiv preprint arXiv:2406.19371, 2024
- Longwriter: Unleashing 10,000+ word generation from long context llms. Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li. arXiv preprint arXiv:2408.07055, 2024
- ConTReGen: Context-driven Tree-structured Retrieval for Open-domain Long-form Text Generation. Kashob Kumar Roy, Pritom Saha Akash, Kevin Chen-Chuan Chang, Lucian Popa. arXiv preprint arXiv:2410.15511, 2024
- Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective. Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo, Xuanjing Huang, Xipeng Qiu. arXiv preprint arXiv:2412.14135, 2024
- O1 Replication Journey: A Strategic Progress Report--Part 1. Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, others. arXiv preprint arXiv:2410.18982, 2024
- O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?. Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, Pengfei Liu. arXiv preprint arXiv:2411.16489, 2024
- Language Models can Self-Lengthen to Generate Long Texts. Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang Zhang, Jingren Zhou, Junyang Lin. arXiv preprint arXiv:2410.23933, 2024
- Beyond chinchilla-optimal: Accounting for inference in language model scaling laws. Nikhil Sardana, Jacob Portes, Sasha Doubov, Jonathan Frankle. arXiv preprint arXiv:2401.00448, 2023
- Scaling llm test-time compute optimally can be more effective than scaling model parameters. Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar. arXiv preprint arXiv:2408.03314, 2024
- Integrating Planning into Single-Turn Long-Form Text Generation. Yi Liang, You Wu, Honglei Zhuang, Li Chen, Jiaming Shen, Yiling Jia, Zhen Qin, Sumit Sanghai, Xuanhui Wang, Carl Yang, others. arXiv preprint arXiv:2410.06203, 2024
- Large Language Models Can Self-Improve in Long-context Reasoning. Siheng Li, Cheng Yang, Zesen Cheng, Lemao Liu, Mo Yu, Yujiu Yang, Wai Lam. arXiv preprint arXiv:2411.08147, 2024
- LUQ: Long-text Uncertainty Quantification for LLMs. Caiqi Zhang, Fangyu Liu, Marco Basaldella, Nigel Collier. arXiv preprint arXiv:2403.20279, 2024
- From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data. Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos. 2024
- With Greater Text Comes Greater Necessity: Inference-Time Training Helps Long Text Generation. Y Wang, D Ma, D Cai. arXiv preprint arXiv:2401.11504, 2024
- Learning to (learn at test time): Rnns with expressive hidden states. Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, others. arXiv preprint arXiv:2407.04620, 2024
- VideoRoPE: What Makes for Good Video Rotary Position Embedding?. Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, others. arXiv preprint arXiv:2502.05173, 2025
- Visual instruction tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. Advances in neural information processing systems, 36, 2024
- Improved baselines with visual instruction tuning. Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 26296--26306, 2024
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee. 2024
- LLaVA-NeXT: A Strong Zero-shot Video Understanding Model. Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, Chunyuan Li. 2024
- Llava-onevision: Easy visual task transfer. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, others. arXiv preprint arXiv:2408.03326, 2024
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu. Findings of the Association for Computational Linguistics: EMNLP 2023, 15757--15773, 2023
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation. Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, Xipeng Qiu. arXiv preprint arXiv:2401.13527, 2024
- Anygpt: Unified multimodal llm with discrete sequence modeling. Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, others. arXiv preprint arXiv:2402.12226, 2024
- GRAM: Global reasoning for multi-page VQA. Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15598--15607, 2024
- Focus Anywhere for Fine-grained Multi-page Document Understanding. Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang. arXiv preprint arXiv:2405.14295, 2024
- mplug-docowl2: High-resolution compressing for ocr-free multi-page document understanding. Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou. arXiv preprint arXiv:2409.03420, 2024
- WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling. Xudong Xie, Liang Yin, Hao Yan, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai. arXiv preprint arXiv:2410.05970, 2024
- PostDoc: Generating Poster from a Long Multimodal Document Using Deep Submodular Optimization. Vijay Jaisankar, Sambaran Bandyopadhyay, Kalp Vyas, Varre Chaitanya, Shwetha Somasundaram. arXiv preprint arXiv:2405.20213, 2024
- mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval. Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, others. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 1393--1412, 2024
- Mmlongbench-doc: Benchmarking long-context document understanding with visualizations. Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, others. arXiv preprint arXiv:2407.01523, 2024
- M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework. Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing. arXiv preprint arXiv:2411.06176, 2024
- Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, others. Advances in Neural Information Processing Systems, 36, 2024
- Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution. Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao. arXiv preprint arXiv:2409.12961, 2024
- Monkey: Image resolution and text label are important things for large multi-modal models. Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 26763--26773, 2024
- Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, others. arXiv preprint arXiv:2311.07575, 2023
- Dreamlip: Language-image pre-training with long captions. Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, Yujun Shen. European Conference on Computer Vision, 73--90, 2025
- Long-clip: Unlocking the long-text capability of clip. Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang. European Conference on Computer Vision, 310--325, 2025
- LoTLIP: Improving Language-Image Pre-training for Long Text Understanding. Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, Zheng-Jun Zha. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
- VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models. Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, Lianwen Jin. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 16061--16075, 2024
- Tulip: Token-length upgraded clip. Ivona Najdenkoska, Mohammad Mahdi Derakhshani, Yuki M Asano, Nanne van Noord, Marcel Worring, Cees GM Snoek. arXiv preprint arXiv:2410.10034, 2024
- Llm2clip: Powerful language model unlock richer visual representation. Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, others. arXiv preprint arXiv:2411.04997, 2024
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. International conference on machine learning, 19730--19742, 2023
- From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding. Heqing Zou, Tianze Luo, Guiyang Xie, Fengmao Lv, Guangcong Wang, Juanyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang, others. arXiv preprint arXiv:2409.18938, 2024
- GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models. Mukai Li, Lei Li, Shansan Gong, Qi Liu. arXiv preprint arXiv:2412.12735, 2024
- Llama-vid: An image is worth 2 tokens in large language models. Yanwei Li, Chengyao Wang, Jiaya Jia. European Conference on Computer Vision, 323--340, 2025
- Slowfast-llava: A strong training-free baseline for video large language models. Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan. arXiv preprint arXiv:2407.15841, 2024
- A simple llm framework for long-range video question-answering. Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius. arXiv preprint arXiv:2312.17235, 2023
- Language repository for long video understanding. Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S Ryoo. arXiv preprint arXiv:2403.14622, 2024
- Understanding Long Videos in One Multimodal Language Model Pass. Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S Ryoo. arXiv preprint arXiv:2403.16998, 2024
- Videoagent: A memory-augmented multimodal agent for video understanding. Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li. European Conference on Computer Vision, 75--92, 2025
- Videoagent: Long-form video understanding with large language model as agent. Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy. European Conference on Computer Vision, 58--76, 2025
- An image grid can be worth a video: Zero-shot video question answering using a vlm. Wonkyun Kim, Changin Choi, Wonseok Lee, Wonjong Rhee. arXiv preprint arXiv:2403.18406, 2024
- Long context transfer from language to vision. Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu. arXiv preprint arXiv:2406.16852, 2024
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output. Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, others. arXiv preprint arXiv:2407.03320, 2024
- FreeVA: Offline MLLM as Training-Free Video Assistant. Wenhao Wu. arXiv preprint arXiv:2405.07798, 2024
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. Hang Zhang, Xin Li, Lidong Bing. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 543--553, 2023
- Moviechat: From dense token to sparse memory for long video understanding. Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, others. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18221--18232, 2024
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding. Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13504--13514, 2024
- Vidcompress: Memory-enhanced temporal compression for video understanding in large language models. Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma. arXiv preprint arXiv:2410.11417, 2024
- Vista-llama: Reliable video narrator via equal distance to visual tokens. Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, Yi Yang. arXiv preprint arXiv:2312.08870, 2023
- Timechat: A time-sensitive multimodal large language model for long video understanding. Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14313--14323, 2024
- Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning. Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang. Forty-first International Conference on Machine Learning, 2024
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs. Shukang Yin, Chaoyou Fu, Sirui Zhao, Yunhang Shen, Chunjiang Ge, Yan Yang, Zuwei Long, Yuhan Dai, Tong Xu, Xing Sun, others. arXiv preprint arXiv:2411.19951, 2024
- Text-conditioned resampler for long form video understanding. Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari. European Conference on Computer Vision, 271--288, 2025
- Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge. Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Yang Liu, Zilong Zheng. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 9972--9987, 2024
- LVCHAT: Facilitating Long Video Comprehension. Yu Wang, Zeyuan Zhang, Julian McAuley, Zexue He. arXiv preprint arXiv:2402.12079, 2024
- Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA. Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryoo, Donghyun Kim, Michael S Ryoo. arXiv preprint arXiv:2406.09396, 2024
- KeyVideoLLM: Towards Large-scale Video Keyframe Selection. Hao Liang, Jiapeng Li, Tianyi Bai, Xijie Huang, Linzhuang Sun, Zhengren Wang, Conghui He, Bin Cui, Chong Chen, Wentao Zhang. arXiv preprint arXiv:2407.03104, 2024
- Frame-Voyager: Learning to Query Frames for Video Large Language Models. Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xiaolei Xu, Zhenbang Sun, Bingni Zhang, Jiawei Wu, others. arXiv preprint arXiv:2410.03226, 2024
- Streaming long video understanding with large language models. Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang. arXiv preprint arXiv:2405.16009, 2024
- Videollamb: Long-context video understanding with recurrent memory bridges. Yuxuan Wang, Cihang Xie, Yang Liu, Zilong Zheng. arXiv preprint arXiv:2409.01071, 2024
- VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation. Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen. arXiv preprint arXiv:2412.00927, 2024
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, others. arXiv preprint arXiv:2406.07476, 2024
- Kangaroo: A powerful video-language model supporting long-context video input. Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, Jie Hu. arXiv preprint arXiv:2408.15542, 2024
- Pllava: Parameter-free llava extension from images to videos for video dense captioning. Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng. arXiv preprint arXiv:2404.16994, 2024
- Longvu: Spatiotemporal adaptive compression for long video-language understanding. Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, others. arXiv preprint arXiv:2410.17434, 2024
- Longvlm: Efficient long video understanding via large language models. Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang. European Conference on Computer Vision, 453--470, 2025
- Matryoshka Multimodal Models. Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee. arXiv preprint arXiv:2405.17430, 2024
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding. Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou. Findings of the Association for Computational Linguistics: EMNLP 2023, 932--947, 2023
- Video-chatgpt: Towards detailed video understanding via large vision and language models. Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan. arXiv preprint arXiv:2306.05424, 2023
- Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, others. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 24185--24198, 2024
- Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, others. arXiv preprint arXiv:2409.12191, 2024
- TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations. Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong. arXiv preprint arXiv:2409.03206, 2024
- Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner. Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, others. arXiv preprint arXiv:2409.12963, 2024
- V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding. Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, Xizhou Zhu. arXiv preprint arXiv:2412.09616, 2024
- An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang. European Conference on Computer Vision, 19--35, 2025
- VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration. Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu. arXiv preprint arXiv:2410.23317, 2024
- Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, others. arXiv preprint arXiv:2410.17247, 2024
- Zipvl: Efficient large vision-language models with dynamic token sparsification and kv cache compression. Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang. arXiv preprint arXiv:2410.08584, 2024
- VisionZip: Longer is Better but Not Necessary in Vision Language Models. Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia. arXiv preprint arXiv:2412.04467, 2024
- LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference. Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan. Findings of the Association for Computational Linguistics: EMNLP 2024, 4065--4078, 2024
- Efficient inference of vision instruction-following models with elastic cache. Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, Jiwen Lu. European Conference on Computer Vision, 54--69, 2025
- VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos. Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal. arXiv preprint arXiv:2405.19209, 2024
- S4nd: Modeling images and videos as multidimensional signals with state spaces. Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, Christopher Ré. Advances in neural information processing systems, 35, 2846--2861, 2022
- Long movie clip classification with state-space video models. Md Mohaiminul Islam, Gedas Bertasius. European Conference on Computer Vision, 87--104, 2022
- Selective structured state-spaces for long-form video understanding. Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, Raffay Hamid. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6387--6397, 2023
- Videomamba: State space model for efficient video understanding. Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao. European Conference on Computer Vision, 237--255, 2025
- Video mamba suite: State space model as a versatile alternative for video understanding. Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, Limin Wang. arXiv preprint arXiv:2403.09626, 2024
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture. Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang. arXiv preprint arXiv:2409.02889, 2024
- Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling. Georgios Pantazopoulos, Malvina Nikandrou, Alessandro Suglia, Oliver Lemon, Arash Eshghi. arXiv preprint arXiv:2409.05395, 2024
- Longvila: Scaling long-context visual language models for long videos. Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, others. arXiv preprint arXiv:2408.10188, 2024
- Don't Look Twice: Faster Video Transformers with Run-Length Tokenization. Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris M Kitani, Laszlo Attila Jeni. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
- Temporal reasoning transfer from text to video. Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, Qi Liu. arXiv preprint arXiv:2410.06166, 2024
- Egoschema: A diagnostic benchmark for very long-form video language understanding. Karttikeya Mangalam, Raiymbek Akshulakov, Jitendra Malik. Advances in Neural Information Processing Systems, 36, 46212--46244, 2023
- Movqa: A benchmark of versatile question-answering for long-form movie understanding. Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang, Limin Wang, Yu Qiao. arXiv preprint arXiv:2312.04817, 2023
- Milebench: Benchmarking MLLMs in long context. Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang. First Conference on Language Modeling, 2024
- Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, others. arXiv preprint arXiv:2405.21075, 2024
- MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding. Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu. arXiv preprint arXiv:2406.04264, 2024
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding. Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen. arXiv preprint arXiv:2406.14515, 2024
- Lvbench: An extreme long video understanding benchmark. Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, others. arXiv preprint arXiv:2406.08035, 2024
- LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding. Haoning Wu, Dongxu Li, Bei Chen, Junnan Li. The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024
- LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos. Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng. arXiv preprint arXiv:2411.19772, 2024
- Neptune: The Long Orbit to Benchmarking Long Video Understanding. Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, others. arXiv preprint arXiv:2412.09582, 2024
- Training verifiers to solve math word problems. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, others. arXiv preprint arXiv:2110.14168, 2021
- Compressive Transformers for Long-Range Sequence Modelling. Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, Timothy P Lillicrap. International Conference on Learning Representations, 2020
- Pointer Sentinel Mixture Models. Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. International Conference on Learning Representations, 2017
- The narrativeqa reading comprehension challenge. Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, Edward Grefenstette. Transactions of the Association for Computational Linguistics, 6, 317--328, 2018
- QuALITY: Question Answering with Long Input Texts, Yes! Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, others. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5336--5358, 2022
- A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, Matt Gardner. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4599--4610, 2021
- Efficient Attentions for Long Document Summarization. Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, Lu Wang. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1419--1436, 2021
- QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, others. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5905--5921, 2021
- HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, Christopher D Manning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2369--2380, 2018
- MuSiQue: Multihop Questions via Single-hop Question Composition. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal. Transactions of the Association for Computational Linguistics, 10, 539--554, 2022
- Long Range Arena: A Benchmark for Efficient Transformers. Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler. International Conference on Learning Representations, 2021
- SCROLLS: Standardized CompaRison Over Long Language Sequences. Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, others. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 12007--12021, 2022
- ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding. Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy. Findings of the Association for Computational Linguistics: EMNLP 2023, 7977--7989, 2023
- Scbench: A kv cache-centric analysis of long-context methods. Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, others. arXiv preprint arXiv:2412.10319, 2024
- MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning. Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen. arXiv preprint arXiv:2502.09933, 2025
- L-eval: Instituting standardized evaluation for long context language models. Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, Xipeng Qiu. arXiv preprint arXiv:2307.11088, 2023
- Longbench: A bilingual, multitask benchmark for long context understanding. Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, others. arXiv preprint arXiv:2308.14508, 2023
- BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models. Zican Dong, Tianyi Tang, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2086--2099, 2024
- M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models. Wai-Chung Kwan, Xingshan Zeng, Yufei Wang, Yusen Sun, Liangyou Li, Lifeng Shang, Qun Liu, Kam-Fai Wong. arXiv preprint arXiv:2310.19240, 2023
- LooGLE: Can Long-Context Language Models Understand Long Contexts? Jiaqi Li, Mengmeng Wang, Zilong Zheng, Muhan Zhang. arXiv preprint arXiv:2311.04939, 2023
- InfiniteBench: Extending long context evaluation beyond 100k tokens. Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, others. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 15262--15277, 2024
- Lv-eval: A balanced long-context benchmark with 5 length levels up to 256k. Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, others. arXiv preprint arXiv:2402.05136, 2024
- Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective. Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang. arXiv preprint arXiv:2406.13282, 2024
- Round and Round We Go! What makes Rotary Positional Encodings useful? Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković. arXiv preprint arXiv:2410.06205, 2024
- On the token distance modeling ability of higher RoPE attention dimension. Xiangyu Hong, Che Jiang, Biqing Qi, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou. arXiv preprint arXiv:2410.08703, 2024
- Circuit Complexity Bounds for RoPE-based Transformer Architecture. Bo Chen, Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song. arXiv preprint arXiv:2411.07602, 2024
- Clongeval: A Chinese benchmark for evaluating long-context large language models. Zexuan Qiu, Jingjing Li, Shijue Huang, Xiaoqi Jiao, Wanjun Zhong, Irwin King. arXiv preprint arXiv:2403.03514, 2024
- XL2Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies. Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li. arXiv preprint arXiv:2404.05446, 2024
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks. Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 3712--3724, 2024
- Leave no document behind: Benchmarking long-context llms with extended multi-doc qa. Minzheng Wang, Longze Chen, Fu Cheng, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, others. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 5627--5646, 2024
- RepoQA: Evaluating Long Context Code Understanding. Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun Yang, Lingming Zhang. arXiv preprint arXiv:2406.06025, 2024
- Docfinqa: A long-context financial reasoning dataset. Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, Chris Tanner. arXiv preprint arXiv:2401.06915, 2024
- Systematic Evaluation of Long-Context LLMs on Financial Concepts. Lavanya Gupta, Saket Sharma, Yiyun Zhao. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 1163--1175, 2024
- Needle In A Haystack - Pressure Testing LLMs. Greg Kamradt. 2023
- Multi Needle in a Haystack. LangChain. 2024
- NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? Mo Li, Songyang Zhang, Yunxin Liu, Kai Chen. arXiv preprint arXiv:2407.11963, 2024
- Where am I? Large Language Models Wandering between Semantics and Structures in Long Contexts. Seonmin Koo, Jinsung Kim, YoungJoon Jang, Chanjun Park, Heui-Seok Lim. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 14144--14160, 2024
- Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP. Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, Reut Tsarfaty. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 16576--16586, 2024
- Hyper-multi-step: The Truth Behind Difficult Long-context Tasks. Yijiong Yu, Ma Xiufa, Fang Jianwei, Zhi Xu, Su Guangyao, Wang Jiancheng, Yongfeng Huang, Zhixiao Qi, Wei Wang, Weifeng Liu, others. arXiv preprint arXiv:2410.04422, 2024
- Enabling Large Language Models to Generate Text with Citations. Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 6465--6488, 2023
- Evaluating and Training Long-Context Large Language Models for Question Answering on Scientific Papers. Lukas Hilgert, Danni Liu, Jan Niehues. Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U), 220--236, 2024
- L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding? Zecheng Tang, Keyan Zhou, Juntao Li, Baibei Ji, Jianye Hou, Min Zhang. arXiv preprint arXiv:2410.02115, 2024
- Attribute or Abstain: Large Language Models as Long Document Assistants. Jan Buchmann, Xiao Liu, Iryna Gurevych. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 8113--8140, 2024
- S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Model. Fangyu Lei, Qian Liu, Yiming Huang, Shizhu He, Jun Zhao, Kang Liu. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 1259--1286, 2024
- Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, others. Advances in Neural Information Processing Systems, 36, 2024
- Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, others. arXiv preprint arXiv:2411.07763, 2024
- Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data. Seiji Maekawa, Hayate Iso, Nikita Bhutani. arXiv preprint arXiv:2410.11996, 2024
- What In-Context Learning “Learns” In-Context: Disentangling Task Recognition and Task Learning. Jane Pan, Tianyu Gao, Howard Chen, Danqi Chen. Findings of the Association for Computational Linguistics: ACL 2023, 8298--8319, 2023
- Long-context llms struggle with long in-context learning. Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen. arXiv preprint arXiv:2404.02060, 2024
- Helmet: How to evaluate long-context language models effectively and thoroughly. Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, Danqi Chen. arXiv preprint arXiv:2410.02694, 2024
- Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack. Xiaoyue Xu, Qinyuan Ye, Xiang Ren. The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024
- DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities. Hui Dai, Dan Pechi, Xinyi Yang, Garvit Banga, Raghav Mantri. arXiv preprint arXiv:2411.19360, 2024
- AcademicEval: Live Long-Context LLM Benchmark. Haozhen Zhang, Tao Feng, Pengrui Han, Jiaxuan You. 2024
- Marathon: A race through the realm of long context with large language models. Lei Zhang, Yunshui Li, Ziqiang Liu, Junhao Liu, Longze Chen, Run Luo, Min Yang, others. arXiv preprint arXiv:2312.09542, 2023
- Counting-stars: A multi-evidence, position-aware, and scalable benchmark for evaluating long-context large language models. Mingyang Song, Mao Zheng, Xuan Luo. arXiv preprint arXiv:2403.11802, 2024
- RULER: What's the Real Context Size of Your Long-Context Language Models?. Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg. arXiv preprint arXiv:2404.06654, 2024
- BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev. arXiv preprint arXiv:2406.10149, 2024
- Proxyqa: An alternative framework for evaluating long-form text generation with large language models. Haochen Tan, Zhijiang Guo, Zhan Shi, Lu Xu, Zhili Liu, Yunlong Feng, Xiaoguang Li, Yasheng Wang, Lifeng Shang, Qun Liu, others. arXiv preprint arXiv:2401.15042, 2024
- LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs. Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee. arXiv preprint arXiv:2409.02076, 2024
- LongGenBench: Long-context Generation Benchmark. Xiang Liu, Peijie Dong, Xuming Hu, Xiaowen Chu. Findings of the Association for Computational Linguistics: EMNLP 2024, 865--883, 2024
- Hellobench: Evaluating long text generation capabilities of large language models. Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, others. arXiv preprint arXiv:2409.16191, 2024
- A Benchmark for Long-Form Medical Question Answering. Pedram Hosseini, Jessica M Sin, Bing Ren, Bryceton G Thomas, Elnaz Nouri, Ali Farahanchi, Saeed Hassanpour. Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond, 2024
- Novelqa: A benchmark for long-range novel question answering. Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Qian Wang, Yue Zhang. arXiv preprint arXiv:2403.12766, 2024
- DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels. Zhe Xu, Jiasheng Ye, Xiangyang Liu, Tianxiang Sun, Xiaoran Liu, Qipeng Guo, Linlin Li, Qun Liu, Xuanjing Huang, Xipeng Qiu. arXiv preprint arXiv:2409.02465, 2024
- One Thousand and One Pairs: A “novel” challenge for long-context language models. Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 17048--17085, 2024
- LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, others. arXiv preprint arXiv:2412.15204, 2024
- LongIns: A Challenging Long-context Instruction-based Exam for LLMs. Shawn Gavin, Tuney Zheng, Jiaheng Liu, Quehry Que, Noah Wang, Jian Yang, Chenchen Zhang, Wenhao Huang, Wenhu Chen, Ge Zhang. arXiv preprint arXiv:2406.17588, 2024
- LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios. Xiaodong Wu, Minhao Wang, Yichen Liu, Xiaoming Shi, He Yan, Xiangju Lu, Junmin Zhu, Wei Zhang. arXiv preprint arXiv:2411.07037, 2024
- Many-shot jailbreaking. Cem Anil, Esin Durmus, Nina Rimsky, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel J Ford, others. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
- LongSafetyBench: Long-Context LLMs Struggle with Safety Issues. Mianqiu Huang, Xiaoran Liu, Shaojun Zhou, Mozhi Zhang, Chenkun Tan, Pengyu Wang, Qipeng Guo, Zhe Xu, Linyang Li, Zhikai Lei, others. arXiv preprint arXiv:2411.06899, 2024
- Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? Jonathan Roberts, Kai Han, Samuel Albanie. arXiv preprint arXiv:2411.05000, 2024
- IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark. Kawshik Manikantan, Makarand Tapaswi, Vineet Gandhi, Shubham Toshniwal. arXiv preprint arXiv:2411.07466, 2024
- Michelangelo: Long context evaluations beyond haystacks via latent structure queries. Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, others. arXiv preprint arXiv:2409.12640, 2024
- Large language models are not fair evaluators. Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui. arXiv preprint arXiv:2305.17926, 2023
- Benchmarking General Purpose In-Context Learning. Fan Wang, Chuan Lin, Yang Cao, Yu Kang. arXiv preprint arXiv:2405.17234, 2024
- Length-controlled alpacaeval: A simple way to debias automatic evaluators. Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori B Hashimoto. arXiv preprint arXiv:2404.04475, 2024
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien MR Arnold, Vincent Perot, Siddharth Dalmia, others. arXiv preprint arXiv:2406.13121, 2024
- Judging llm-as-a-judge with mt-bench and chatbot arena. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, others. Advances in Neural Information Processing Systems, 36, 46595--46623, 2023
- Time Travel in LLMs: Tracing Data Contamination in Large Language Models. Shahriar Golchin, Mihai Surdeanu. The Twelfth International Conference on Learning Representations, 2024
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models. Thibaut Thonet, Jos Rozen, Laurent Besacier. arXiv preprint arXiv:2403.20262, 2024
- LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation. Xi Ye, Fangcong Yin, Yinghui He, Joie Zhang, Howard Yen, Tianyu Gao, Greg Durrett, Danqi Chen. arXiv preprint arXiv:2501.05414, 2025
- Lost in the Middle: How Language Models Use Long Contexts. Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang. Transactions of the Association for Computational Linguistics, 11, 157--173, 2024
- Attention instruction: Amplifying attention in the middle via prompting. Meiru Zhang, Zaiqiao Meng, Nigel Collier. arXiv preprint arXiv:2406.17095, 2024
- Found in the middle: Calibrating positional attention bias improves long context utilization. Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, others. arXiv preprint arXiv:2406.16008, 2024
- Order-Independence Without Fine Tuning. Reid McIlroy-Young, Katrina Brown, Conlan Olson, Linjun Zhang, Cynthia Dwork. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
- Seed-story: Multimodal long story generation with large language model. Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen. arXiv preprint arXiv:2407.08683, 2024
- Long context rag performance of large language models. Quinn Leng, Jacob Portes, Sam Havens, Matei Zaharia, Michael Carbin. arXiv preprint arXiv:2411.03538, 2024
- When Attention Sink Emerges in Language Models: An Empirical View. Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin. arXiv preprint arXiv:2410.10781, 2024
- Same task, more tokens: the impact of input length on the reasoning performance of large language models. Mosh Levy, Alon Jacoby, Yoav Goldberg. arXiv preprint arXiv:2402.14848, 2024
- The unlocking spell on base llms: Rethinking alignment via in-context learning. Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi. The Twelfth International Conference on Learning Representations, 2024
- Why We Need New Evaluation Metrics for NLG. Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, Verena Rieser. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2241--2252, 2017
- Lxmert: Learning cross-modality encoder representations from transformers. Hao Tan, Mohit Bansal. arXiv preprint arXiv:1908.07490, 2019
- Visually grounded concept composition. Bowen Zhang, Hexiang Hu, Linlu Qiu, Peter Shaw, Fei Sha. arXiv preprint arXiv:2109.14115, 2021
- In-context learning with long-context models: An in-depth exploration. Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R Gormley, Graham Neubig. arXiv preprint arXiv:2405.00200, 2024
- Demonstrations in In-context Learning for LLMs with Large Label Space. Zhan Li, Fanghui Liu, Volkan Cevher, Grigorios Chrysos. First Workshop on Long-Context Foundation Models @ ICML 2024, 2024
- Probing the Decision Boundaries of In-context Learning in Large Language Models. Siyan Zhao, Tung Nguyen, Aditya Grover. NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024
- CoNT: Contrastive Neural Text Generation. Chenxin An, Jiangtao Feng, Kai Lv, Lingpeng Kong, Xipeng Qiu, Xuanjing Huang. Advances in Neural Information Processing Systems, 35, 2197--2210, 2022
- Many-Shot In-Context Learning. Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Luis Rosias, Stephanie CY Chan, Biao Zhang, Aleksandra Faust, Hugo Larochelle. ICML 2024 Workshop on In-Context Learning, 2024
- When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training. Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang. arXiv preprint arXiv:2411.13476, 2024
- Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation. Kaijian Zou, Muhammad Khalifa, Lu Wang. arXiv preprint arXiv:2411.07130, 2024
- Cross-Modal Consistency in Multimodal Large Language Models. Xiang Zhang, Senyu Li, Ning Shi, Bradley Hauer, Zijun Wu, Grzegorz Kondrak, Muhammad Abdul-Mageed, Laks VS Lakshmanan. arXiv preprint arXiv:2411.09273, 2024
- MM-R3: On (In-) Consistency of Multi-modal Large Language Models (MLLMs). Shih-Han Chou, Shivam Chandhok, James J Little, Leonid Sigal. arXiv preprint arXiv:2410.04778, 2024
- Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation. Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, Yanai Elazar. Findings of the Association for Computational Linguistics: ACL 2023, 12284--12314, 2023
- Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers. Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, Furu Wei. Findings of the Association for Computational Linguistics: ACL 2023, 4005--4019, 2023