Inference-time improvements
We have covered a number of important techniques for making the training workflow more efficient. However, a major part of an LLM's lifecycle is inference, i.e., the actual use of these models for different real-world use cases. Because of their immense size, the infrastructure required to serve them is large and expensive. The following techniques prove quite beneficial for reducing this footprint and the associated operational costs:
- Offloading is a smart way of distributing compute and storage responsibilities across hardware devices. The most widely used techniques move parts of the model (layers or blocks) to CPU memory or NVMe storage when they are not actively in use. This reduces GPU memory usage and allows larger models to fit within limited resources. Microsoft's DeepSpeed and Hugging Face's Accelerate are two popular libraries that provide interfaces for such capabilities, as the sketch below illustrates.
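
As a minimal sketch of this idea, the snippet below uses the big-model inference support that Accelerate exposes through the `transformers` loading API: `device_map="auto"` fills the GPU first, then spills remaining layers to CPU RAM and finally to the `offload_folder` on disk. The checkpoint name and prompt are illustrative, not prescribed by any particular library.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM on the Hugging Face Hub works here.
model_name = "facebook/opt-6.7b"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" lets Accelerate place layers on the GPU first, spill the
# remainder to CPU RAM, and offload anything left over to disk.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    offload_folder="offload",   # directory for weights that exceed CPU RAM
    torch_dtype=torch.float16,  # half precision further shrinks the footprint
)

# Offloaded layers are paged in transparently during the forward pass.
inputs = tokenizer("Offloading lets large models run on", return_tensors="pt")
inputs = inputs.to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The trade-off is speed: every layer paged in from CPU or disk adds transfer latency, so offloading is best suited to fitting a model at all on constrained hardware rather than to latency-sensitive serving.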