http://arxiv.org/abs/2110.08633

Jan. 26, 2022 | Kabir Nagrecha, Arun Kumar

cs.LG updates on arXiv.org

Training deep learning (DL) models that do not fit into the memory of a
single GPU is a vexing process, forcing users to procure multiple GPUs and adopt
model-parallel execution. Unfortunately, sequential dependencies in neural
architectures often block efficient multi-device training, leading to
suboptimal performance. We present 'model spilling', a technique for
models such as Transformers and CNNs that moves groups of layers, or shards,
between DRAM and GPU memory, thus enabling arbitrarily large models to be
trained even …
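
The idea of moving shards between DRAM and GPU memory can be illustrated with a toy simulation. The sketch below is not the authors' implementation: it assumes a simple greedy grouping of consecutive layers under a fixed GPU memory budget and a schedule that keeps one shard resident at a time. The names `Layer`, `make_shards`, `spill_schedule`, and the sizes used are hypothetical, and no real GPU transfers take place.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    size_mb: int  # hypothetical memory footprint, for illustration only


def make_shards(layers, gpu_budget_mb):
    """Greedily group consecutive layers into shards that each fit the budget."""
    shards, current, used = [], [], 0
    for layer in layers:
        if layer.size_mb > gpu_budget_mb:
            raise ValueError(f"{layer.name} alone exceeds the GPU budget")
        if used + layer.size_mb > gpu_budget_mb:
            # Current shard is full: close it and start a new one.
            shards.append(current)
            current, used = [], 0
        current.append(layer)
        used += layer.size_mb
    if current:
        shards.append(current)
    return shards


def spill_schedule(shards):
    """Return the steps of one forward pass with a single shard on the GPU at a time."""
    steps = []
    for i in range(len(shards)):
        steps.append(("load_to_gpu", i))    # DRAM -> GPU memory
        steps.append(("run_forward", i))    # compute on the resident shard
        steps.append(("spill_to_dram", i))  # GPU memory -> DRAM
    return steps


# Six equally sized blocks and a budget that fits three of them at once.
layers = [Layer(f"block{i}", 300) for i in range(6)]
shards = make_shards(layers, gpu_budget_mb=1000)
schedule = spill_schedule(shards)
```

Because layers run sequentially, only the shard currently executing needs to occupy GPU memory in this sketch; a real system would also have to handle activations, gradients, and the backward pass, which this toy example leaves out.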

