June 17, 2024, 4:45 a.m. | Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra

cs.LG updates on arXiv.org

arXiv:2311.12359v2 Announce Type: replace-cross
Abstract: Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits, and their comparison against integers in terms of accuracy and hardware cost, remain unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the …
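For context, the sketch below shows what quantizing weights to a generic minifloat grid looks like. It assumes an IEEE-style layout (sign bit, a few exponent bits, a few mantissa bits, saturation instead of infinities) and is only an illustration of the idea; it is not the paper's exact formats, rounding scheme, or toolflow.

```python
import numpy as np

def quantize_minifloat(x, exp_bits=3, man_bits=2, exp_bias=None):
    """Round an array onto a generic minifloat grid (illustrative sketch only).

    A normal minifloat value is sign * 2**e * (1 + m / 2**man_bits), with
    `exp_bits` exponent bits and `man_bits` mantissa bits. Special encodings
    (inf/NaN) are ignored; out-of-range magnitudes simply saturate.
    """
    if exp_bias is None:
        exp_bias = 2 ** (exp_bits - 1) - 1      # IEEE-style exponent bias

    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)

    max_exp = 2 ** exp_bits - 1 - exp_bias      # largest unbiased exponent
    min_exp = 1 - exp_bias                      # smallest normal exponent
    max_val = 2.0 ** max_exp * (2.0 - 2.0 ** -man_bits)

    # Exponent of each input, clamped to the representable range; values
    # below the smallest normal land on the subnormal grid at min_exp.
    e = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    e = np.clip(e, min_exp, max_exp)

    # Round the mantissa onto 2**man_bits steps per binade, then saturate.
    step = 2.0 ** (e - man_bits)
    q = np.minimum(np.round(mag / step) * step, max_val)
    return sign * q

# Example: a 6-bit E3M2 minifloat (1 sign, 3 exponent, 2 mantissa bits)
w = np.array([0.07, -0.5, 1.3, 9.0])
print(quantize_minifloat(w, exp_bits=3, man_bits=2))
```

Shrinking `exp_bits`/`man_bits` trades dynamic range and resolution for narrower datapaths, which is the accuracy-versus-hardware-cost question the abstract raises for sub-8-bit formats on FPGAs.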
