Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems. (arXiv:2304.10892v1 [cs.LG])
cs.LG updates on arXiv.org
The use of machine learning (ML) inference in applications is growing rapidly. ML
inference services interact with users directly, so they must respond quickly and
accurately. Moreover, these services face dynamic request workloads, which forces
changes in their computing resources. Failing to right-size computing resources
results in either violations of latency service-level objectives (SLOs) or wasted
computing resources. Adapting to dynamic workloads while balancing all three
pillars of accuracy, latency, and resource cost is challenging. In response to
these challenges, we propose …
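The right-sizing tradeoff the abstract describes can be illustrated with a toy sketch (the queueing-style latency model and all numbers here are illustrative assumptions, not the paper's method): pick the smallest replica count whose estimated latency still meets the SLO, since fewer replicas mean lower cost but too few mean SLO violations.

```python
# Toy illustration of right-sizing under a latency SLO. The latency model
# (base latency inflated as utilization approaches 1) and the parameter
# values are illustrative assumptions, not taken from the paper.

def estimated_latency_ms(arrival_rps: float, replicas: int,
                         service_rate_rps: float = 100.0,
                         base_latency_ms: float = 20.0) -> float:
    """Crude latency estimate: base latency divided by spare capacity,
    so delay blows up as utilization approaches 1."""
    utilization = arrival_rps / (replicas * service_rate_rps)
    if utilization >= 1.0:
        return float("inf")  # overloaded: the SLO is unattainable
    return base_latency_ms / (1.0 - utilization)

def right_size(arrival_rps: float, slo_ms: float, max_replicas: int = 64) -> int:
    """Smallest replica count whose estimated latency meets the SLO.
    Fewer replicas risk SLO violations; more waste resources."""
    for replicas in range(1, max_replicas + 1):
        if estimated_latency_ms(arrival_rps, replicas) <= slo_ms:
            return replicas
    raise ValueError("SLO unattainable within max_replicas")

if __name__ == "__main__":
    for rps in (50, 200, 500):  # a dynamic workload of request rates
        print(f"{rps} rps -> {right_size(rps, slo_ms=50.0)} replicas")
```

Under this toy model, a rising request rate forces more replicas to hold the same 50 ms SLO, which is exactly the dynamic-workload adaptation problem the abstract motivates.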