Target venues: systems conferences (OSDI/SOSP/ATC/EuroSys/ASPLOS), networking conferences (NSDI/SIGCOMM), and mobile computing conferences (MobiCom/MobiSys/SenSys/UbiComp).
Surveys
- A Survey of Multi-Tenant Deep Learning Inference on GPU https://arxiv.org/pdf/2203.09040.pdf
- Full Stack Optimization of Transformer Inference: a Survey https://arxiv.org/abs/2302.14017
Edge-based Acceleration
- [SenSys’23] Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU | Chinese University of Hong Kong, City University of Hong Kong
- [SenSys’23 Best Paper Runner-up Award] nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms | Beijing University of Posts and Telecommunications
Cloud-based Acceleration