Inference accelerator
GPUs vs. AWS Inferentia vs. Amazon Elastic Inference (EI)
Deep learning inference acceleration landscape
- CPUs gained support for Advanced Vector Extensions (AVX-512)
- GPUs gained new capabilities such as reduced-precision arithmetic (FP16 and INT8), further accelerating inference
- AWS Inferentia, a custom-designed ASIC
Considerations
- Model type and programmability: model size, custom operators, supported frameworks
- AWS Inferentia - fixed set of supported operations exposed via the AWS Neuron SDK compiler (see the operator-support sketch after this list)
- CPU - fully programmable
- GPU - in between
- if you have custom code written in high-level languages, you may have to fall back to CPU:
- NVIDIA GPUs: option to reimplement custom code using CUDA.
- this is hard to do with an ASIC
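A minimal sketch of how that operator check might look, assuming the torch-neuron package from the AWS Neuron SDK and its analyze_model helper (the torchvision model is just an illustrative stand-in):

```python
# Sketch: check which of a model's operators the Neuron compiler supports
# before committing to Inferentia (assumes torch-neuron is installed).
import torch
import torch.neuron
from torchvision import models

model = models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# analyze_model reports supported vs. unsupported operators; unsupported
# ones would be partitioned to run on the host CPU instead.
torch.neuron.analyze_model(model, example_inputs=[example])
```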
- Target throughput, latency and cost:
- GPUs are throughput processors
- If latency is not critical, GPU utilization can be kept high and cost low - great for batch processing & offline inference
- if you are unable to keep GPU utilization near its maximum at all times (e.g. due to sporadic inference requests), the cost per inference request goes up
- better to use Amazon Elastic Inference, which lets you attach just enough GPU acceleration at a lower cost.
- CPUs offer far less parallelism, but may be better for real-time inference of smaller models as long as latency is acceptable
- Inferentia could be the most cost-effective and performant option if the model is fully supported
- Compiler and runtime toolchain and ease of use
- GPU:
- running inference directly from the deep learning framework works, but is not fully optimized
- use a dedicated inference compiler such as NVIDIA TensorRT for up to 10x speedup (see the sketch below)
- Quantization: reduce precision - from FP32 -> FP16 or INT8
- Graph fusion: fusing multiple layers/ops into a single function call
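A minimal sketch of that TensorRT flow, assuming TensorRT's Python API and an exported ONNX file (model.onnx is a placeholder; exact builder calls vary a bit across TensorRT versions):

```python
# Sketch: build a reduced-precision TensorRT engine from an ONNX export.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:          # placeholder ONNX export
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # quantization: FP32 -> FP16
# Graph fusions (e.g. conv + bias + activation) are applied automatically
# while TensorRT builds and serializes the optimized engine.
engine_bytes = builder.build_serialized_network(network, config)

with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```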
AWS Inferentia
- precisions:
- the Neuron compiler automatically casts FP32 models to BF16
- or you can provide the model in FP16
- increase performance by:
- batching: amortizes the cost of loading each layer's weights from external memory across multiple inputs
- pipelining: load weights of different subgraphs on different NeuronCores
- both require setting options during compilation
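A minimal sketch of that compilation step with torch-neuron (the tiny model, batch size of 8, and the --neuroncore-pipeline-cores value are illustrative; check the Neuron compiler docs for current options):

```python
# Sketch: compile for Inferentia with batching and pipelining options.
import torch
import torch.neuron

model = torch.nn.Sequential(                 # illustrative stand-in model
    torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
).eval()

# Batching: trace with a batched example input so weight loads from
# external memory are amortized across several requests.
example = torch.rand(8, 3, 224, 224)

# Pipelining: ask the compiler to shard the graph across 4 NeuronCores.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=["--neuroncore-pipeline-cores", "4"],
)
model_neuron.save("model_neuron.pt")
```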
- Using all NeuronCores on your Inf1 instances:
- smallest instance type: inf1.xlarge automatically performs data-parallel execution on all 4 NeuronCores (= replicating your model 4 times, loading a copy into each NeuronCore, and running 4 Python threads to feed input data to each core)
- larger instances: you must spawn multiple threads yourself, e.g. using Python thread pools (see the sketch below)
- divide NeuronCores to run different models
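A minimal sketch of the multi-threaded pattern described above, assuming a model compiled and saved as in the previous sketch (core placement is handled by the Neuron runtime and depends on its configuration, e.g. environment variables such as NEURON_RT_NUM_CORES):

```python
# Sketch: replicate the compiled model once per NeuronCore and feed each
# replica from its own Python thread to keep all cores busy.
from concurrent.futures import ThreadPoolExecutor
import torch
import torch.neuron

NUM_CORES = 4  # one Inferentia chip = 4 NeuronCores

# Each load places a replica of the model onto an available NeuronCore.
replicas = [torch.jit.load("model_neuron.pt") for _ in range(NUM_CORES)]
batches = [torch.rand(8, 3, 224, 224) for _ in range(64)]

def infer(i, batch):
    return replicas[i % NUM_CORES](batch)

with ThreadPoolExecutor(max_workers=NUM_CORES) as pool:
    outputs = list(pool.map(infer, range(len(batches)), batches))
```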
Amazon Elastic Inference
Attached over the network using AWS PrivateLink
Why choose Amazon EI over dedicated GPU instances?
- if you don't have sufficient demand (or multiple models to serve and share the GPU) to keep utilization up, you can attach "just enough" GPU acceleration to a CPU instance
- The cost of the CPU instance + EI accelerator would still be cheaper than a dedicated GPU instance
- EI adds some latency compared to a dedicated GPU instance, but can still be faster than CPU-only inference
- you need to use an EI-enabled build of a supported framework such as TensorFlow, PyTorch or Apache MXNet
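A minimal sketch of attaching EI via the SageMaker Python SDK (the bucket, IAM role, framework version, and instance/accelerator sizes are placeholders):

```python
# Sketch: CPU instance + "just enough" GPU acceleration via Elastic Inference.
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(
    model_data="s3://my-bucket/model.tar.gz",      # placeholder model artifact
    role="my-sagemaker-execution-role",            # placeholder IAM role
    framework_version="2.3",                       # must be an EI-enabled version
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.large",        # cheap CPU host
    accelerator_type="ml.eia2.medium",  # EI accelerator attached over PrivateLink
)
```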
Choosing the right GPU for deep learning on AWS
- G4 instance: NVIDIA T4 GPUs
- go-to for inference
- FP64, FP32, FP16, Tensor Cores (mixed-precision), and INT8 precision types
- 16 GB of GPU memory
- P3 instances (NVIDIA V100 GPUs): if you need more throughput or more memory per GPU