We have benchmarked how many samples per second different models can process on different hardware devices. Below are some results for BERT-Squad, MobileNetV2, ResNet50, SuperResolution, YOLOv4 and FastNeuralStyleTransfer.
We always used a batch size of 1, which is the relevant setting for real-time request-response applications.
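As a rough illustration of how such a throughput number can be measured, here is a minimal Python sketch using ONNX Runtime. The model path, input shape (typical for an image classifier like ResNet50) and iteration counts are placeholder assumptions for illustration, not our exact harness.

```python
import time

import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder; point this at the model you want to measure.
sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name

# Batch size 1, matching the request-response scenario above.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Warm up so one-time initialization cost is not counted.
for _ in range(10):
    sess.run(None, {input_name: x})

# Time repeated single-sample runs and report samples per second.
n = 100
start = time.perf_counter()
for _ in range(n):
    sess.run(None, {input_name: x})
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.1f} samples/sec")
```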
Benchmarks for c5a.4xlarge, an AWS EC2 CPU compute instance. Higher is better.
Benchmarks for g4dn.xlarge, an AWS EC2 GPU instance.
Some model conversions failed, which is why some backend results are missing.
Comparison of the following similarly priced AWS EC2 instances in the us-east-1 region.
Instance Type | Device | Cost (USD/hour) |
---|---|---|
c5n.2xlarge | CPU | $0.432 |
g4dn.xlarge | GPU | $0.526 |
c6g.4xlarge | ARM | $0.544 |
c5a.4xlarge | CPU | $0.616 |
Only the performance of the best backend for each instance is shown.
A more expensive instance does not always deliver higher throughput. Also notice that the same model on the same device type (CPU) is sometimes faster with one backend and sometimes with another.
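If you want to verify this on your own hardware, a quick way is to time the same network under two backends. Below is a hedged sketch comparing eager PyTorch with ONNX Runtime on CPU; the choice of torchvision's ResNet50 and the iteration counts are illustrative assumptions.

```python
import time

import numpy as np
import onnxruntime as ort
import torch
import torchvision

def bench(fn, warmup=10, iters=100):
    """Return samples/sec for a zero-argument inference callable."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return iters / (time.perf_counter() - start)

model = torchvision.models.resnet50(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

# Backend 1: eager PyTorch on CPU.
with torch.no_grad():
    torch_sps = bench(lambda: model(x))

# Backend 2: the same network exported to ONNX, run with ONNX Runtime.
torch.onnx.export(model, x, "resnet50.onnx")
sess = ort.InferenceSession("resnet50.onnx")
inp = sess.get_inputs()[0].name
x_np = x.numpy()
onnx_sps = bench(lambda: sess.run(None, {inp: x_np}))

print(f"PyTorch: {torch_sps:.1f} samples/sec, "
      f"ONNX Runtime: {onnx_sps:.1f} samples/sec")
```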
If your application requires a large number of requests per second, the GPU instance appears to be the cheapest option. If your demands are lower, a cheaper compute instance, or a group of them, might be a better choice. See the graph below: the red dashed lines represent two applications with different requirements, and the y-axis is on a log scale.
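To make that trade-off concrete, a small back-of-the-envelope helper (our own illustration, not part of DNN-Bench) can translate a required request rate and a measured per-instance throughput into a fleet size and hourly cost. The prices come from the table above; the throughput numbers are hypothetical placeholders to be replaced with your own measurements.

```python
import math

def fleet_cost(required_rps, throughput_rps, price_per_hour):
    """How many instances are needed to sustain required_rps, and the hourly cost."""
    count = math.ceil(required_rps / throughput_rps)
    return count, count * price_per_hour

# (samples/sec, USD/hour) -- throughputs below are hypothetical placeholders.
candidates = {
    "c5n.2xlarge": (30.0, 0.432),
    "g4dn.xlarge": (200.0, 0.526),
}

for name, (sps, price) in candidates.items():
    n, cost = fleet_cost(required_rps=500, throughput_rps=sps, price_per_hour=price)
    print(f"{name}: {n} instances, ${cost:.2f}/hour")
```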
In any case, running DNN-Bench before deploying and identifying the best inference backend for your model can save you significant cost and increase your model's throughput.