AArch64.cloud


Benchmark tests for ARM-based AWS Graviton, Google Axion and Azure Cobalt CPUs

It started about a month ago, when Microsoft Azure introduced its latest ARM-based processor, the Cobalt 100. After AWS and Google Cloud, Microsoft is now also competing in this new arena of efficient, purpose-built ARM chips. So I wanted to run a benchmark test on this custom silicon.

Arm-based AWS Graviton family by Annapurna Labs.

ARM chips from hyperscale cloud providers:

Maker            Generation   Microarchitecture   vCPU   Year
AWS              Graviton2    Neoverse N1         64     2020
Google           Axion        Neoverse V2         72     2024
Microsoft Azure  Cobalt 100   Neoverse N2         96     2024

Major cloud providers have had their own purpose-built servers and data centres for quite some time, but the performance competition among hyperscale providers has recently opened a new front: every hyperscale cloud provider now offers custom-made silicon. These CPUs are power-efficient and purpose-built for modern scale-out web workloads, and are claimed to be 60% more efficient than their x86 counterparts. So I was curious to learn how these custom Arm-based CPUs perform under different workloads. This benchmark test is designed specifically for the Arm-based custom CPUs offered by the major cloud providers. The tests were run on cloud instances in a similar price range ($0.18-$0.20 per hour), irrespective of their vCPU or core count.
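Because the instances are matched on price rather than on core count, one way to compare them is to normalise a benchmark score by the hourly price. The snippet below is only a sketch of that arithmetic: the function name, the instance labels and the numbers are placeholders I made up for illustration, not measured results.

```python
def score_per_dollar(score_per_second: float, price_per_hour: float) -> float:
    """Work done per dollar of instance time.

    score_per_second can be any rate-style result, e.g. events/s or tokens/s.
    """
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    # One hour has 3600 seconds, so score/s * 3600 is the work done per hour.
    return score_per_second * 3600 / price_per_hour


# Placeholder figures for illustration only, NOT benchmark results.
examples = {
    "instance-a": {"score_per_second": 10.0, "price_per_hour": 0.20},
    "instance-b": {"score_per_second": 9.0, "price_per_hour": 0.18},
}

for name, row in examples.items():
    value = score_per_dollar(row["score_per_second"], row["price_per_hour"])
    print(f"{name}: {value:,.0f} units of work per dollar")
```

With prices this close together the normalisation changes little, which is the point of picking instances from the same price band in the first place.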

Azure Cobalt 100 and Google Axion - image: Microsoft & Google

Sysbench tests for ARM-based CPUs: CPU, memory & I/O performance

Now let’s compare these results with the latest x86 CPUs from Intel and AMD:

Chip             Instance         vCPU/RAM (GB)
Amazon Graviton  c6g.2xlarge      8/16
Google Axion     c4a-highcpu-8    8/16
Azure Cobalt     D4ps v6          4/16
Ampere Altra     t2a-standard-4   4/16
Emerald Rapids   n4-standard-4    4/16
AMD Genoa        t2a-standard-4   4/16

Sysbench tests for ARM-based & x86 CPUs: CPU performance

Arm-based cloud servers are now available on AWS, Google Cloud and Azure. However, the first company to provide production-grade Arm-based cloud servers was none of these hyperscale providers. If my memory serves me right, back in 2018 a European cloud provider, Scaleway, offered Arm-based cloud instances. I was already using some of their other products, so I was curious about these new offerings. What I found back then was that the software ecosystem for Arm-based cloud servers was not mature at all: many web application packages were almost impossible to find for the aarch64 platform. So my venture into those earlier Arm-based cloud servers was not fruitful. This time, I have arranged a benchmark test for Google Axion, Microsoft Azure Cobalt and AWS Graviton series processors.

I have run two separate benchmark tests:

  • Sysbench tests for CPU, memory and I/O
  • LLM inference (tokens/second) benchmark test for CPU

The sysbench tests were run using sysbench, while for the tokens/second benchmark I used my own LLM inference benchmark library.
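For readers who want to repeat the CPU part, here is a rough sketch of how a sysbench CPU run could be scripted and its headline number extracted. It assumes sysbench 1.0.x, which prints an "events per second:" line for the CPU test. The thread count and prime limit are illustrative values of mine, not necessarily the settings used for the charts, and the helper names are hypothetical.

```python
import re
import subprocess

# sysbench 1.0.x CPU test; --threads and --cpu-max-prime values are
# illustrative assumptions, not the settings used in this article.
SYSBENCH_CPU_CMD = ["sysbench", "cpu", "--threads=4", "--cpu-max-prime=20000", "run"]

# Matches the summary line that sysbench 1.0.x prints for the CPU test.
_EVENTS_RE = re.compile(r"events per second:\s*([0-9.]+)")


def parse_events_per_second(output: str) -> float:
    """Extract the 'events per second' figure from sysbench CPU output."""
    match = _EVENTS_RE.search(output)
    if match is None:
        raise ValueError("no 'events per second' line found in sysbench output")
    return float(match.group(1))


def run_sysbench_cpu(cmd=SYSBENCH_CPU_CMD) -> float:
    """Run sysbench (must be installed and on PATH) and return events per second."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_events_per_second(result.stdout)
```

Calling run_sysbench_cpu() on each instance with the thread count set to that instance's vCPU count gives one comparable number per machine. The memory and I/O tests print different summary lines and would need their own parsers.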

Token/second benchmark for ARM-based AWS Graviton, Google Axion, Azure Cobalt 100 and Ampere Altra CPUs using the Llama3.2:1b model
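To make the tokens/second metric concrete, here is a minimal sketch of how it can be computed. It assumes the model is served by a local Ollama instance, whose /api/generate response reports eval_count (generated tokens) and eval_duration (nanoseconds). That setup is my assumption for this example, and the benchmark library in the repository may well work differently. The function names and the default URL are illustrative.

```python
import json
import urllib.request

# Assumed default endpoint of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from a token count and a duration in nanoseconds."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def benchmark_once(model: str, prompt: str, url: str = OLLAMA_URL) -> float:
    """Send one non-streaming generate request and return tokens/second."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # eval_count and eval_duration cover token generation only, not prompt processing.
    return tokens_per_second(body["eval_count"], body["eval_duration"])
```

For example, benchmark_once("llama3.2:1b", "Why is the sky blue?") would return a single sample. Averaging several runs per instance smooths out noise from neighbouring tenants.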

These tests were run on different ARM-based CPUs developed by AWS, GCP and Azure. The Ampere Altra CPUs, by contrast, are currently available on multiple cloud providers such as OCI, Scaleway and Hetzner. The later tests compare these benchmarks with the latest-generation x86 CPUs from Intel and AMD in a similar hourly price range. Both the latest Intel Emerald Rapids and AMD Genoa competed closely with their ARM-based counterparts on the LLM inference benchmark. Google Axion had a slight edge over AWS Graviton on both the llama3.1:8b and llama3.2:1b tokens/second benchmarks.

Token/second benchmark test for ARM-based and x86 CPUs using the llama3.1:8b model

These tests were performed mostly on instances in a similar price range. That is, they do not represent the full capability of each CPU. My account quotas (i.e. vCPUs per region) on the different cloud providers were a bottleneck for these tests. Although the tests are meant to be representative of the respective CPUs, running them on bare-metal machines would be more reliable.

The code repository for this benchmark test is here:
github.com/ikthyandr/LLM-inference-as-a-CPU-Benchmark.