AMD EPYC Processor Library Support

Important Notes:

  • Open-source libraries and Intel MKL work well on the AMD platform.
  • For Intel MKL, set the environment variable “export MKL_DEBUG_CPU_TYPE=5” in your .bashrc
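As a sketch, the variable can be appended to an rc file and verified like this (a temporary file stands in for ~/.bashrc so the example is side-effect free):

```shell
# Sketch: append the MKL override to a shell rc file, source it, verify it.
# A temp file stands in for ~/.bashrc here; point RC at ~/.bashrc in practice.
RC="$(mktemp)"
LINE='export MKL_DEBUG_CPU_TYPE=5'
grep -qxF "$LINE" "$RC" || echo "$LINE" >> "$RC"   # add only if not present
. "$RC"                                            # source it, as a login shell would
echo "MKL_DEBUG_CPU_TYPE=$MKL_DEBUG_CPU_TYPE"
rm -f "$RC"
```

Note that recent Intel MKL releases are reported to have removed this workaround, so verify the variable still has an effect on your MKL version.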

Library: BLAS
AOCL: BLIS

Library: LAPACK
AOCL: libFLAME

Library: FFTW
AOCL: AMD-optimized FFTW

Library: ScaLAPACK
AOCL: AMD ScaLAPACK

Library: Core Math Library
AOCL: AMD LibM

Library: Random Number Generator Library
AOCL: RNG Library

Library: Secure RNG Library
AOCL: Secure RNG Library


General Linux OS Tuning for AMD EPYC

Step 1: Turn off swap
Turn off swap to prevent accidental swapping. Note that disabling swap without sufficient memory can have undesired effects.

swapoff -a
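Since swapoff -a requires root, a quick read-only way to confirm the result is to count the active swap areas, as in this sketch:

```shell
# Read-only check: count active swap areas listed in /proc/swaps.
# (/proc/swaps has one header line; anything beyond it is an active area.)
if [ -r /proc/swaps ]; then
    n=$(($(wc -l < /proc/swaps) - 1))
    echo "active swap areas: $n"     # 0 after a successful 'swapoff -a'
else
    n=0
    echo "/proc/swaps not available"
fi
```

To keep swap off across reboots, the corresponding entries in /etc/fstab also need to be removed or commented out.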

Step 2: Turn off NUMA balancing
NUMA balancing can have undesired effects, and since ranks and memory can be bound explicitly in HPC, this setting is not needed.

echo 0 > /proc/sys/kernel/numa_balancing
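The echo above needs root and does not persist across reboots; a sketch that just reads back the current state looks like this:

```shell
# Read the current automatic NUMA balancing state (0 = off, 1 = on).
f=/proc/sys/kernel/numa_balancing
if [ -r "$f" ]; then
    state=$(cat "$f")
else
    state="unavailable"        # kernel built without NUMA balancing support
fi
echo "numa_balancing: $state"
```

To make the setting persistent, kernel.numa_balancing=0 can be placed in /etc/sysctl.conf and applied with sysctl -p.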

Step 3: Disable ASLR (Address Space Layout Randomization). ASLR is a security feature used to prevent the exploitation of memory vulnerabilities, but it adds run-to-run variability, so it is commonly disabled on dedicated HPC benchmarking nodes.

echo 0 > /proc/sys/kernel/randomize_va_space
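The current ASLR mode can be read back without root, as in this small sketch:

```shell
# Read the ASLR mode: 0 = disabled, 1 = conservative, 2 = full (the default).
f=/proc/sys/kernel/randomize_va_space
aslr=$(cat "$f" 2>/dev/null || echo "unavailable")
echo "randomize_va_space: $aslr"
```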

Step 4: Set the CPU governor to performance and disable cc6. Setting the CPU governor to performance ensures maximum performance at all times. Disabling cc6 ensures that deeper CPU sleep states are not entered.

cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
cpupower idle-set -d 2
Idlestate 2 disabled on CPU 0
Idlestate 2 disabled on CPU 1
Idlestate 2 disabled on CPU 2
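cpupower may not be installed on every node; the same state can be read back directly through sysfs, as in this sketch (paths assume a Linux cpufreq/cpuidle driver is loaded):

```shell
# Read back the scaling governor and idle-state status for CPU 0 via sysfs.
gov=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null || echo "unavailable")
echo "cpu0 governor: $gov"           # expect "performance" after the step above

# Idle state 2 (cc6 on EPYC): its "disable" file reads 1 once it is turned off.
d=$(cat /sys/devices/system/cpu/cpu0/cpuidle/state2/disable 2>/dev/null || echo "unavailable")
echo "cpu0 idle state2 disable flag: $d"
```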


  1. Tuning Guide for AMD EPYC (pdf)

Getting Useful Information on CPU and Configuration

Point 1: lscpu

To install

yum install util-linux

lscpu – (Prints information about the CPU and its configuration)

[user1@myheadnode1 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
Stepping: 4
CPU MHz: 3200.000
BogoMIPS: 6400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu .................

Point 2: hwloc-ls

To install hwloc-ls

yum install hwloc

hwloc-ls – (Prints useful information about the NUMA locality of devices and general hardware locality information)

[user1@myheadnode1 ~]$ hwloc-ls
Machine (544GB total)
NUMANode L#0 (P#0 256GB)
Package L#0 + L3 L#0 (25MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)

Point 3: Check whether Boost is on for AMD

Print out whether CPU boost is on or off

cat /sys/devices/system/cpu/cpufreq/boost
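A small sketch that reports the boost state, allowing for the fact that this knob is only exposed by some cpufreq drivers (e.g. acpi-cpufreq):

```shell
# Report turbo/boost state. The boost file exists with acpi-cpufreq;
# other drivers (e.g. amd-pstate, intel_pstate) expose it differently.
f=/sys/devices/system/cpu/cpufreq/boost
boost=$(cat "$f" 2>/dev/null || echo "unavailable")
case "$boost" in
    1) echo "boost: on" ;;
    0) echo "boost: off" ;;
    *) echo "boost: $boost (knob not exposed by this driver)" ;;
esac
```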


  1. Tuning Guide for AMD EPYC (pdf)

BIOS Settings for OEM Servers with EPYC

Taken from Chapter 4 of the Tuning Guide for AMD EPYC (pdf) referenced above.


Selected explanations of settings (see the document for the full explanation).

1. Simultaneous Multi-Threading (SMT) or Hyper-Threading (HT)

  • In HPC workloads, SMT is usually turned off

2. x2APIC

  • This option helps the operating system handle interrupts more efficiently in high core-count configurations. Enabling it is recommended, and it must be enabled when using more than 255 threads

3. NUMA Per Socket (NPS)

  • In many HPC applications, ranks and memory can be pinned to cores and NUMA nodes. The recommended value is NPS4. However, if the workload is not NUMA-aware or suffers as NUMA complexity increases, experiment with NPS1.

4. Memory Frequency, Infinity Fabric Frequency, and Coupled vs Uncoupled Mode

The memory clock and Infinity Fabric clock can run at synchronous frequencies (coupled mode) or at asynchronous frequencies (uncoupled mode).

  • If the memory is clocked at 2933 MT/s or lower, the memory and fabric will run in coupled mode, which has the lowest memory latency
  • If the memory is clocked at 3200 MT/s, the memory and fabric clocks will run in asynchronous (uncoupled) mode, which has higher bandwidth but increased memory latency
  • Make sure APBDIS is set to 1 and the fixed SoC P-state is set to P0

5. Preferred IO

Preferred IO allows one PCIe device in the system to be configured in a preferred mode. This device gets preferential treatment on the Infinity Fabric.

6. Determinism Slider

  • The Power option is recommended. In this mode, each CPU in the system performs at the maximum capability of its silicon. Due to natural variation in the manufacturing process, performance may vary from CPU to CPU, but it will never fall below that of Performance Determinism mode.


AMD’s EPYC™ 7002 HPC Benchmarks over Mellanox solutions

Below are links to the current HPC Performance Briefs on EPYC 7002, showcasing performance with some of the most mainstream applications used in HPC, including Gromacs, weather modeling with WRF, and CFD and FEA applications.

  1. Gromacs
  2. WRF
  3. ESI Virtual Performance Solution
  4. LS-DYNA
  5. Altair Radioss


Mellanox Spectrum Switch & ConnectX-4 25/100GbE

  1. AMD EPYC™ 7002 Series Processors Best Four Node Benchmark Result on VMmark® 3 Using VMware vSAN® – Benchmark: VMmark over VMware ESXi 6.7U3 vSAN
  2. AMD EPYC™ 7002 Series Processors Set New World Record on VMmark® 3 Virtualization Platform Benchmark – Benchmark: Virtualization running VMmark
  3. AMD EPYC™ 7002 Series Processors Achieve Best-in-Class Results on Internet-of-Things Benchmark with Four Nodes – Benchmark: TPC Express Benchmark™ IoT (TPCx-IoT™)


Mellanox NIC only

  1. TPCx-HS @ 30 TB (Hortonworks on HP DL325) – Benchmarks: TPC over Big Data using Apache Hadoop
  2. AMD EPYC™ Processor Extends Leadership with Best-in-Class Results on Industry-Standard Big Data Benchmark – TPCx-HS 10 TB scale factor (Cloudera on Dell): Cloudera on 17-node Dell R6415 cluster. – Benchmark: TPC Express HS (TPCx-HS) over Big Data using Apache Hadoop
  3. AMD EPYC™ Processor Achieves Best-in-Class Results on Industry-Standard Internet of Things Benchmark – TPCx-IoT (HBase): Cloudera HBase on 4-node Dell R6415 cluster. – Benchmark: TPC Express Benchmark IoT (TPCx-IoT)
  4. AMD EPYC Processor Achieves Best-in-Class Results on Industry-Standard Big Data Benchmark – TPCx-HS @ 1 TB scale factor: Cloudera on 17-node Dell R6415 cluster. – Benchmark: TPC Express HS (TPCx-HS) over Big Data using Apache Hadoop