Leading semiconductor manufacturer AMD's Milan-X EPYC series processors could be expected at this conference. Judging from the latest news, this series of processors will use 3D V-Cache technology (stacked cache attached with a novel hybrid bonding technique) even before the Vermeer-X consumer product line. The line is based on the Zen 3 microarchitecture. AMD and Intel are constantly pitted against each other to see who is better, but the comparison usually ends without a clear winner; both are relentlessly trying to prove that their side has the better design. AMD is also expected to launch its first design based on MCM (Multi-Chip Module), even earlier than NVIDIA's GH100 (Hopper) and Intel's Ponte Vecchio (Xe-HPC).
AMD is set to launch new HPC products on November 8 at the “Accelerated Data Center Premiere”.
AMD EPYC
Product Brief with AMD EPYC 7003
3rd Gen AMD EPYC™ processors raise the bar once more for workload performance, with up to 19% more instructions per clock (IPC). No matter the job, you can drive faster time to results, provide more and better data for decisions, and achieve better business outcomes. With our leadership approach, the world’s highest performance server CPU, AMD EPYC 7763, and AMD Infinity Architecture deliver innovation: up to 32MB of L3 cache per core, synchronized fabric and memory clock speeds designed for improved performance, plus hardware and virtual security features to help safeguard your business, right out of the box.
AMD EPYC Product Brief (Technical | In-depth details about your new AMD EPYC 7003 Series Processors (pathfactory.com))
Introducing 3rd Gen AMD Processors for the Modern Data Centre
Join CEO Dr. Lisa Su, CTO Mark Papermaster, Senior VP and GM of Datacenter and Embedded Solutions Business Group, Forrest Norrod, Senior VP and GM of Server Business Unit, Dan McNamara, and appearances by industry-leading data center strategic partners and customers in this digital launch of the 3rd Gen AMD EPYC™ Processors.
Chapters:
00:00 – Intro
01:00 – Introducing 3rd Gen AMD EPYC
07:48 – “Zen 3” Architecture for Data Center
15:24 – 3rd Gen AMD EPYC Portfolio & Performance
20:44 – HPC Performance Leadership & Exascale Computing
25:44 – Powering the Most Important Cloud Services
35:54 – Accelerating Enterprise Workloads
40:42 – AMD EPYC Solution Ecosystem
49:22 – Conclusion
Intel MPI Application Tuning for AMD EPYC
If you wish to use Intel MPI on AMD EPYC servers, you have to change your MPI options as follows:
-genv I_MPI_DEBUG=5 -genv I_MPI_PIN=1 -genv KMP_AFFINITY verbose,granularity=fine,compact
Explanation of Options:
-genv I_MPI_DEBUG=5
(Enables debug output to print transport and pinning information)
-genv I_MPI_PIN=1
(Enables rank pinning. Use in conjunction with the previous option)
-genv KMP_AFFINITY verbose,granularity=fine,compact
(Sets the OpenMP thread affinity. For more information, see Thread Affinity Interface (Linux* and Windows*))
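As a rough illustration, the options above could be combined into a single launch line. This is only a sketch: the rank count and the binary name ./my_app are placeholders that do not come from the original notes, and the helper print_mpirun_cmd is hypothetical. It prints the command instead of running it, so it can be inspected first.

```shell
#!/bin/sh
# Sketch: print an mpirun command line that carries the options described above.
# The rank count and application name are hypothetical placeholders.
print_mpirun_cmd() {
    nranks="$1"
    app="$2"
    echo "mpirun -np $nranks" \
         "-genv I_MPI_DEBUG=5" \
         "-genv I_MPI_PIN=1" \
         "-genv KMP_AFFINITY verbose,granularity=fine,compact" \
         "$app"
}

print_mpirun_cmd 4 ./my_app
```

With I_MPI_DEBUG=5 set, the start of the job output should list where each rank was pinned, which is a quick way to confirm that the pinning options took effect.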
AMD EPYC Processor Library Support
Important Notes:
- Open-source libraries and Intel MKL work well on the AMD platform.
- For Intel MKL, please set the environment variable “export MKL_DEBUG_CPU_TYPE=5” in your .bashrc
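One possible way to add the setting without duplicating it on repeated runs is sketched below. The helper name add_mkl_cpu_type is hypothetical. Note that the assignment must be written without spaces around the equals sign, as the shell requires, and that this variable is undocumented, so newer MKL releases may ignore it.

```shell
#!/bin/sh
# Sketch: add the MKL setting to a shell startup file only if it is not there yet.
# The helper name add_mkl_cpu_type is hypothetical.
add_mkl_cpu_type() {
    rc_file="$1"
    line='export MKL_DEBUG_CPU_TYPE=5'
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF "$line" "$rc_file" 2>/dev/null || echo "$line" >> "$rc_file"
}

# Typical call (left commented out so the sketch does not touch your files):
# add_mkl_cpu_type "$HOME/.bashrc"
```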
Library: BLAS
AOCL: BLIS
URLs: https://developer.amd.com/amd-aocl/blas-library/
Library: LAPACK
AOCL: libFLAME
URLs: https://developer.amd.com/amd-aocl/blas-library/#libflame
Library: FFTW
AOCL: FFTW
URLs: https://developer.amd.com/amd-aocl/fftw/
Library: ScaLAPACK
AOCL: ScaLAPACK
URLs: https://github.com/amd/scalapack
Library: Core Math library
AOCL: LibM
URLs: https://developer.amd.com/amd-aocl/amd-math-library-libm/
Library: Random number generator library
AOCL: RNG Library
URLs: https://developer.amd.com/amd-aocl/rng-library/
Library: Secure RNG Library
AOCL: Secure RNG library
URLs: https://developer.amd.com/amd-aocl/rng-library/#securerng
ISC 2020 Exhibitor Forum: Advanced Micro Devices (AMD)
AMD is raising the bar on compute densities by enabling optimized server designs with high performance, low latency, and excellent efficiency in an open, flexible environment.
General Linux OS Tuning for AMD EPYC
Step 1: Turn off swap
Turn off swap to prevent accidental swapping. Do note that disabling swap without sufficient memory can have undesired effects.
swapoff -a
Step 2: Turn off NUMA balancing
NUMA balancing can have undesired effects, and since it is possible to bind the ranks and memory in HPC, this setting is not needed.
echo 0 > /proc/sys/kernel/numa_balancing
Step 3: Disable ASLR
ASLR (Address Space Layout Randomization) is a security feature used to prevent the exploitation of memory vulnerabilities.
echo 0 > /proc/sys/kernel/randomize_va_space
Step 4: Set the CPU governor to performance and disable cc6
Setting the CPU governor to performance ensures maximum performance at all times. Disabling cc6 ensures that deeper CPU sleep states are not entered.
cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
.....
.....
cpupower idle-set -d 2
Idlestate 2 disabled on CPU 0
Idlestate 2 disabled on CPU 1
Idlestate 2 disabled on CPU 2
.....
.....
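The four steps can be collected into one small script, sketched below. The DRY_RUN switch and the helper names are assumptions of this sketch and not part of the original steps. By default it only prints what it would do; applying the settings for real requires root.

```shell
#!/bin/sh
# Sketch: collect the four tuning steps in one place.
# DRY_RUN defaults to 1, so the commands are only printed; set DRY_RUN=0 and
# run as root to apply them. The DRY_RUN switch is an assumption of this sketch.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        sh -c "$*"
    fi
}

apply_epyc_tuning() {
    run "swapoff -a"                                    # Step 1: turn off swap
    run "echo 0 > /proc/sys/kernel/numa_balancing"      # Step 2: NUMA balancing off
    run "echo 0 > /proc/sys/kernel/randomize_va_space"  # Step 3: ASLR off
    run "cpupower frequency-set -g performance"         # Step 4: performance governor
    run "cpupower idle-set -d 2"                        # Step 4: disable idle state 2
}

apply_epyc_tuning
```

None of these settings survive a reboot, so a script like this would have to be run again at boot (or from the job prolog) if the tuning is meant to be permanent.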
Getting Useful Information on CPU and Configuration
Point 1: lscpu
To install
yum install util-linux
lscpu – (Print out information about CPU and its configuration)
[user1@myheadnode1 ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
Stepping:              4
CPU MHz:               3200.000
BogoMIPS:              6400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
Flags:                 fpu .................
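A common follow-up is to derive the number of physical cores from this output, for example to decide how many MPI ranks to start per node. The sketch below is one way to do it; the helper name physical_cores is hypothetical, and it reads lscpu-style text from standard input so it works on live output or on a saved copy.

```shell
#!/bin/sh
# Sketch: derive the physical core count from lscpu-style "Key: value" text.
# Reads from stdin so it can be fed either live `lscpu` output or a saved copy.
physical_cores() {
    awk -F: '
        /^Core\(s\) per socket/ { cores = $2 + 0 }
        /^Socket\(s\)/          { sockets = $2 + 0 }
        END { print cores * sockets }
    '
}

# Example with the values from the listing above (8 cores per socket, 2 sockets).
printf 'Core(s) per socket: 8\nSocket(s): 2\n' | physical_cores   # prints 16
```

On a real node the call would simply be `lscpu | physical_cores`.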
Point 2: hwloc-ls
To install hwloc-ls
yum install hwloc
hwloc-ls – (Prints out useful information about the NUMA locality of devices and general hardware locality information)
[user1@myheadnode1 ~]# hwloc-ls
Machine (544GB total)
  NUMANode L#0 (P#0 256GB)
    Package L#0 + L3 L#0 (25MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
.....
.....
.....
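Because the full hwloc-ls listing gets long on machines with many cores, it can help to count objects instead of reading them. The sketch below counts how often an object type such as PU or Core appears in hwloc-ls text; the helper name count_objects is hypothetical.

```shell
#!/bin/sh
# Sketch: count how often an object type (e.g. PU or Core) appears in hwloc-ls text.
# The helper name count_objects is hypothetical.
count_objects() {
    # gsub returns the number of matches it replaced on each input line.
    awk -v pat="$1 L#" '{ n += gsub(pat, "") } END { print n + 0 }'
}

# With hwloc installed, this would print the number of hardware threads (PUs):
# hwloc-ls | count_objects PU
printf 'Core L#0\n  PU L#0 (P#0)\n  PU L#1 (P#16)\n' | count_objects PU   # prints 2
```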
Point 3: Check whether Boost is on for AMD
Print out if CPU boost is on or off
cat /sys/devices/system/cpu/cpufreq/boost
1
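For use in a script, the raw value can be turned into a readable message, as in the sketch below. The helper name describe_boost is hypothetical, and the file may be absent on systems that use a different cpufreq driver, which the sketch reports as unknown.

```shell
#!/bin/sh
# Sketch: turn the raw 0/1 value of the boost file into a readable message.
# The path is the one used above; it may be absent with other cpufreq drivers.
describe_boost() {
    case "$1" in
        1) echo "CPU boost is on" ;;
        0) echo "CPU boost is off" ;;
        *) echo "CPU boost state unknown" ;;
    esac
}

boost_file=/sys/devices/system/cpu/cpufreq/boost
if [ -r "$boost_file" ]; then
    describe_boost "$(cat "$boost_file")"
else
    describe_boost ""
fi
```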
BIOS settings for OEM Servers with EPYC
Taken from Chapter 4 of https://developer.amd.com/wp-content/resources/56827-1-0.pdf
Selected explanations of settings (see the document for the full explanation)
1. Simultaneous Multi-Threading (SMT) or HyperThreading (HT)
- In HPC workloads, SMT is usually turned off.
2. x2APIC
- This option helps the operating system deal with interrupts more efficiently in high core count configurations. It is recommended to enable this option. This option must be enabled if using more than 255 threads.
3. NUMA Per Socket (NPS)
- In many HPC applications, ranks and memory can be pinned to cores and NUMA nodes. The recommended value is NPS4. However, if the workload is not NUMA-aware or suffers when the NUMA complexity increases, we can experiment with NPS1.
4. Memory Frequency, Infinity Fabric Frequency, and coupled vs. uncoupled mode
The memory clock and the Infinity Fabric clock can run at synchronous frequencies (coupled mode) or at asynchronous frequencies (uncoupled mode).
- If the memory is clocked at 2933 MT/s or lower, the memory and fabric will run in coupled mode, which has the lowest memory latency.
- If the memory is clocked at 3200 MT/s, the memory and fabric clocks will run in asynchronous (uncoupled) mode, which has higher bandwidth but increased memory latency.
- Make sure APBDIS is set to 1 and the fixed SOC Pstate is set to P0.
5. Preferred IO
Preferred IO allows one PCIe device in the system to be configured in a preferred mode. This device gets preferential treatment on the Infinity Fabric.
6. Determinism Slider
- It is recommended to choose the Power option. In this mode, the CPUs in the system perform at the maximum capability of each silicon device. Due to the natural variation in the manufacturing process, performance may vary between CPUs, but it will never fall below that of the “Performance Determinism mode”.
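Two of the settings above can be sanity-checked from the running OS without rebooting into the BIOS. The sketch below reads lscpu-style text and reports the NUMA nodes per socket and whether the thread count crosses the 255-thread limit mentioned for x2APIC. The helper name bios_hints is hypothetical, the sample values are made up for illustration, and the sketch assumes that NUMA nodes divided by sockets reflects the NPS setting, which may not hold if other BIOS options also change the NUMA layout.

```shell
#!/bin/sh
# Sketch: sanity-check two of the BIOS settings above from lscpu-style text.
# Assumption: NUMA nodes divided by sockets reflects the NPS setting.
bios_hints() {
    awk -F: '
        /^CPU\(s\)/       { cpus = $2 + 0 }
        /^Socket\(s\)/    { sockets = $2 + 0 }
        /^NUMA node\(s\)/ { nodes = $2 + 0 }
        END {
            if (sockets > 0) {
                print "NUMA nodes per socket: " nodes / sockets
            }
            if (cpus > 255) {
                print "x2APIC required: more than 255 threads"
            } else {
                print "x2APIC recommended"
            }
        }
    '
}

# Hypothetical dual-socket node with 256 threads and 8 NUMA nodes.
printf 'CPU(s): 256\nSocket(s): 2\nNUMA node(s): 8\n' | bios_hints
```

On a real node the call would be `lscpu | bios_hints`.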