Webinar: High Performance GPU Acceleration – Part 1: Code Design

  • Online Registration Here
  • Date: 13th October 2021 9am PDF

Heterogeneous computing comes with the challenge of designing code that can work in multi-processor/accelerator environments. Developers need to be equipped with the right set of metrics to make informed design and optimization decisions that take advantage of target hardware.

In Part 1 of this 2-part webinar series, Technical Consulting Engineer Cory Levels focuses on designing software for efficient offload from CPUs to GPUS—even before final hardware is available—using Intel® Advisor. Using a walkthrough of an ISO 3DFD example (3D isotropic Finite Difference), you will learn how to:

  • Optimize your CPU application for memory and compute
  • Identify efficient GPU offload opportunities and quantify the potential performance speed up
  • See performance headroom of your GPU offloaded code against hardware limitations, and get insights for an effective optimization roadmap

For More information, do take a look at the Intel Site Here.

Intel unveil Second-Generation Neuromorphic Chip

Various processors and pieces of code are often compared to brains, but neuromorphic chips work to much more directly mimic neurological systems through the use of computational “neurons” that communicate with one another. Intel’s first-generation Loihi chip, introduced in 2017, has around 128,000 of those digital neurons. Over the ensuing four years, Loihi has been packed into increasingly large systemslearned to touch and even been taught to smell.

Now, it’s getting a new family member: Loihi 2. In its press release, Intel said that years of testing with the first-generation Loihi chip helped them to design a second generation with up to ten times the processing speed; up to 15 times greater resource density; and up to a million computational neurons per chip – more than seven times those in the first generation. Intel reports that early tests have shown that Loihi 2 required more than 60 times fewer ops per inference when running deep neural networks as compared to Loihi 1 (without a loss in accuracy).

Intel Unveils Loihi 2, Its Second-Generation Neuromorphic Chip, HPCWire

Displaying Intel-MPI Debug Information

The Detailed Information can be found at Displaying MPI Debug Information

The I_MPI_DEBUG environment variable provides a convenient way to get detailed information about an MPI application at runtime. You can set the variable value from 0 (the default value) to 1000. The higher the value, the more debug information you get.

High values of I_MPI_DEBUG can output a lot of information and significantly reduce performance of your application. A value of I_MPI_DEBUG=5 is generally a good starting point, which provides sufficient information to find common errors.

Displaying MPI Debug Information

To redirect the debug information output from stdout to stderr or a text file, use the I_MPI_DEBUG_OUTPUT environment variable

$ mpirun -genv I_MPI_DEBUG=5 -genv I_MPI_DEBUG_OUTPUT=debug_output.txt -n 32 ./mpi_program

I_MPI_DEBUG Arguments

<level>Indicate the level of debug information provided.
0Output no debugging information. This is the default value.
1Output libfabric* version and provider.
2Output information about the tuning file used.
3Output effective MPI rank, pid and node mapping table.
4Output process pinning information.
5Output environment variables specific to the Intel® MPI Library.
> 5Add extra levels of debug information.
<flags>Comma-separated list of debug flags
pidShow process id for each debug message.
tidShow thread id for each debug message for multithreaded library.
timeShow time for each debug message.
datetimeShow time and date for each debug message.
hostShow host name for each debug message.
levelShow level for each debug message.
scopeShow scope for each debug message.
lineShow source line number for each debug message.
fileShow source file name for each debug message.
nofuncDo not show routine name.
norankDo not show rank.
nousrwarnSuppress warnings for improper use case (for example, incompatible combination of controls).
flockSynchronize debug output from different process or threads.
nobufDo not use buffered I/O for debug output.


  1. Displaying MPI Debug Information
  2. Developer Reference: I_MPI_DEBUG

Installing Intel® oneAPI AI Analytics Toolkit

What is included in the Intel oneAPI AI Analytics Toolkit? For more information, do take a look at Intel OneAPI Al Analytics Toolkit

  • Intel® Distribution for Python*
  • Intel® Distribution of Modin* (via Anaconda distribution of the toolkit using the Conda package manager)
  • Intel® Low Precision Optimization Tool
  • Intel® Optimization for PyTorch*
  • Intel® Optimization for TensorFlow*
  • Model Zoo for Intel® Architecture
  • Download size: 2.18 GB
  • Date: August 2, 2021
  • Version: 2021.3

Command Line Installation

wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18040/l_AIKit_p_2021.3.0.1370_offline.sh

sudo bash l_AIKit_p_2021.3.0.1370_offline.sh

Installation Instruction

Step 1: From the console, locate the downloaded install file.
Step 2: Use $ sudo sh ./<installer>.sh to launch the GUI Installer as the root.
Optionally, use $ sh ./<installer>.sh to launch the GUI Installer as the current user.
Step 3: Follow the instructions in the installer.
Step 4: Explore the Get Started Guide.


  1. Intel OneAPI Al Analytics Toolkit

Installing Intel OneAPI HPC Toolkit for Linux

What is included in the OneAPI Installer? For more information, do take a look at Get the Intel® oneAPI HPC Toolkit

  • Intel® oneAPI DPC++/C++ Compiler
  • Intel® oneAPI Fortran Compiler
  • Intel® C++ Compiler Classic
  • Intel® Cluster Checker
  • Intel® Inspector
  • Intel® MPI Library
  • Intel® Trace Analyzer and Collector
  • Download size: 1.25 GB
  • Version: 2021.3
  • Date: June 21, 2021
wget https://registrationcenter-download.intel.com/akdlm/irc_nas/17912/l_HPCKit_p_2021.3.0.3230_offline.sh

sudo bash l_HPCKit_p_2021.3.0.3230_offline.sh

Installation Instruction:

  • Step 1: From the console, locate the downloaded install file.
  • Step 2: Use $ sudo sh ./<installer>.sh to launch the GUI Installer as root.
    Optionally, use $ sh ./<installer>.sh to launch the GUI Installer as current user.
  • Step 3: Follow the instructions in the installer.
  • Step 4: Explore the Get Started Guide.


Intel Ponte Vecchio playing Catch Up with AMD and Nvidia

Intel recently announced details on their forthcoming data center GPU, the Xe HPC, code named Ponte Vecchio (PVC). Intel daringly implied that the peak performance of the PVC GPU would be roughly twice that of today’s fastest GPU, the Nvidia A100. PVC and Sapphire Rapids (the multi-tile next-gen Xeon) are being used to build Aurora, the Argonne National Lab’s Exascale supercomputer, in 2022, so this technology should finally be just around the corner.

Intel is betting on this first-generation datacenter GPU for HPC to finally catch up with Nvidia and AMD, both for HPC (64-bit floating point) and AI (8 and 16-bit integer and 16-bit floating point). The Xe HPC device is a multi-tiled, multi-process-node package with new GPU cores, HBM2e memory, a new Xe Link interconnect, and PCIe Gen 5 implemented with over 100-billion transistors. That is nearly twice the size of the 54-billion Nvidia A100 chip. At that size, power consumption could be an issue at high frequencies. Nonetheless, the Xe design clearly demonstrates that Intel gets it; packaging smaller dies helps reduce development and manufacturing costs, and can improve time to market.

Intel Lays Down The Gauntlet For AMD And Nvidia GPUs by Frobes

oneAPI DevSummit, Asia-Pacific and Japan

This one-day, LIVE virtual conference features talks, panels, and a hands-on learning experience focused on using oneAPI, DPC++, and AI/ML to accelerate performance of cross-architecture workloads (CPU, GPU, FPGA, and other accelerators).

Register now to:

  • Connect with fellow developers and innovators.
  • Learn about the latest developer tools for oneAPI.
  • Hear from thought leaders in industry and academia who are working on innovative cross-platform, multi-vendor oneAPI solutions.
  • Discover real world projects using oneAPI to accelerate data science and AI pipelines.
  • Dive into a hands-on session on Intel® oneAPI toolkits for HPC and AI applications.
  • Join a vibrant community supporting each other using oneAPI, DPC++ and AI.

To Register

Full Scheduled Event

Intel Add AI/ML Improvements to Sapphire Rapids with AMX

The full article is taken from With AMX, Intel Adds AI/ML Sparkle to Sapphire Rapids

Picture taken from With AMX, Intel Adds AI/ML Sparkle to Sapphire Rapids

The best way for now to think of AMX is that it’s a matrix math overlay for the AVX-512 vector math units, as shown below. We can think of it like a “TensorCore” type unit for the CPU. The details about what this is were only a short snippet of the overall event, but it at least gives us an idea of how much space Intel is granting to training and inference specifically.

Data comes directly into the tiles while at the same time, the host hops ahead and dispatches the loads for the toles. TMUL operates on data the moment it’s ready. At the end of each multiplication round, the tiles move to cache and SIMD post-processing and storing. The goal on the software side is to make sure both the host and AMX unit are running simultaneously.

The prioritization for AMX toward real-world AI workloads also meant a reckoning for how users were considering training versus inference. While the latency and programmability benefits of having training stay local are critical, and could well be a selling point for scalable training workloads on the CPU, inference has been the sweet spot for Intel thus far and AMX caters to that realization.

From The Next Platform “With AMX, Intel Adds AI/ML Sparkle to Sapphire Rapids”