oneAPI DevSummit, Asia-Pacific and Japan

This one-day, LIVE virtual conference features talks, panels, and a hands-on learning experience focused on using oneAPI, DPC++, and AI/ML to accelerate performance of cross-architecture workloads (CPU, GPU, FPGA, and other accelerators).

Register now to:

  • Connect with fellow developers and innovators.
  • Learn about the latest developer tools for oneAPI.
  • Hear from thought leaders in industry and academia who are working on innovative cross-platform, multi-vendor oneAPI solutions.
  • Discover real-world projects using oneAPI to accelerate data science and AI pipelines.
  • Dive into a hands-on session on Intel® oneAPI toolkits for HPC and AI applications.
  • Join a vibrant community supporting each other using oneAPI, DPC++ and AI.

To Register

Full Event Schedule

Intel Adds AI/ML Improvements to Sapphire Rapids with AMX

The following excerpts are taken from With AMX, Intel Adds AI/ML Sparkle to Sapphire Rapids

Picture taken from With AMX, Intel Adds AI/ML Sparkle to Sapphire Rapids

The best way for now to think of AMX is that it’s a matrix math overlay for the AVX-512 vector math units, as shown below. We can think of it like a “TensorCore” type unit for the CPU. The details about what this is were only a short snippet of the overall event, but it at least gives us an idea of how much space Intel is granting to training and inference specifically.

Data comes directly into the tiles while, at the same time, the host hops ahead and dispatches the loads for the next tiles. TMUL operates on data the moment it’s ready. At the end of each multiplication round, the tile results move out to cache for SIMD post-processing and storing. The goal on the software side is to make sure both the host and the AMX unit are running simultaneously.
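To make that flow concrete, below is a rough sketch (my own, not from the article) using the AMX tile intrinsics that recent GCC and Clang expose (build with gcc -O2 -mamx-tile -mamx-int8). The tile numbers, dimensions, and the Linux permission request are illustrative assumptions, not details Intel disclosed at the event.

#include <immintrin.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_REQ_XCOMP_PERM 0x1023   /* Linux: ask the kernel to enable AMX state */
#define XFEATURE_XTILEDATA  18

/* 64-byte tile configuration block defined by the AMX architecture. */
typedef struct {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   /* bytes per row of each tile */
    uint8_t  rows[16];    /* rows of each tile          */
} tilecfg;

static int8_t  A[16][64], B[16][64];   /* int8 input tiles (left zero for brevity) */
static int32_t C[16][16];              /* int32 accumulator tile                   */

int main(void)
{
    /* The kernel keeps the large AMX tile state disabled until a process asks. */
    syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);

    tilecfg cfg = {0};
    cfg.palette_id = 1;
    cfg.colsb[0] = 64; cfg.rows[0] = 16;   /* tile 0: C, 16x16 int32 */
    cfg.colsb[1] = 64; cfg.rows[1] = 16;   /* tile 1: A, 16x64 int8  */
    cfg.colsb[2] = 64; cfg.rows[2] = 16;   /* tile 2: B, 16x64 int8  */
    _tile_loadconfig(&cfg);

    _tile_loadd(1, A, 64);      /* host dispatches loads into the tiles  */
    _tile_loadd(2, B, 64);
    _tile_zero(0);
    _tile_dpbssd(0, 1, 2);      /* TMUL: C += A * B on int8 data         */
    _tile_stored(0, C, 64);     /* results move out to cache for SIMD
                                   post-processing and storing           */
    _tile_release();
    return 0;
}

Keeping the _tile_loadd stream for the next tiles going while TMUL works on the current ones is exactly the host/AMX overlap the paragraph above describes.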

The prioritization for AMX toward real-world AI workloads also meant a reckoning for how users were considering training versus inference. While the latency and programmability benefits of having training stay local are critical, and could well be a selling point for scalable training workloads on the CPU, inference has been the sweet spot for Intel thus far, and AMX caters to that reality.

From The Next Platform “With AMX, Intel Adds AI/ML Sparkle to Sapphire Rapids”

Intel Accelerates Process and Packaging Innovations

Taken from YouTube – Intel Newsroom

During the “Intel Accelerated” webcast, Intel’s technology leaders revealed one of the most detailed process and packaging technology roadmaps the company has provided. The event on July 26, 2021, showcased a series of foundational innovations that will power products through 2025 and beyond. As part of the presentations, Intel announced RibbonFET, its first new transistor architecture in more than a decade, and PowerVia, an industry-first new backside power delivery method. (Credit: Intel Corporation)

Delays in Intel Sapphire Rapids Production to Q1 2022

From CRN

Intel has delayed production of its next-generation Xeon Scalable CPUs, code-named Sapphire Rapids, to the first quarter of 2022 and said it will not start ramping shipments until at least April of next year.

Lisa Spelman, corporate vice president and general manager of Intel’s Xeon and Memory Group, said Intel is delaying Sapphire Rapids, the 10-nanometer successor to the recently launched Ice Lake server processors, because of the extra time needed to validate the CPU.

“Given the breadth of enhancements in Sapphire Rapids, we are incorporating additional validation time prior to the production release, which will streamline the deployment process for our customers and partners. Based on this, we now expect Sapphire Rapids to be in production in the first quarter of 2022, with ramp beginning in the second quarter of 2022,” Spelman wrote.

CRN (Intel Delays Sapphire Rapids Xeon CPU Production To Q1 2022)

For more information, do read Intel Delays Sapphire Rapids Xeon CPU Production To Q1 2022

Intel Video-on-Demand at ISC21

Intel at ISC21

For more information, do take a look at https://hpcevents.intel.com/lobby

Selected Videos

Accelerating the Possibilities with HPC
by Trish Damkroger, VP and GM, High Performance Computing Group, Intel Corporation

Building HPC Systems with Intel for Today and Tomorrow
Jeff Watters, Director of the HPC Portfolio and Strategic Engagements, Intel Corporation

CXL Fireside Chat
Stephen Van Doren, Intel Fellow, Director of Processor Interconnect Architecture, Intel Corporation

Intel System Server D50TNP for HPC
Scott Misage, Manager Product Development & Architecture, Intel Corporation
Brian Caslis, Product Line Manager, Intel Corporation
Jim Russell, Project Design Manager, Intel Corporation

Ice Lake, Together with Mellanox Interconnect Solutions, Deliver Best in Class Performance for HPC Applications
Gilad Shainer, Senior Vice President Marketing, NVIDIA

Optimizing a Memory-Intensive Simulation Code for Heterogeneous Optane Memory Systems
Steffen Christgau, HPC Consultant, Zuse Institute Berlin

For more videos, see https://hpcevents.intel.com/sessions

Webinar – Advanced code optimization for 3rd Gen Intel® Xeon® Scalable Processors

This is an interesting webinar which you can sign up for free of charge. The registration site can be found at https://techdecoded.intel.io/webinar-registration/upcoming-webinars/

If the data center is part of your development wheelhouse, you’re likely familiar with a little CPU called “Xeon”. This webinar unpacks the latest methodologies of tuning complex AI and HPC workloads for the third generation Xeon platform (formerly code-named Ice Lake).

Delivering up to 40 cores per processor, 3rd Gen Intel® Xeon® Scalable processors are designed for compute-intense, data-centric workloads spanning the cloud to the network and the edge.

In this session, Intel engineer Vladimir Tsymbal will show you how to optimize your AI and HPC applications and solutions to unlock the full spectrum of these processors’ power. You’ll learn:

  • The top-down tuning methodology that uses Xeon hardware-performance metrics to identify issues, including critical bottlenecks caused by data locality, CPU interconnect bandwidth, cache limitations, instruction execution stalls, and I/O interfaces
  • How a high-level HPC Characterization Analysis helps you find inefficient parallel tasks
  • How to optimize software that uses the latest Intel® DL Boost VNNI instructions (see the sketch after this list)
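As a taste of the VNNI material, here is a minimal sketch (mine, not the webinar’s): a single _mm512_dpbusd_epi32 instruction fuses the multiply-and-accumulate chain at the heart of int8 inference, replacing the three-instruction sequence needed on pre-VNNI Xeons. Build with gcc -O2 -mavx512f -mavx512vnni.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m512i acc = _mm512_setzero_si512();
    __m512i a   = _mm512_set1_epi8(2);     /* activations (unsigned int8) */
    __m512i b   = _mm512_set1_epi8(3);     /* weights (signed int8)       */

    /* Each 32-bit lane accumulates four adjacent byte products:
       acc[i] += a[4i]*b[4i] + ... + a[4i+3]*b[4i+3]                      */
    acc = _mm512_dpbusd_epi32(acc, a, b);

    int32_t out[16];
    _mm512_storeu_si512(out, acc);
    printf("lane 0 = %d\n", out[0]);       /* 4 * (2*3) = 24 */
    return 0;
}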

Intel MPI Library Over Libfabric*

Taken from Intel Performance Libraries, Intel® MPI Library Over Libfabric*

What is Libfabric?

Libfabric is a low-level communication abstraction for high-performance networks. It hides most transport and hardware implementation details from middleware and applications to provide high-performance portability between diverse fabrics.
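As a minimal illustration (my own sketch, assuming only that libfabric’s development headers are installed), the program below asks libfabric to enumerate every provider it can drive; the provider names it prints are the same ones that appear on the “libfabric provider:” line of the MPI startup output further down. Build with gcc -O2 list_providers.c -lfabric.

#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *info = NULL, *cur;

    /* No hints: ask for everything libfabric can offer on this node. */
    int ret = fi_getinfo(FI_VERSION(1, 7), NULL, NULL, 0, NULL, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo failed: %d\n", ret);
        return 1;
    }

    for (cur = info; cur; cur = cur->next)
        printf("provider: %-12s fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    return 0;
}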

Using the Intel MPI Library Distribution of Libfabric

By default, mpivars.sh sets the environment to the version of libfabric shipped with the Intel MPI Library. To disable this, set the I_MPI_OFI_LIBRARY_INTERNAL environment variable to 0 or pass -ofi_internal=0 to mpivars.sh (the default is -ofi_internal=1):

# source /usr/local/intel/2018u3/impi/2018.3.222/bin64/mpivars.sh -ofi_internal=1
# I_MPI_DEBUG=4 mpirun -n 1 IMB-MPI1 barrier
[0] MPI startup(): libfabric version: 1.7.2a-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       130358   hpc-n1  {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 4, MPI-1 part
#------------------------------------------------------------
# Date                  : Thu May 20 12:57:03 2021
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-693.el7.x86_64
# Version               : #1 SMP Tue Aug 22 21:09:27 UTC 2017
# MPI Version           : 3.1
# MPI Thread Environment:


# Calling sequence was:

# IMB-MPI1 barrier

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Barrier

#---------------------------------------------------
# Benchmarking Barrier
# #processes = 1
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000         0.08         0.08         0.08


# All processes entering MPI_Finalize

With -ofi_internal=0, the externally installed libfabric is used instead:

# source /usr/local/intel/2018u3/impi/2018.3.222/bin64/mpivars.sh -ofi_internal=0
# I_MPI_DEBUG=4 mpirun -n 1 IMB-MPI1 barrier
[0] MPI startup(): libfabric version: 1.1.0-impi
[0] MPI startup(): libfabric provider: mlx
.....
.....

Common OFI Controls

To select the OFI provider from the libfabric library, you can define the name of the OFI provider to load:

export I_MPI_OFI_PROVIDER=tcp

Logging Interfaces

FI_LOG_LEVEL=<level> controls the amount of logging data that is output. The following log levels are defined:

  • Warn: Warn is the least verbose setting and is intended for reporting errors or warnings.
  • Trace: Trace is more verbose and is meant to include non-detailed output helpful for tracing program execution.
  • Info: Info is high traffic and meant for detailed output.
  • Debug: Debug is high traffic and is likely to impact application performance. Debug output is only available if the library has been compiled with debugging enabled.
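For illustration, a small sketch (my own; typical usage is simply export FI_LOG_LEVEL=debug in the shell before mpirun): libfabric reads the variable when it initializes, so it can also be set programmatically before the first libfabric call.

#include <stdlib.h>
#include <rdma/fabric.h>

int main(void)
{
    /* Must be set before libfabric initializes; accepts warn, trace,
       info, or debug, as listed above. */
    setenv("FI_LOG_LEVEL", "debug", 1);

    /* Subsequent libfabric calls emit log lines at the chosen level on
       stderr (debug requires a library built with debugging enabled,
       as noted above). */
    struct fi_info *info = NULL;
    if (fi_getinfo(FI_VERSION(1, 7), NULL, NULL, 0, NULL, &info) == 0)
        fi_freeinfo(info);
    return 0;
}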

References:

  1. Intel® MPI Library Over Libfabric*
  2. New MPI error with Intel 2019.1, unable to run MPI hello world