Intel Accelerates Process and Packaging Innovations

Taken from YouTube – Intel Newsroom

During the “Intel Accelerated” webcast, Intel’s technology leaders revealed one of the most detailed process and packaging technology roadmaps the company has provided. The event on July 26, 2021, showcased a series of foundational innovations that will power products through 2025 and beyond. As part of the presentations, Intel announced RibbonFET, its first new transistor architecture in more than a decade, and PowerVia, an industry-first new backside power delivery method. (Credit: Intel Corporation)

Intel Delays Sapphire Rapids Production to Q1 2022

From the Next Platform

Intel has delayed production of its next-generation Xeon Scalable CPUs, code-named Sapphire Rapids, to the first quarter of 2022 and said it will not start ramping shipments until at least April of next year.

Lisa Spelman, vice president and general manager of Intel's Xeon and Memory Group, said Intel is delaying Sapphire Rapids, the 10-nanometer successor to the recently launched Ice Lake server processors, because of the extra time needed to validate the CPU.

“Given the breadth of enhancements in Sapphire Rapids, we are incorporating additional validation time prior to the production release, which will streamline the deployment process for our customers and partners. Based on this, we now expect Sapphire Rapids to be in production in the first quarter of 2022, with ramp beginning in the second quarter of 2022,” Spelman wrote.

CRN (Intel Delays Sapphire Rapids Xeon CPU Production To Q1 2022)

For more information, do read Intel Delays Sapphire Rapids Xeon CPU Production To Q1 2022

Intel Video-on-Demand at ISC21

Intel at ISC21

For more information, do take a look at https://hpcevents.intel.com/lobby

Selected Videos

Accelerating the Possibilities with HPC
Trish Damkroger, VP and GM, High Performance Computing Group, Intel Corporation

Building HPC Systems with Intel for Today and Tomorrow
Jeff Watters, Director of the HPC Portfolio and Strategic Engagements, Intel Corporation

CXL Fireside Chat
Stephen Van Doren, Intel Fellow, Director of Processor Interconnect Architecture, Intel Corporation

Intel System Server D50TNP for HPC
Scott Misage, Manager Product Development & Architecture, Intel Corporation
Brian Caslis, Product Line Manager, Intel Corporation
Jim Russell, Project Design Manager, Intel Corporation

Ice Lake, Together with Mellanox Interconnect Solutions, Deliver Best in Class Performance for HPC Applications
Gilad Shainer, Senior Vice President Marketing, NVIDIA

Optimizing a Memory-Intensive Simulation Code for Heterogenous Optane Memory Systems
Steffen Christgau, HPC Consultant, Zuse Institute Berlin

For more videos, see https://hpcevents.intel.com/sessions

Webinar – Advanced code optimization for 3rd Gen Intel® Xeon® Scalable Processors

An interesting webinar which you may be interested to sign up for, free of charge. The registration site can be found at https://techdecoded.intel.io/webinar-registration/upcoming-webinars/

If the data center is part of your development wheelhouse, you’re likely familiar with a little CPU called “Xeon”. This webinar unpacks the latest methodologies of tuning complex AI and HPC workloads for the third generation Xeon platform (formerly code-named Ice Lake).

Delivering up to 40 cores per processor, 3rd Gen Intel® Xeon® Scalable processors are designed for compute-intense, data-centric workloads spanning the cloud to the network and the edge.

In this session, Intel engineer Vladimir Tsymbal will show you how to optimize your AI and HPC applications and solutions to unlock the full spectrum of these processors’ power. You’ll learn:

  • The top-down tuning methodology that uses Xeon hardware-performance metrics to identify issues, including critical bottlenecks caused by data locality, CPU interconnect bandwidth, cache limitations, instruction execution stalls, and I/O interfaces (a command-line sketch follows this list)
  • How a high-level HPC Characterization Analysis helps you find inefficient parallel tasks
  • How to optimize software that uses the latest Intel® DL Boost VNNI instructions
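
The webinar does not prescribe a particular tool in this abstract, but Intel® VTune™ Profiler exposes exactly these analysis types from the command line. A minimal, hedged sketch, with ./my_app standing in for your own binary:

# Top-down microarchitecture analysis: execution stalls, cache and memory bottlenecks
vtune -collect uarch-exploration -result-dir r_uarch -- ./my_app

# HPC characterization analysis: parallel efficiency, memory bandwidth, FPU utilization
vtune -collect hpc-performance -result-dir r_hpc -- ./my_app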

Intel MPI Library Over Libfabric*

Taken from Intel Performance Libraries, Intel® MPI Library Over Libfabric*

What is Libfabric?

Libfabric is a low-level communication abstraction for high-performance networks. It hides most transport and hardware implementation details from middleware and applications to provide high-performance portability between diverse fabrics.
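
Libfabric also ships a small fi_info utility that queries the fabric interfaces available on a node; a quick, hedged way to see which providers your installation exposes (output varies by system):

# List the providers and endpoint types libfabric can use on this node
fi_info

# Restrict the query to a single provider, for example TCP
fi_info -p tcp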

Using the Intel MPI Library Distribution of Libfabric

By default, mpivars.sh sets the environment to the version of libfabric shipped with the Intel MPI Library. To disable this, use the I_MPI_OFI_LIBRARY_INTERNAL environment variable or the -ofi_internal argument to mpivars.sh (the default is -ofi_internal=1):

# source /usr/local/intel/2018u3/impi/2018.3.222/bin64/mpivars.sh -ofi_internal=1
# I_MPI_DEBUG=4 mpirun -n 1 IMB-MPI1 barrier
[0] MPI startup(): libfabric version: 1.7.2a-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       130358   hpc-n1  {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 4, MPI-1 part
#------------------------------------------------------------
# Date                  : Thu May 20 12:57:03 2021
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-693.el7.x86_64
# Version               : #1 SMP Tue Aug 22 21:09:27 UTC 2017
# MPI Version           : 3.1
# MPI Thread Environment:


# Calling sequence was:

# IMB-MPI1 barrier

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Barrier

#---------------------------------------------------
# Benchmarking Barrier
# #processes = 1
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000         0.08         0.08         0.08


# All processes entering MPI_Finalize

Changing to -ofi_internal=0:

# source /usr/local/intel/2018u3/impi/2018.3.222/bin64/mpivars.sh -ofi_internal=0
# I_MPI_DEBUG=4 mpirun -n 1 IMB-MPI1 barrier
[0] MPI startup(): libfabric version: 1.1.0-impi
[0] MPI startup(): libfabric provider: mlx
.....
.....
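
The same selection can be made through the environment variable mentioned above instead of the mpivars.sh argument; a minimal sketch of the equivalent run:

# Equivalent to -ofi_internal=0: pick up an external libfabric instead of the bundled one
export I_MPI_OFI_LIBRARY_INTERNAL=0
I_MPI_DEBUG=4 mpirun -n 1 IMB-MPI1 barrier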

Common OFI Controls

To select the OFI provider from the libfabric library, you can define the name of the OFI provider to load:

export I_MPI_OFI_PROVIDER=tcp
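
As in the earlier runs, I_MPI_DEBUG confirms which provider was actually loaded; a small sketch (the rank count and benchmark are illustrative):

export I_MPI_OFI_PROVIDER=tcp
# The startup banner should now report a line like:
#   [0] MPI startup(): libfabric provider: tcp
I_MPI_DEBUG=4 mpirun -n 2 IMB-MPI1 pingpong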

Logging Interfaces

FI_LOG_LEVEL=<level> controls the amount of logging data that is output. The following log levels are defined:

  • Warn: Warn is the least verbose setting and is intended for reporting errors or warnings.
  • Trace: Trace is more verbose and is meant to include non-detailed output helpful for tracing program execution.
  • Info: Info is high traffic and meant for detailed output.
  • Debug: Debug is high traffic and is likely to impact application performance. Debug output is only available if the library has been compiled with debugging enabled.
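
The level names are passed in lower case; for example, to raise libfabric logging to info for a single run (a minimal sketch, following the runs above):

# Verbose libfabric logging for this run only
export FI_LOG_LEVEL=info
I_MPI_DEBUG=4 mpirun -n 1 IMB-MPI1 barrier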

References:

  1. Intel® MPI Library Over Libfabric*
  2. New MPI error with Intel 2019.1, unable to run MPI hello world

Analyzing Memory and Threading Correctness for GPU-Offloaded Code

Modern workloads are diverse—and so are architectures. No single architecture is best for every workload. Maximizing performance takes a mix of scalar, vector, matrix, and spatial architectures deployed in CPU, GPU, FPGA, and other future accelerators. Heterogeneity adds complexity that can be difficult to debug. This article introduces the new features of Intel® Inspector that support the analysis of code that’s offloaded to accelerators.
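
Inspector is driven from the command line with a collect-then-report flow, which the new GPU-offload analyses described in the article also follow. A hedged sketch of the classic CPU-side checks, with ./my_app as a placeholder for your own binary:

# Deep memory-error analysis: leaks, invalid and uninitialized accesses
inspxe-cl -collect mi3 -result-dir r_mem -- ./my_app

# Threading analysis: data races and deadlocks
inspxe-cl -collect ti2 -result-dir r_thr -- ./my_app

# Summarize detected problems from a result directory
inspxe-cl -report problems -result-dir r_mem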

For more information: Analyzing Memory and Threading Correctness for GPU-Offloaded Code

Intel® FPGA PAC Can Filter, Aggregate, Sort, and Convert Files Faster than Software Alone

This article is taken from Intel's Data Processing Tests by NTT DATA Suggest That an Intel® FPGA PAC Can Filter, Aggregate, Sort, and Convert Files 4X Faster than Software Alone.

Nearly 80% of total data processing time is spent on tasks such as filtering, aggregation, sorting, and format conversion. NTT DATA conducted proof-of-concept tests aimed at improving data processing performance for these tasks. The tests employed an Intel® FPGA Programmable Acceleration Card (Intel® FPGA PAC) to process Linux audit logs, resulting in processing speeds more than four times faster than the same processing done exclusively in software.

Two factors drove this exercise:

  1. The advent of Intel FPGA PACs and associated technologies has made it far easier for companies to incorporate FPGAs as processing elements in data center servers.
  2. HLS technology, which enables engineers to use programming languages with C-like syntax, makes it easier for software engineers to develop applications that target FPGAs.

For more information, do take a look at Data Processing Tests by NTT DATA Suggest That an Intel® FPGA PAC Can Filter, Aggregate, Sort, and Convert Files 4X Faster than Software Alone