Heterogeneous computing comes with the challenge of designing code that can work in multi-processor/accelerator environments. Developers need to be equipped with the right set of metrics to make informed design and optimization decisions that take advantage of target hardware.
In Part 1 of this 2-part webinar series, Technical Consulting Engineer Cory Levels focuses on designing software for efficient offload from CPUs to GPUs, even before final hardware is available, using Intel® Advisor. Using a walkthrough of an ISO 3DFD example (3D isotropic finite difference), you will learn how to:
Optimize your CPU application for memory and compute
Identify efficient GPU offload opportunities and quantify the potential performance speed up
See performance headroom of your GPU offloaded code against hardware limitations, and get insights for an effective optimization roadmap
For more information, see the Intel site.
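One common way to reason about "performance headroom against hardware limitations" is a roofline-style bound. The sketch below is only an illustration of that idea, not Intel Advisor's actual model: the function names and every number in it (peak compute, memory bandwidth, arithmetic intensity, measured performance) are hypothetical.

```python
def roofline_bound(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Attainable GFLOP/s: the lower of the compute roof and the memory roof."""
    memory_roof = bandwidth_gbs * intensity_flop_per_byte
    return min(peak_gflops, memory_roof)


def headroom(measured_gflops, bound_gflops):
    """How many times faster the kernel could run before hitting the bound."""
    return bound_gflops / measured_gflops


# Hypothetical device and kernel figures, chosen only for illustration.
bound = roofline_bound(peak_gflops=1000.0, bandwidth_gbs=100.0,
                       intensity_flop_per_byte=2.5)
print(bound)                   # memory-bound here: 100 GB/s * 2.5 FLOP/byte
print(headroom(100.0, bound))  # remaining speedup for a kernel at 100 GFLOP/s
```

In this made-up case the memory roof, not the compute roof, limits the kernel, which would suggest focusing the optimization roadmap on data movement first.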
Various processors and pieces of code are often compared to brains, but neuromorphic chips mimic neurological systems much more directly through the use of computational “neurons” that communicate with one another. Intel’s first-generation Loihi chip, introduced in 2017, has around 128,000 of those digital neurons. Over the ensuing four years, Loihi has been packed into increasingly large systems, learned to touch and even been taught to smell.
Now, it’s getting a new family member: Loihi 2. In its press release, Intel said that years of testing with the first-generation Loihi chip helped them to design a second generation with up to ten times the processing speed; up to 15 times greater resource density; and up to a million computational neurons per chip – more than seven times those in the first generation. Intel reports that early tests have shown that Loihi 2 required more than 60 times fewer ops per inference when running deep neural networks as compared to Loihi 1 (without a loss in accuracy).
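As a quick sanity check on the "more than seven times" figure, the arithmetic can be worked through using only the two neuron counts quoted above (around 128,000 for the first generation, up to a million for Loihi 2). The variable names are just for illustration.

```python
# Neuron counts as quoted in the text (both are approximate figures).
loihi1_neurons = 128_000
loihi2_neurons = 1_000_000

ratio = loihi2_neurons / loihi1_neurons
print(f"Loihi 2 has about {ratio:.2f}x the neurons of Loihi 1")
```

The ratio comes out just under eight, consistent with the "more than seven times" claim.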
The I_MPI_DEBUG environment variable provides a convenient way to get detailed information about an MPI application at runtime. You can set the variable value from 0 (the default value) to 1000. The higher the value, the more debug information you get.
High values of I_MPI_DEBUG can output a lot of information and significantly reduce performance of your application. A value of I_MPI_DEBUG=5 is generally a good starting point, which provides sufficient information to find common errors.
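A minimal sketch of setting the variable for a single run is shown below. It assumes an `mpirun` launcher on the PATH and a hypothetical application binary `./my_app`; the helper names are made up for this example, and the actual launch is left commented out so the snippet runs without MPI installed.

```python
import os


def mpi_debug_env(level=5, base_env=None):
    """Return a copy of the environment with I_MPI_DEBUG set to `level`."""
    # The text gives the valid range as 0 (default) to 1000.
    if isinstance(level, bool) or not isinstance(level, int) or not 0 <= level <= 1000:
        raise ValueError("I_MPI_DEBUG must be an integer from 0 to 1000")
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_DEBUG"] = str(level)
    return env


def launch_command(app, ranks=4):
    """Build an `mpirun -n <ranks> <app>` command line (app name is hypothetical)."""
    return ["mpirun", "-n", str(ranks), app]


env = mpi_debug_env(5, base_env={"PATH": "/usr/bin"})
cmd = launch_command("./my_app")
print(" ".join(cmd), "with I_MPI_DEBUG =", env["I_MPI_DEBUG"])

# To actually launch (requires an MPI installation):
# import subprocess
# subprocess.run(cmd, env=mpi_debug_env(5), check=True)
```

Passing the variable through a per-run environment, rather than exporting it globally, keeps the extra debug output and its performance cost confined to the runs where it is wanted.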
Step 1: From the console, locate the downloaded install file.
Step 2: Use $ sudo sh ./<installer>.sh to launch the GUI Installer as root. Optionally, use $ sh ./<installer>.sh to launch the GUI Installer as the current user.
Step 3: Follow the instructions in the installer.
Step 4: Explore the Get Started Guide.
Intel recently announced details on their forthcoming data center GPU, the Xe HPC, code named Ponte Vecchio (PVC). Intel daringly implied that the peak performance of the PVC GPU would be roughly twice that of today’s fastest GPU, the Nvidia A100. PVC and Sapphire Rapids (the multi-tile next-gen Xeon) are being used to build Aurora, the Argonne National Lab’s Exascale supercomputer, in 2022, so this technology should finally be just around the corner.
Intel is betting on this first-generation datacenter GPU for HPC to finally catch up with Nvidia and AMD, both for HPC (64-bit floating point) and AI (8 and 16-bit integer and 16-bit floating point). The Xe HPC device is a multi-tiled, multi-process-node package with new GPU cores, HBM2e memory, a new Xe Link interconnect, and PCIe Gen 5, implemented with over 100 billion transistors. That is nearly twice the transistor count of the 54-billion-transistor Nvidia A100 chip. At that size, power consumption could be an issue at high frequencies. Nonetheless, the Xe design clearly demonstrates that Intel gets it: packaging smaller dies helps reduce development and manufacturing costs, and can improve time to market.
This one-day, LIVE virtual conference features talks, panels, and a hands-on learning experience focused on using oneAPI, DPC++, and AI/ML to accelerate performance of cross-architecture workloads (CPU, GPU, FPGA, and other accelerators).
Register now to:
Connect with fellow developers and innovators.
Learn about the latest developer tools for oneAPI.
Hear from thought leaders in industry and academia who are working on innovative cross-platform, multi-vendor oneAPI solutions.
Discover real world projects using oneAPI to accelerate data science and AI pipelines.
Dive into a hands-on session on Intel® oneAPI toolkits for HPC and AI applications.
Join a vibrant community supporting each other using oneAPI, DPC++ and AI.
Intel started a brand new architecture, built for scalability and designed to take advantage of the most advanced silicon technologies: Xe HPC. With incredible hardware like Ponte Vecchio and an open, standards-based software stack in oneAPI, Intel is already seeing leadership performance in AI workloads like ResNet-50.
The best way for now to think of AMX is that it’s a matrix math overlay for the AVX-512 vector math units. We can think of it like a “TensorCore” type unit for the CPU. The details were only a short snippet of the overall event, but they at least give us an idea of how much space Intel is granting to training and inference specifically.
Data comes directly into the tiles while, at the same time, the host hops ahead and dispatches the loads for the tiles. TMUL operates on data the moment it’s ready. At the end of each multiplication round, the tiles move to cache and SIMD post-processing and storing. The goal on the software side is to make sure both the host and AMX unit are running simultaneously.
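The tile-by-tile flow described above can be illustrated with an ordinary blocked matrix multiply. This is a conceptual sketch in plain Python, not AMX code: it uses no AMX instructions, the `tile` size is an arbitrary assumption, and the function name is made up for this example. Each pass of the inner block loop stands in for one multiplication round on a pair of loaded tiles, accumulating into an output tile.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply matrices a (n x k) and b (k x m) one tile-sized block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b must match")
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # rows of the output tile
        for j0 in range(0, m, tile):      # columns of the output tile
            for k0 in range(0, k, tile):  # one "multiplication round" per k-block
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]     # running sum held in the output tile
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3],
     [4, 5, 6]]
b = [[7, 8],
     [9, 10],
     [11, 12]]
print(tiled_matmul(a, b))  # [[58, 64], [139, 154]]
```

In hardware, the loads for the next pair of tiles would be dispatched while the current round is still multiplying, which is the overlap between host and AMX unit that the paragraph above describes; this sequential sketch shows only the blocking, not that overlap.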
The prioritization for AMX toward real-world AI workloads also meant a reckoning for how users were considering training versus inference. While the latency and programmability benefits of having training stay local are critical, and could well be a selling point for scalable training workloads on the CPU, inference has been the sweet spot for Intel thus far and AMX caters to that realization.
From The Next Platform “With AMX, Intel Adds AI/ML Sparkle to Sapphire Rapids”