Traditional methods for performing data reductions are very costly in terms of latency and CPU cycles. The NVIDIA Quantum InfiniBand switch with NVIDIA SHARP technology addresses complex operations such as data reduction in a simplified, efficient way. By reducing data within the switch network, NVIDIA Quantum switches perform the reduction in a fraction of the time of traditional methods.
Nvidia
Ganglia and Gmond Python module for GPUs
If you are running a cluster with NVIDIA GPUs, there is now a Python module for monitoring them through the newly released Python bindings for NVML (NVIDIA Management Library). These bindings are released under the BSD license and give simplified access to GPU metrics such as temperature, memory usage, and utilization.
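As a rough illustration of what the bindings expose, the sketch below reads temperature, memory usage, and utilization for each GPU. It assumes the bindings are importable as `pynvml` and that an NVIDIA driver is loaded. `read_gpu_metrics` and `format_metric_line` are hypothetical helper names for this example and are not part of the Ganglia module.

```python
def read_gpu_metrics():
    """Return one dict per GPU using the NVML Python bindings.

    Assumption: the bindings are installed and importable as `pynvml`.
    """
    import pynvml  # imported lazily so the formatter below works without a GPU

    pynvml.nvmlInit()
    try:
        metrics = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
            metrics.append({
                "index": index,
                "temperature_c": pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU),
                "memory_used_mib": memory.used // (1024 * 1024),
                "memory_total_mib": memory.total // (1024 * 1024),
                "utilization_pct": utilization.gpu,
            })
        return metrics
    finally:
        # Always release NVML, even if a query fails part-way through.
        pynvml.nvmlShutdown()


def format_metric_line(metric):
    """Render one GPU's metrics as a single human-readable line."""
    return ("gpu{index}: {temperature_c}C, "
            "{memory_used_mib}/{memory_total_mib} MiB, "
            "{utilization_pct}% util").format(**metric)
```

On a GPU node, looping over `read_gpu_metrics()` and printing `format_metric_line(gpu)` for each entry should give one summary line per device.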
Nvidia Developer – Ganglia Monitoring System

To install the Ganglia plug-in on your Ganglia installation, see these download links:
For more information see:
Acknowledgements:
- Bernard Li (Lawrence Berkeley National Laboratory)
- Jeremy Enos (National Center for Supercomputing Applications)
Basic Commands for Mellanox Network Switches for Break-Out Ports
More information can be found at Command Line Interface (CLI)
Point 1: To configure Break-Out
> enable
# configure terminal
# interface ethernet ?
R2-R8-LEAF01 [standalone: master] (config) # interface ethernet ?
<Device/Port>[-<Device/Port>]
1/1/1
1/1/2
1/1/3
1/1/4
1/3/1
1/3/2
1/3/3
1/3/4
1/5/1
1/5/2
1/5/3
1/5/4
1/7/1
1/7/2
1/7/3
1/7/4
1/9/1
1/9/2
1/9/3
1/9/4
.....
.....
1/25
1/26
1/27
1/28
1/29
1/30
1/31
1/32
# interface ethernet 1/25 shutdown
# interface ethernet 1/26 shutdown
# interface ethernet 1/25
(config interface ethernet 1/25) # module-type qsfp-split-4 force
The resulting interfaces will be:
Ethernet 1/25/1
Ethernet 1/25/2
Ethernet 1/25/3
Ethernet 1/25/4
The speed can then be configured on each split interface, for example:
# interface ethernet 1/25/1
# speed 25G
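The naming pattern above can be sketched in Python. This only illustrates how a split port maps to its sub-interfaces. `split_interface_names` and `speed_commands` are hypothetical helpers, not a switch API, and the one-line `interface ethernet <port> speed <value>` form is an assumption modelled on the one-line `shutdown` commands shown earlier.

```python
def split_interface_names(port, ways=4):
    """Return the sub-interface names created when `port` is split.

    `port` is a <Device/Port> string such as "1/25"; `ways` is the
    split factor (4 for qsfp-split-4). Sub-ports are numbered from 1,
    as in the listing above.
    """
    if ways < 2:
        raise ValueError("a split needs at least 2 sub-interfaces")
    return ["{}/{}".format(port, lane) for lane in range(1, ways + 1)]


def speed_commands(port, ways=4, speed="25G"):
    """Build one CLI line per sub-interface that sets its speed.

    Assumption: the one-line command form is accepted for `speed`.
    """
    return ["interface ethernet {} speed {}".format(name, speed)
            for name in split_interface_names(port, ways)]


for line in speed_commands("1/25"):
    print(line)
```

Generating the lines this way can reduce typing mistakes when many ports are split, but the output should still be checked against the switch documentation before it is pasted into a session.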
Cannot install the best candidate for the job for CUDA Drivers and Rocky Linux 8.5
I followed the blog post Installing Nvidia Drivers on Rocky Linux 8.5, but ran into an error that I had not encountered before.
Error:
Problem 1: package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
Problem 2: package cuda-drivers-515.48.07-1.x86_64 requires nvidia-kmod >= 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
Problem 3: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
Problem 4: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-modprobe-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
Problem 5: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-settings-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
Problem 6: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-xconfig-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
The hint is that dkms is required.
nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
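With solver output this long, it can help to pull out only the unmet dependencies. The sketch below is a hypothetical helper (not part of dnf) that scans the error text for `nothing provides X needed by Y` lines. The regular expression is an assumption based on the messages shown above.

```python
import re

# Matches lines such as:
#   - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
MISSING_RE = re.compile(r"nothing provides (\S+) needed by (\S+)")


def missing_dependencies(dnf_output):
    """Map each missing capability to the sorted list of packages needing it."""
    needed_by = {}
    for line in dnf_output.splitlines():
        match = MISSING_RE.search(line)
        if match:
            capability, package = match.groups()
            needed_by.setdefault(capability, set()).add(package)
    return {cap: sorted(pkgs) for cap, pkgs in needed_by.items()}


sample = """\
Problem 1: package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
"""
print(missing_dependencies(sample))
```

For the error above this reduces six "Problem" blocks to a single missing capability, `dkms`, which is what the next steps install.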
Enable EPEL Repository
# dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# dnf config-manager --enable epel
Install dkms
# dnf install dkms*
Install the latest Nvidia drivers (if possible).
# dnf module install nvidia-driver:latest
If an error like the following appears:
Last metadata expiration check: 0:01:01 ago on Mon 06 Jun 2022 08:47:40 PM EDT.
Error:
Problem 1: package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
Problem 2: package cuda-drivers-515.48.07-1.x86_64 requires nvidia-kmod >= 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
Problem 3: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
Problem 4: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-modprobe-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
Problem 5: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-settings-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
Problem 6: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-xconfig-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
You will notice that the dkms issue has been resolved. Try installing the module without specifying the nvidia-driver:latest stream:
# dnf module install nvidia-driver
===================================================================================================================================================
Package Architecture Version Repository Size
===================================================================================================================================================
Upgrading:
bcc x86_64 0.19.0-5.el8 appstream 674 k
bcc-tools x86_64 0.19.0-5.el8 appstream 447 k
bpftrace x86_64 0.12.1-4.el8 appstream 1.3 M
clang-libs x86_64 13.0.1-1.module+el8.6.0+825+7e27476a appstream 23 M
clang-resource-filesystem x86_64 13.0.1-1.module+el8.6.0+825+7e27476a appstream 13 k
compiler-rt x86_64 13.0.1-1.module+el8.6.0+825+7e27476a appstream 4.2 M
libglvnd x86_64 1:1.3.4-1.el8 appstream 126 k
libglvnd-egl x86_64 1:1.3.4-1.el8 appstream 48 k
libglvnd-gles x86_64 1:1.3.4-1.el8 appstream 39 k
libglvnd-glx x86_64 1:1.3.4-1.el8 appstream 136 k
libomp-devel x86_64 13.0.1-1.module+el8.6.0+825+7e27476a appstream 28 k
llvm-libs x86_64 13.0.1-1.module+el8.6.0+825+7e27476a appstream 24 M
mesa-dri-drivers x86_64 21.3.4-1.el8 appstream 11 M
mesa-filesystem x86_64 21.3.4-1.el8 appstream 33 k
mesa-libxatracker x86_64 21.3.4-1.el8 appstream 2.0 M
python3-bcc x86_64 0.19.0-5.el8 appstream 89 k
Installing group/module packages:
cuda-drivers x86_64 515.48.07-1 cuda-rhel8-x86_64 8.1 k
kmod-nvidia-latest-dkms x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 30 M
nvidia-driver x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 23 M
nvidia-driver-NVML x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 462 k
nvidia-driver-NvFBCOpenGL x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 54 k
nvidia-driver-cuda x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 455 k
nvidia-driver-cuda-libs x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 54 M
nvidia-driver-devel x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 13 k
nvidia-driver-libs x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 177 M
nvidia-kmod-common noarch 3:515.48.07-1.el8 cuda-rhel8-x86_64 13 k
nvidia-libXNVCtrl x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 26 k
nvidia-libXNVCtrl-devel x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 56 k
nvidia-modprobe x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 37 k
nvidia-persistenced x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 43 k
nvidia-settings x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 835 k
nvidia-xconfig x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 106 k
Installing dependencies:
dnf-plugin-nvidia noarch 2.0-1.el8 cuda-rhel8-x86_64 12 k
egl-wayland x86_64 1.1.9-3.el8 appstream 39 k
libX11-devel x86_64 1.6.8-5.el8 appstream 975 k
libXau-devel x86_64 1.0.9-3.el8 appstream 19 k
libglvnd-opengl x86_64 1:1.3.4-1.el8 appstream 46 k
libvdpau x86_64 1.4-2.el8 appstream 40 k
libxcb-devel x86_64 1.13.1-1.el8 appstream 1.1 M
mesa-vulkan-drivers x86_64 21.3.4-1.el8 appstream 6.7 M
ocl-icd x86_64 2.2.12-1.el8 appstream 50 k
opencl-filesystem noarch 1.0-6.el8 appstream 7.3 k
vulkan-loader x86_64 1.3.204.0-2.el8 appstream 133 k
xorg-x11-proto-devel noarch 2020.1-3.el8 appstream 279 k
Installing module profiles:
nvidia-driver/default
Enabling module streams:
nvidia-driver latest-dkms
.....
.....
Finally, run nvidia-smi to confirm that the driver is loaded:
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:A3:00.0 Off | 0 |
| N/A 49C P0 46W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... Off | 00000000:C3:00.0 Off | 0 |
| N/A 53C P0 46W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
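For scripted checks the table above is awkward to parse. nvidia-smi can also emit CSV through its `--query-gpu` and `--format=csv,noheader,nounits` options. The sketch below parses that CSV form. `parse_gpu_csv` and `query_gpus` are hypothetical helper names, and the choice of fields is only an example.

```python
import csv
import io

# Field list passed to: nvidia-smi --query-gpu=<FIELDS> --format=csv,noheader,nounits
FIELDS = ["index", "name", "temperature.gpu", "memory.used",
          "memory.total", "utilization.gpu"]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output into a list of dicts keyed by FIELDS."""
    rows = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for values in reader:
        if not values:
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, (v.strip() for v in values))))
    return rows


def query_gpus():
    """Run nvidia-smi and parse its output (requires a working driver)."""
    import subprocess
    command = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
               "--format=csv,noheader,nounits"]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Values are kept as strings because some fields can report non-numeric placeholders on certain GPUs; convert them only where a number is actually needed.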
How Synthetic Data Supercharges Vision AI Development NVIDIA Webinar

In this meetup you’ll learn how synthetic data is transforming AI development efforts:
- Learn how to use NVIDIA’s Omniverse Replicator to quickly create synthetic data and how it can integrate with NVIDIA TAO training tools.
- Hear from Sky Engine AI, an NVIDIA synthetic data partner, sharing how you can leverage 3rd party synthetic data services.
- Get your questions answered in a live Q&A session with our team of experts.
Register here and select one of the following sessions:
- Americas, Europe, Middle East: Wednesday May 18 – 8am PT | 4pm CET
- Asia-Pacific: Thursday May 19 – 11am SST | 12pm JST/KST | 8:30am IST
EOL notice for Mellanox ConnectX-5 VPI host channel adapters and Switch-IB 2 based EDR InfiniBand Switches
Nvidia Corporation has announced the EOL Notice #LCR-000906 – MELLANOX
PCN INFORMATION:
PCN Number: LCR-000906 – MELLANOX
PCN Description: EOL notice for Mellanox ConnectX-5 VPI host channel adapters and Switch-IB 2 based EDR InfiniBand Switches
Publish Date: Sun May 08 00:00:00 GMT 2022
Type: FYI
Installing Nvidia Drivers on Rocky Linux 8.5
If you are planning to install Nvidia drivers on Rocky Linux 8.5, you may want to use DNF. For a detailed explanation, see Streamlining NVIDIA Driver Deployment on RHEL 8 with Modularity Streams.
Step 1: Add the official Nvidia repository to the package manager's repository list.
# dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
Step 2: Install Kernel-Devel and Headers used by the Drivers
# dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
Step 3: Installing Nvidia Drivers and Settings
# dnf install nvidia-driver nvidia-settings
Step 4: Install CUDA Drivers and Reboot
# dnf install cuda-drivers
Once done, reboot:
# reboot
If, after the reboot, running nvidia-smi returns an error like the following:
# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
You may want to take a look at https://gist.github.com/espoirMur/65cec3d67e0a96e270860c9c276ab9fa. The cause could be the Secure Boot option in your BIOS.
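Whether Secure Boot is active can usually be checked with `mokutil --sb-state`. The sketch below interprets that command's output. It assumes mokutil is installed and prints a line reading `SecureBoot enabled` or `SecureBoot disabled`; `secure_boot_state` is a hypothetical helper name.

```python
def secure_boot_state(mokutil_output):
    """Interpret the text printed by `mokutil --sb-state`.

    Returns True if Secure Boot is reported enabled, False if it is
    reported disabled, and None if the output is not recognised.
    """
    for line in mokutil_output.splitlines():
        text = line.strip().lower()
        if text == "secureboot enabled":
            return True
        if text == "secureboot disabled":
            return False
    return None


def check_secure_boot():
    """Run mokutil and interpret the result (needs mokutil on the PATH)."""
    import subprocess
    result = subprocess.run(["mokutil", "--sb-state"],
                            capture_output=True, text=True)
    return secure_boot_state(result.stdout)
```

If this reports that Secure Boot is enabled, an unsigned NVIDIA kernel module is a plausible reason for the nvidia-smi failure.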
GTC 2022 Keynote with NVIDIA CEO Jensen Huang
Webinar – Cloud-Native Supercomputing Powers New Data Centre Architecture

Computing power is becoming a service. The data center is becoming the new unit of computing, serving unlimited computing resources with high performance, flexibility, and security. The network, as the bridge between computing and storage resources, between data centers, and between the user and the data center, is becoming key to performance and security. The Cloud Native Supercomputing architecture is designed to combine the advantages of the supercomputer and the cloud to provide the best performance in a modern zero-trust environment.
By attending this webinar, you will learn how to:
- Use supercomputing technologies in the data center
- Deliver cloud flexibility with supercomputing technologies to drive the most powerful data center
- Provide a cloud-native supercomputing service in a zero-trust environment
Date: February 23, 2022
Time: 15:00 – 16:00 SGT
Duration: 1 hour
To register: Cloud Native Supercomputing Powers New Data Center Architecture (nvidianews.com)
nvidia-smi – failed to initialize nvml: insufficient permissions
The Error Encountered
If you are a non-root user and you run nvidia-smi, you might see the error:
% nvidia-smi
Failed to initialize NVML: Insufficient Permissions
The default module option NVreg_DeviceFileMode=0660 is set via /etc/modprobe.d/nvidia-default.conf. This gives the NVIDIA device nodes 0660 permissions, so only the owner and the owning group can read and write them.
vim /etc/modprobe.d/nvidia-default.conf
options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=1001 NVreg_DeviceFileMode=0660
The Fix
[user1@node1 dev]$ ls -l nvidia*
crw-rw---- 1 root vglusers 195, 0 Jan 5 17:07 nvidia0
crw-rw---- 1 root vglusers 195, 255 Jan 5 17:07 nvidiactl
crw-rw---- 1 root vglusers 195, 254 Jan 5 17:07 nvidia-modeset
The error occurs because the device nodes are accessible only to root and to members of the vglusers (or video) group. The fix is simply to add the user to that group in /etc/group:
# usermod -a -G vglusers user1
Log off and log in again; you should now be able to run nvidia-smi.
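The permission logic behind this fix can be sketched as follows. This is a simplified model of the POSIX owner/group/other check, not what the driver itself runs. `can_open_device` is a hypothetical helper, and the example assumes that gid 1001 (from NVreg_DeviceFileGID above) is the vglusers group and that user1 has uid 1000.

```python
import stat


def can_open_device(mode, owner_uid, owner_gid, user_uid, user_gids):
    """Return True if the user may read and write a device node.

    Applies the usual POSIX order: owner bits, then group bits, then
    other bits. root (uid 0) is treated as always allowed.
    """
    if user_uid == 0:
        return True
    if user_uid == owner_uid:
        wanted = stat.S_IRUSR | stat.S_IWUSR
    elif owner_gid in user_gids:
        wanted = stat.S_IRGRP | stat.S_IWGRP
    else:
        wanted = stat.S_IROTH | stat.S_IWOTH
    return (mode & wanted) == wanted


# Device node: mode 0660, owner root (uid 0), group 1001 (assumed vglusers).
print(can_open_device(0o660, 0, 1001, 1000, [1000]))        # before usermod
print(can_open_device(0o660, 0, 1001, 1000, [1000, 1001]))  # after usermod
```

On a live system the same inputs could be taken from `os.stat("/dev/nvidiactl")` and `os.getgroups()`. Group changes only take effect in new login sessions, which is why logging off and back in is needed.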