Enterprise Container Management Solutions – SuSE Rancher

What is SuSE Rancher?

Website: https://www.rancher.com/ (by SuSE)

Rancher Labs builds innovative, open source container management solutions for enterprises leveraging containers to accelerate software development and improve IT operations. The flagship product, Rancher, is a complete container management platform that makes it easy to manage all aspects of running containers in development and production environments, on any infrastructure. RancherOS is a minimalist Linux distribution which is perfect for running Docker containers at scale.

View on-demand recordings of past Rancher demos, online meetups, and Kubernetes tutorials on the Rancher Labs YouTube channel.

This guide walks you through the process of adopting an enterprise container management platform (Dummies e-copy).

This guide will help security teams understand the attack surface for Kubernetes deployments and how attackers can exploit vulnerabilities. Get the e-copy of the Ultimate Guide to Kubernetes Security

Could not load the Qt platform plugin “xcb” in “” even though it was found in Rocky Linux 8

I was installing ANSYS 2023 R2 on Rocky Linux 8 when I encountered the error:

qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: xcb.

Solution

# dnf install python3-qt5
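
If installing python3-qt5 alone does not resolve it, Qt can tell you exactly why the xcb plugin fails to load. A quick sketch for digging further (the plugin path and application name below are placeholders; adjust them to wherever your ANSYS installation keeps its Qt platform plugins):

$ export QT_DEBUG_PLUGINS=1            # ask Qt to log the platform-plugin loading process
$ ./your_qt_application                # hypothetical launcher; re-run the failing program
$ ldd /path/to/plugins/platforms/libqxcb.so | grep "not found"    # list any missing shared libraries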

Mounting NTFS on Rocky Linux 8

If you are planning to mount a portable drive formatted with the Windows NTFS file system on Rocky Linux 8, this is what you will see when you issue the mount command right after plugging the drive in:

# mount /dev/sdd1 /data1
mount: /data1: unknown filesystem type 'ntfs'.

Step 1: Enable EPEL Repo

# dnf install epel-release

Step 2: Install NTFS-3g

# dnf install ntfs-3g

According to some blogs written elsewhere, these two packages are enough, but I was still having issues. In my situation, I needed to pull in the full set of NTFS-related packages (five in total).

Step 3: Install all NTFS-3g packages

# dnf install '*ntfs*'

This time it worked for me.

Step 4: Simply mount (Hooray!)

 # mount /dev/sdd1 /data1
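
If you want the drive to be mounted automatically after a reboot, you can also add it to /etc/fstab. A minimal sketch, assuming the same device and mount point as above (replace the placeholder with the UUID reported by blkid):

$ blkid /dev/sdd1                                                 # note the UUID of the NTFS partition
# echo 'UUID=<your-uuid>  /data1  ntfs-3g  defaults  0 0' >> /etc/fstab
# mount -a                                                        # mount everything in /etc/fstab to verify the entry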

Enabling Nvidia Tesla 4 x A100 with NVLink for MPI

I was having issues getting applications like NetKet to detect and enable MPI.

Diagnosis

  1. I have installed Open MPI with CUDA enabled during configuration.
  2. The CUDA libraries, including nvidia-smi, were installed without issue. However, when running nvidia-smi topo --matrix, I could not see any NVLink connections in the topology output.

In fact, when I ran NetKet on CUDA with MPI, the error that was generated was:

mpirun noticed that process rank 0 with PID 0 on node gpu1 exited on signal 11 (Segmentation fault).

Solution

This forum entry provided some enlightenment. https://forums.developer.nvidia.com/t/cuda-initialization-error-on-8x-a100-gpu-hgx-server/250936

The solution was to disable Multi-Instance GPU (MIG) mode, which was enabled by default, and then reboot the server:

# nvidia-smi -mig 0
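
After the reboot, you can confirm that MIG is really off and that the NVLink topology is visible before re-running the MPI job (the field name can be checked against nvidia-smi --help-query-gpu):

$ nvidia-smi --query-gpu=index,mig.mode.current --format=csv      # should report "Disabled" for every GPU
$ nvidia-smi topo --matrix                                        # the NVLink (NVx) entries should now appear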

Enabling Persistence Mode

To make sure this setting stays in place after a reboot, enable and start the nvidia-persistenced service:

# systemctl enable nvidia-persistenced.service
# systemctl start nvidia-persistenced.service
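
A quick way to confirm that persistence mode is active on all GPUs:

$ nvidia-smi --query-gpu=index,persistence_mode --format=csv      # should report "Enabled" for every GPU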

Basic use of nvidia-smi commands

There is a very good article written by Microway on this utility. Take a look at nvidia-smi: Control Your GPUs

What is nvidia-smi?

nvidia-smi is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

Installation

nvidia-smi is installed as part of the NVIDIA driver. Do take a look at the NVIDIA CUDA Installation Guide for Linux for more information.

Query GPU Status

$ nvidia-smi -L

Query overall GPU usage with 1-second update intervals

$ nvidia-smi dmon

Query System/GPU Topology and NVLink

$ nvidia-smi topo --matrix
$ nvidia-smi nvlink --status

Query Details of GPU Cards

$ nvidia-smi -i 0 -q
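
nvidia-smi can also report just the fields you are interested in, which is handy for scripting. A small sketch (see nvidia-smi --help-query-gpu for the full list of fields):

$ nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv -l 5    # refresh every 5 seconds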

Basic Use of Nvidia Data Centre GPU Manager (DCGM)

For more information, take a look at the NVIDIA® Data Center GPU Manager (DCGM) documentation. According to the documentation,

The NVIDIA® Data Center GPU Manager (DCGM) simplifies administration of NVIDIA Datacenter (previously “Tesla”) GPUs in cluster and datacenter environments. At its heart, DCGM is an intelligent, lightweight user space library/agent that performs a variety of functions on each host system:

  • GPU behavior monitoring
  • GPU configuration management
  • GPU policy oversight
  • GPU health and diagnostics
  • GPU accounting and process statistics
  • NVSwitch configuration and monitoring

This functionality is accessible programmatically through public APIs and interactively through CLI tools. It is designed to be run either as a standalone entity or as an embedded library within management tools. This document is intended as an overview of DCGM’s main goals and features and is intended for system administrators, ISV developers, and individual users managing groups of NVIDIA GPUs.

Installation

Assuming you are using a RHEL derivative like Rocky Linux 8, installation is a breeze. For Rocky Linux 8, the $distribution variable in the repo URL below resolves to rhel8:

# distribution=rhel8
# dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-rhel8.repo
# dnf install -y datacenter-gpu-manager

Enable the DCGM systemd service so that it starts on boot, and start it right away (the --now flag already starts it, so the separate start command is optional):

# systemctl --now enable nvidia-dcgm
# systemctl start nvidia-dcgm
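
Before running any dcgmi commands, it is worth confirming that the host engine is actually up:

# systemctl is-active nvidia-dcgm    # should print "active"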

Basic Usage – Discovery

#  dcgmi discovery -l

Basic Usage – Diagnostic

To run a diagnostic test, use the dcgmi diag command and choose the level of diagnostic. For example, for a medium-length test:

# dcgmi diag -r 2

If you want a more comprehensive (and longer) diagnostic, use -r 3:

# dcgmi diag -r 3

Basic Usage – NVLink Status

# dcgmi nvlink -s

Allowing users to bypass PBS-Professional Scheduler to SSH directly into the Compute Node

For some special users, such as administrators, who need to SSH directly into the compute nodes instead of going through the scheduler (and without using root), you may want to do the following:

At the Compute Node

# vim /var/spool/pbs/mom_priv/config

Find the $restrict_user_exceptions line and add the user to be exempted, for example:

$clienthost 192.168.x.x
$clienthost 192.168.y.y
$restrict_user_maxsysid 999
$restrict_user True
$restrict_user_exceptions user1
$usecp *:/home/ /home/

Restart the PBS Service

# /etc/init.d/pbs restart
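
After the restart, a simple sanity check is to SSH into the compute node as the excepted user (user1 in the config above) from one of the $clienthost machines; the node name below is hypothetical:

$ ssh user1@computenode01 hostname    # the session should no longer be terminated by pbs_mom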

Mounting and Unmounting NFS File Systems Using Ansible: Essential Tutorial

You can use Ansible to automate the configuration of NFS client settings.

1. Mount an NFS File System, and configure it in /etc/fstab

Use state: mounted

- name: Mount NFS Share nfs-server:/usr/local
  ansible.posix.mount:
      src: nfs-server:/usr_local
      path: /usr/local
      fstype: nfs
      opts: rw,nconnect=16,nfsvers=3,tcp,hard,intr,timeo=600,retrans=2,rsize=524288,wsize=524288
      state: mounted

2. Unmount an NFS File System, but leave /etc/fstab unmodified

Use state: unmounted

- name: Unmount NFS Share nfs-server:/usr/local
  ansible.posix.mount:
      src: nfs-server:/usr_local
      path: /usr/local
      fstype: nfs
      opts: rw,nconnect=16,nfsvers=3,tcp,hard,intr,timeo=600,retrans=2,rsize=524288,wsize=524288
      state: unmounted

3. Unmount an NFS File System, and remove its settings from /etc/fstab

Use state: absent

- name: Unmount NFS Share nfs-server:/usr/local and remove it from /etc/fstab
  ansible.posix.mount:
      src: nfs-server:/usr_local
      path: /usr/local
      fstype: nfs
      opts: rw,nconnect=16,nfsvers=3,tcp,hard,intr,timeo=600,retrans=2,rsize=524288,wsize=524288
      state: absent

4. Remount an NFS File System, without changing /etc/fstab

Use state: remounted

- name: Remount NFS Share nfs-server:/usr/local
  ansible.posix.mount:
      src: nfs-server:/usr_local
      path: /usr/local
      fstype: nfs
      opts: rw,nconnect=16,nfsvers=3,tcp,hard,intr,timeo=600,retrans=2,rsize=524288,wsize=524288
      state: remounted
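
These tasks need to sit inside a play before they can be run. A minimal sketch of how to apply them, assuming the tasks are saved in a playbook called nfs_mounts.yml and the target hosts are listed in an inventory file called hosts (both names are placeholders):

$ ansible-galaxy collection install ansible.posix    # the mount module ships in the ansible.posix collection
$ ansible-playbook -i hosts nfs_mounts.yml --check   # dry run first to see what would change
$ ansible-playbook -i hosts nfs_mounts.yml           # apply the mounts for real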

References:

  1. ansible.posix.mount module – Control active and configured mount points
  2. Mounting and un-mounting a volume in Ansible

Using Bash alias to keep things simple

Aliases are very useful for creating shorthand names so you can run the command you want without typing the whole thing.

Display all alias names

Typing alias without any arguments will display all defined aliases:

$ alias
alias cp='cp -i'
alias egrep='egrep --color=auto'
alias fgrep='fgrep --color=auto'
alias grep='grep --color=auto'
alias l.='ls -d .* --color=auto'
alias ll='ls -l --color=auto'
alias ls='ls --color=auto'
alias mv='mv -i'
alias rm='rm -i'
alias xzegrep='xzegrep --color=auto'
alias xzfgrep='xzfgrep --color=auto'
alias xzgrep='xzgrep --color=auto'
alias zegrep='zegrep --color=auto'
alias zfgrep='zfgrep --color=auto'
alias zgrep='zgrep --color=auto'

Useful Aliases to consider

alias ls='ls --color=auto'
alias egrep='egrep --color=auto'
alias fgrep='fgrep --color=auto'
alias grep='grep --color=auto'
alias mv='mv -i'
alias rm='rm -i'
alias cp='cp -i'

Remove alias

Just use the unalias command:

unalias ll 
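
Aliases defined on the command line only last for the current shell session. To keep an alias permanently, append it to your ~/.bashrc, for example:

$ echo "alias ll='ls -l --color=auto'" >> ~/.bashrc    # add the alias to your shell start-up file
$ source ~/.bashrc                                     # reload it so the alias takes effect immediately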

Retrieving OpenMPI Configuration

If you need to find out how your Open MPI installation was configured, you may want to use the commands below.

$ ./ompi_info -all | grep 'command line'
 Configure command line: '--prefix=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ompi' '--with-libevent=internal' '--enable-mpi1-compatibility' '--without-xpmem' '--with-cuda=/hpc/local/oss/cuda12.1.1' '--with-slurm' '--with-platform=contrib/platform/mellanox/optimized' '--with-hcoll=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/hcoll' '--with-ucx=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ucx' '--with-ucc=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ucc'

If you wish to look at the full configuration

$ ./ompi_info
Package: Open MPI root@hpc-kernel-03 Distribution
Open MPI: 4.1.5rc2
Open MPI repo revision: v4.1.5rc1-17-gdb10576f40
Open MPI release date: Unreleased developer copy
Open RTE: 4.1.5rc2
Open RTE repo revision: v4.1.5rc1-17-gdb10576f40
Open RTE release date: Unreleased developer copy
OPAL: 4.1.5rc2
OPAL repo revision: v4.1.5rc1-17-gdb10576f40
OPAL release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 4.1.5rc2
Prefix: /usr/local/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ompi
.....
.....
.....
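
Since the build above was configured with --with-cuda, a related check is whether CUDA support was actually compiled in. The usual ompi_info query for this is shown below; a line ending in value:true indicates a CUDA-aware build:

$ ./ompi_info --parsable --all | grep mpi_built_with_cuda_support:value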