Installing CUDA with Ansible for Rocky Linux 8

Installation Guide

You can take a look at the Nvidia CUDA Installation Guide for more information.

Step 1: Get the Nvidia CUDA Repo

You can find the repo file on the Nvidia download site. It should be named cuda_rhel8.repo. Copy it and use it as a template with a .j2 extension.

[cuda-rhel8-x86_64]
name=cuda-rhel8-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/D42D0685.pub
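
Because the repo file is deployed through the template module, the .j2 copy can also be parameterized. A hedged sketch of what cuda-rhel8-repo.j2 could look like if you ever need another architecture (the cuda_arch variable is my own invention, defaulting to the original x86_64; the per-architecture key path is an assumption):

```
[cuda-rhel8-{{ cuda_arch | default('x86_64') }}]
name=cuda-rhel8-{{ cuda_arch | default('x86_64') }}
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/{{ cuda_arch | default('x86_64') }}
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/{{ cuda_arch | default('x86_64') }}/D42D0685.pub
```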

Step 2: Use Ansible to Generate the Repo from the Template

The Ansible task should look like this:

- name: Generate /etc/yum.repos.d/cuda_rhel8.repo
  template:
    src: ../templates/cuda-rhel8-repo.j2
    dest: /etc/yum.repos.d/cuda_rhel8.repo
    owner: root
    group: root
    mode: 0644
  become: true
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"

Step 3: Install the Kernel-Headers and Kernel-Devel

The CUDA Driver requires that the kernel headers and development packages for the running version of the kernel be installed at the time of the driver installation, as well as whenever the driver is rebuilt.

- name: Install Kernel-Headers and Kernel-Devel
  dnf:
    name:
      - kernel-devel
      - kernel-headers
    state: present
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
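
If the repos carry a newer kernel-devel than the kernel actually running, the driver build can fail against mismatched headers. A hedged variant that pins the packages to the running kernel, using the standard ansible_kernel fact:

```
- name: Install Kernel-Headers and Kernel-Devel matching the running kernel
  dnf:
    name:
      - "kernel-devel-{{ ansible_kernel }}"
      - "kernel-headers-{{ ansible_kernel }}"
    state: present
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
```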

Step 4: Disabling Nouveau

To install the Display Driver, the Nouveau drivers must first be disabled. I use a template to disable them, called blacklist-nouveau-conf.j2. Here is the content:

blacklist nouveau
options nouveau modeset=0

The Ansible task for disabling Nouveau using the template. (Note that on some systems the blacklist only takes effect at boot after regenerating the initramfs, e.g. with sudo dracut --force.)

- name: Generate blacklist nouveau
  template:
    src: ../templates/blacklist-nouveau-conf.j2
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    owner: root
    group: root
    mode: 0644
  become: true
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"

Step 5: Install the Drivers and CUDA

- name: Install driver packages RHEL 8 and newer
  dnf:
    name: '@nvidia-driver:latest-dkms'
    state: present
    update_cache: yes
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
  register: install_driver

- name: Install CUDA
  dnf:
    name: cuda
    state: present
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
  register: install_cuda

Step 6: Reboot if there are changes to Drivers and CUDA

- name: Reboot if there are changes to Drivers or CUDA
  ansible.builtin.reboot:
  when:
    - install_driver.changed or install_cuda.changed
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"

Aftermath

After the reboot, try running the nvidia-smi command; you should see a table with the driver version, CUDA version, and the detected GPUs.

If you get the error “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver”, follow the steps in the section NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver for RHEL 8 below and run the Ansible script given there.

You may also combine all these YAML snippets into one playbook.
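
A minimal sketch of what that combined playbook could look like. The file name, play header, and host group are my own assumptions; the tasks are the ones from the steps above, with the per-task RedHat guards dropped since the play targets the Rocky 8 nodes directly (the Nouveau blacklist step is elided for brevity):

```
# cuda.yml -- hypothetical combined playbook
- hosts: gpu_nodes          # assumed inventory group of Rocky 8 GPU nodes
  become: true
  tasks:
    - name: Generate /etc/yum.repos.d/cuda_rhel8.repo
      template:
        src: ../templates/cuda-rhel8-repo.j2
        dest: /etc/yum.repos.d/cuda_rhel8.repo
        owner: root
        group: root
        mode: 0644
    - name: Install Kernel-Headers and Kernel-Devel
      dnf:
        name:
          - kernel-devel
          - kernel-headers
        state: present
    - name: Install driver packages RHEL 8 and newer
      dnf:
        name: '@nvidia-driver:latest-dkms'
        state: present
        update_cache: yes
      register: install_driver
    - name: Install CUDA
      dnf:
        name: cuda
        state: present
      register: install_cuda
    - name: Reboot if there are changes to Drivers or CUDA
      ansible.builtin.reboot:
      when: install_driver.changed or install_cuda.changed
```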

Other Ansible Scripts

You may also want to consider other options, such as https://github.com/NVIDIA/ansible-role-nvidia-docker

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver for RHEL 8

This section assumes you have installed the CUDA Drivers and CUDA SDK using the NVIDIA CUDA Installation Guide for Linux; look at Section 3.3.3 for RHEL 8 / Rocky 9.

If you are still facing issues after following those instructions, you may want to consider the following:

1- Blacklist nouveau.conf

$ vim /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

2- Remove Nvidia driver installation

# dnf module remove --all nvidia-driver

3- Remove CUDA-Related Installation

sudo dnf remove "cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
 "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*"

4- Reboot

# shutdown -r now

References:

  1. Forum – CentOS Stream 8: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver

Guide to Creating Symbolic Links with Ansible

You can use the ansible.builtin.file module. I have used the same approach to link the Environment Modules profile.csh and profile.sh into /etc/profile.d so that they load on startup; the example below does it for CUDA. Do take a look at the Ansible documentation: ansible.builtin.file module – Manage files and file properties. (Since state: link is idempotent, the stat check below is only a belt-and-braces guard.)

- name: Check for CUDA Link
  stat:
    path: /usr/local/cuda
  register: link_available

- name: Create a symbolic link for CUDA
  ansible.builtin.file:
    src: /usr/local/cuda-12.2
    dest: /usr/local/cuda
    owner: root
    group: root
    state: link
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
    - not link_available.stat.exists
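
Once the play has run, you can confirm the link on the target host with readlink. A minimal sketch, using hypothetical stand-in paths under /tmp instead of /usr/local:

```shell
# Stand-in for /usr/local (hypothetical demo paths)
mkdir -p /tmp/cuda-demo/cuda-12.2
# Create the same kind of link the Ansible task manages: cuda -> cuda-12.2
ln -sfn cuda-12.2 /tmp/cuda-demo/cuda
# Show where the link points
readlink /tmp/cuda-demo/cuda   # prints: cuda-12.2
```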

Compiling VASP.6.3.0 using Nvidia hpcx, gcc-12.3 and MKL on Rocky Linux 8.7

For information, do take a look at VASP – Install VASP.6.X.X

VASP supports several compilers, but we will focus on Nvidia hpcx only for this blog. To set up Nvidia hpcx, do take a look at Installing and using Mellanox HPC-X Software Toolkit

To use Nvidia hpcx, add its modulefiles with module use; module avail should then list them:

export HPCX_HOME=/usr/local/hpcx-v2.15-gcc-MLNX_OFED_LINUX-5-redhat8-cuda12-gdrcopy2-nccl2.17-x86_64
module use $HPCX_HOME/modulefiles
----------------- /usr/local/hpcx-v2.15 --------------------------------------------
hpcx  hpcx-debug  hpcx-debug-ompi  hpcx-mt  hpcx-mt-ompi  hpcx-ompi  hpcx-prof  hpcx-prof-ompi  hpcx-stack

You can untar vasp.6.3.0 and potpaw_PBE.54:

% tar -xvf vasp.6.3.0.tar
% tar -xvf potpaw_PBE.54.tar 

At the base of the vasp.6.3.0 installation directory:

% cp arch/makefile.include.gnu_ompi_mkl_omp ./makefile.include

You will need at least GNU GCC-10 for a successful compile; I compiled with GCC-12.3. You may want to take a look at Compiling GCC 12.1.0 on Rocky Linux 8.5

If you are using the OneAPI Intel MKL, you can make it available with module use after it is installed. Its installation is not covered in this write-up, but you can:

% module use /usr/local/intel/2023.1/modulefiles

Finally,

% module load mkl/latest
% module load gnu/gcc-12.3
% module load hpcx
% make veryclean
% make DEPS=1 -j
.....
.....
make[2]: Leaving directory '/usr/local/vasp/vasp.6.3.0/build/std'
make[1]: Leaving directory '/usr/local/vasp/vasp.6.3.0/build/std'

In the bin directory, you should see vasp_gam, vasp_ncl and vasp_std.

Using the Ansible Expect Module to Execute a Command and Respond to Prompts

The Ansible Expect module is very useful for listening for certain strings on stdout and responding accordingly. This is particularly useful if you have to accept a license agreement or enter some important information. Do take a look at the Ansible documentation for ansible.builtin.expect; note that the module requires the pexpect Python package on the target host. Here is my sample:

- name: Install RPM package from local system
  yum:
    name: /tmp/my-software.rpm
    state: present
    disable_gpg_check: true
  when: ansible_os_family == "RedHat"

- name: Check whether the mysoftware directory exists
  ansible.builtin.stat:
    path: /usr/local/mysoftware
  register: directory_check

- name: Setup Licensing Server's Connection if directory does not exist
  ansible.builtin.expect:
    command: /usr/local/mysoftware/install.sh
    # responses keys are regular expressions, so ?, (, ), [ and ] are escaped
    responses:
      '(?i)Do you already have a license server on your network\? \[y/N\]': 'y'
      '(?i)Enter the name \(or IP address\) of your license server': 'xx.xx.xx.xx'
      '(?i)Install/update the MySoftware web service\? \[Y/n\]': 'n'
  when: not directory_check.stat.exists

Compiling OpenMPI-4.1.5 for ROCEv2 with GNU-8.5

https://docs.open-mpi.org/en/v5.0.x/release-notes/networks.html

Prerequisites 1

First things first: you may want to check whether you are using RoCE. Do take a look at Installing RoCE using Mellanox (Nvidia) OFED package

Prerequisites 2

Do check whether you have UCX. You can do a dnf install:

# dnf install ucx ucx-devel

Alternatively, you can do a manual install. For information on how to install, do take a look at http://openucx.org/wp-content/uploads/UCX_install_guide.pdf

$ wget https://github.com/openucx/ucx/releases/download/v1.4.0/ucx-1.4.0.tar.gz
$ tar xzf ucx-1.4.0.tar.gz
$ cd ucx-1.4.0
$ ./contrib/configure-release --prefix=/usr/local/ucx-1.4.0
$ make -j8
$ make install

Prerequisites 3

Make sure you have installed the GNU C and C++ compilers. This can be done easily using dnf:

# dnf install gcc-c++ gcc

Step 1: Download the OpenMPI package

You can go to OpenMPI to download the latest package at https://www.open-mpi.org/software/ompi/v4.1/. The latest one at the point of writing is OpenMPI-4.1.5.

Step 2: Compile the Package

$ ./configure --prefix=/usr/local/openmpi-4.1.5 --enable-mpi-cxx --with-devel-headers --with-ucx --with-verbs --with-slurm=no
$ make && make install

Step 3: To run the MPIRUN using ROCE, do the following.

You may want to see Network Support Information on OpenMPI

$ mpirun --np 12 --hostfile path/to/hostfile --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ........
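
The hostfile passed via --hostfile is a plain text file with one node per line. A minimal sketch (node names and slot counts are my own placeholders):

```
# path/to/hostfile -- "slots" is the number of processes to place on that node
node1 slots=8
node2 slots=8
```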

References:

  1. Setting up a RoCE cluster
  2. OpenMPI – Network Support
  3. How do I run Open MPI over RoCE? (UCX PML)

Installing RoCE using Mellanox (Nvidia) OFED package

Prerequisites:

Do read Basic Understanding RoCE and Infiniband

Step 1: Install Mellanox Package

First and foremost, you have to install the Mellanox package, which you can download at https://developer.nvidia.com/networking/ethernet-software. You may want to consider installing using the traditional method or the Ansible method (Installing Mellanox OFED (mlnx_ofed) packages using Ansible)

Step 2: Load the Drivers

Activate the two kernel modules that are needed for RDMA and RoCE exchanges by using the command:

# modprobe rdma_cm ib_umad
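
modprobe only loads the modules for the current boot. To have them loaded automatically at every boot, one option is a drop-in file under /etc/modules-load.d (the file name roce.conf is my own choice):

```
# /etc/modules-load.d/roce.conf
rdma_cm
ib_umad
```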

Step 3: Verify the drivers are loaded

# ibv_devinfo

Step 4: Set the RoCE to version 2

Set the version of the RoCE protocol to v2 by issuing the command below, where:

  • -d is the device
  • -p is the port
  • -m is the version of RoCE

[root@node1]# cma_roce_mode -d mlx5_0 -p 1 -m 2
RoCE v2

Step 5: Check which RoCE devices are enabled on the Ethernet

[root@node-1]# ibdev2netdev
mlx5_0 port 1 ==> ens1f0 (Up)
mlx5_1 port 1 ==> ens1f1 (Down)

References:

  1. Setting up a RoCE cluster

Basic Understanding RoCE and Infiniband

Prerequisites:

  1. RoCE requires compliant Ethernet adapters. Currently, I am using Mellanox ConnectX-6 cards.
  2. RoCE requires a compliant switch. I used a Mellanox 100G switch.

The difference between traditional Ethernet communication and RoCE is explained very clearly in the diagram from Huawei’s Basic Knowledge and Differences of RoCE, IB, and TCP Networks

Some Key Pointers on the difference between TCP/IP and RDMA

  1. Traditional TCP/IP network communication uses the kernel to send messages, which incurs high data-movement and data-replication overhead.
  2. RDMA can bypass the kernel and access memory directly, which allows low-latency network communication.

The three types of RDMA network technologies are neatly presented in Basic Knowledge and Differences of RoCE, IB, and TCP Networks

References:

  1. Basic Knowledge and Differences of RoCE, IB, and TCP Networks

Compiling glibc-2.29 at CentOS-7

Step 1: Download the glibc

Download glibc-2.29 from https://ftp.gnu.org/gnu/glibc/

Step 2: Extract the Source and Create a Build Directory

# tar zxvf glibc-2.29.tar.gz
# cd glibc-2.29
# mkdir build
# cd build

Step 3: Configure, Compile and Install

# ../configure --prefix=/usr/local/glibc-2.29
# make -j8
# make install

Step 4: Errors encountered

.....
checking version of ld... 2.27, ok
checking for gnumake... no
checking for gmake... gmake
checking version of gmake... 3.82, bad
checking for gnumsgfmt... no
checking for gmsgfmt... no
checking for msgfmt... msgfmt
checking version of msgfmt... 0.19.8.1, ok
.....

Step 5: You Might Need a Newer Version of GNU Make to Resolve the Issue

Download make-4.2.1 from https://ftp.gnu.org/gnu/make/

Compiling make is very simple:

# tar -zxvf make-4.2.1.tar.gz
# cd make-4.2.1
# ./configure --prefix=/usr/local/make-4.2.1
# make && make install

Step 6: Update the $PATH & $LD_LIBRARY_PATH

Prepend the new make so it is found ahead of the system version:

# export PATH=/usr/local/make-4.2.1/bin:$PATH
# export LD_LIBRARY_PATH=/usr/local/make-4.2.1/lib:$LD_LIBRARY_PATH

Step 7: Repeat Step 3

Red Hat has decided to stop making the source code of RHEL available to the public.

Important News: for All RedHat Derivative Users

Red Hat has decided to stop making the source code of RHEL available to the public. From now on it will only be available to customers — who can’t legally share it.

A superficially modest blog post from a senior Hatter announces that going forward, the company will only publish the source code of its CentOS Stream product to the world. In other words, only paying customers will be able to obtain the source code to Red Hat Enterprise Linux… And under the terms of their contracts with the Hat, that means that they can’t publish it.

The Register Red Hat strikes a crushing blow against RHEL downstreams