Nvidia-smi slow startup fix

If you encounter slow nvidia-smi before the information is shown. For my 8 x A40 Cards, it took about 26 seconds to initialise.

The reason for slow initialization might be due to the driver persistence issue. For more background on the issue, do take a look at Nvidia Driver Persistence. According to the article,

The NVIDIA GPU driver has historically followed Unix design philosophies by only initializing software and hardware state when the user has configured the system to do so. Traditionally, this configuration was done via the X Server and the GPUs were only initialized when the X Server (on behalf of the user) requested that they be enabled. This is very important for the ability to reconfigure the GPUs without a reboot (for example, changing SLI mode or bus settings, especially in the AGP days).

More recently, this has proven to be a problem within compute-only environments, where X is not used and the GPUs are accessed via transient instantiations of the Cuda library. This results in the GPU state being initialized and deinitialized more often than the user truly wants and leads to long load times for each Cuda job, on the order of seconds.

NVIDIA previously provided Persistence Mode to solve this issue. This is a kernel-level solution that can be configured using nvidia-smi. This approach would prevent the kernel module from fully unloading software and hardware state when no user software was using the GPU. However, this approach creates subtle interaction problems with the rest of the system that have made maintenance difficult.

The purpose of the NVIDIA Persistence Daemon is to replace this kernel-level solution with a more robust user-space solution. This enables compute-only environments to more closely resemble the historically typical graphics environments that the NVIDIA GPU driver was designed around.

Nvidia Driver Persistence

The Solution is very easy. Just start and enable nvidia-persistenced

# systemctl enable nvidia-persistenced
# systemctl start nvidia-persistenced

Immediately, the nvidia-smi command becomes more responsive

Errors with sirius-7.5.2 packages during installation

If you are installing based on Installing CP2K with Nvidia HPCX on Rocky Linux 8.5 and if you encounter the issue

.....
Installing from scratch into /usr/local/software/cp2k/tools/toolchain/install/sirius-7.5.2
for (auto it : unit_cell_.spl_num_paw_atoms()) {
^
/usr/local/software/cp2k/tools/toolchain/build/SIRIUS-7.5.2/src/potential/potential.hpp:710:9: error: expected iteration declaration or initialization
for (auto it : unit_cell_.spl_num_paw_atoms()) {
^~~
/usr/local/software/cp2k/tools/toolchain/build/SIRIUS-7.5.2/src/potential/potential.hpp:717:5: warning: no return statement in function returning non-void [-Wreturn-type]
}
.....

The reason is that the sirius package is installed by default. The issue can be issued if you put the parameters “–with-sirius=no”

% cd cp2k
% cd /usr/local/software/cp2k/tools/toolchain
% ./install_cp2k_toolchain.sh --no-check-certificate --with-openmpi --with-sirius=no

/usr/bin/ld: cannot find -liberty on Rocky Linux 8

I was trying to install CP2K on Rocky Linux 8 using the Installing CP2K with Nvidia HPCX on Rocky Linux 8.5 and I encountered an issue

libtool: warning: library '/usr/local/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-x86_64/ucc/lib/libucc.la' was moved.
/usr/bin/ld: cannot find -liberty
collect2: error: ld returned 1 exit status

The Resolution can be done by installing binutils-devel

# dnf install binutils-devel

How to disable vulnerable ciphers and encryption modes in Rocky Linux 8 to mitigate Terrapin Attack (CVE-2023-48795)

For more information on the Terrapin Attack (CVE-2023-48795), do take a look at Terrapin Attack (CVE-2023-48795): SSH Protocol Impacted.

As mentioned, in the blog entry, Terrapin Attack (CVE-2023-48795): SSH Protocol Impacted, the attack is possible only if you use vulnerable ciphers and encryption modes: ChaCha20-Poly1305, CTR-EtM, CBC-EtM. Note that the cyphers and the encryption modes themselves are not vulnerable, but their input (sequence number) can be manipulated by the attacker.

The mitigation is similar to How to disable CBC Mode Ciphers in RHEL 8 or Rocky Linux 8 except that you have to remove the “chacha20-poly1305@openssh.com” besides the CBC Mode Ciphers.

Step 1: Edit /etc/sysconfig/sshd and uncomment CRYPTO_POLICY line:

CRYPTO_POLICY=

Edit /etc/ssh/sshd_config file. Add Ciphers, MACs and KexAlgorithms have been added

KexAlgorithms curve25519-sha256@libssh.org,ecdh-sha2-nistp521,ecdh-sha2-nistp384,ecdh-sha2-nistp256,diffie-hellman-group-exchange-sha256

Ciphers aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes192-ctr,aes128-ctr
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-512,hmac-sha2-256,umac-128@openssh.com

After making changes to the configuration file, you may want to do a sanity check on the configuration file

# sshd -t

Restart sshd services

# systemctl restart sshd

To test if weak CBC Ciphers and  ChaCha20-Poly1305 are enabled

$ ssh -vv -oCiphers=chacha20-poly1305@openssh.com,3des-cbc,aes128-cbc,aes192-cbc,aes256-cbc IP-Address-of-your-Server

You should receive a similar message

Unable to negotiate with 172.21.33.11 port 22: no matching cipher found. Their offer: aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes192-ctr,aes128-ctr

To verify that the Terrapin Attack Vulnerability is mitigated, take a look at Vulnerability Scanner. Pre-built binaries for all major platforms and the source code are available on GitHub.

./Terrapin_Scanner_Linux_amd64 -connect XXX.XXX.XXX.XXX

If you are not vulnerable, you may have an output like this.

References:

  1. Terrapin Attack (CVE-2023-48795): SSH Protocol Impacted
  2. Terrapin Attack
  3. Terrapin (CVE-2023-48795): New Attack Impacts the SSH Protocol
  4. SSH protocol flaw – Terrapin Attack CVE-2023-48795: All you need to know

Terrapin Attack (CVE-2023-48795): SSH Protocol Impacted

What is Terrapin Attack (CVE-2023-48795)?

Researchers from Ruhr University announced the discovery of new vulnerabilities impacting the SSH Protocol. Detailed Information of the Terrapin Attack can be found at Terrapin Attack.

According to FOSSA Terrapin (CVE-2023-48795): New Attack Impacts the SSH Protocol

Terrapin is a man-in-the-middle attack; the flaw allows an attacker to corrupt data being transmitted. This can result in a loss of information or bypass critical security controls such as keystroke timing protections or SHA-2 cryptographic hash requirements, allowing the threat actor to downgrade to SHA-1. Doing so opens up the possibility of other attacks on downstream applications, components, or environments that use SSH. ​​These associated vulnerabilities have been assigned CVE-2023-46445 (Rogue Extension Negotiation Attack in AsyncSSH) and CVE-2023-46446 (Rogue Session Attack in AsyncSSH).

Terrapin (CVE-2023-48795): New Attack Impacts the SSH Protocol

How do I know that I am vulnerable?

The attack is possible only if you use vulnerable ciphers and encryption modes: ChaCha20-Poly1305, CTR-EtM, CBC-EtM. Note that the cyphers and the encryption modes themselves are not vulnerable, but their input (sequence number) can be manipulated by the attacker.

How do I mitigate the attack?

To mitigate the attack, either you upgrade OpenSSH to their latest version 9.6 or disable the affected ciphers and encryption modes.

Vulnerability Impacts

According to FOSSA Terrapin (CVE-2023-48795): New Attack Impacts the SSH Protocol, their assessment of the Attack are:

  • Limited Impacts: Terrapin can delete consecutive portions of encrypted messages, which in isolation will typically result in a stuck connection. Some of the most serious impacts identified are in downstream applications implementing SSH, such as AsyncSSH. An attacker may be able to disable certain keylogging obfuscation features, enabling them to conduct a keylogging attack; or, worst case, a threat actor can sign a victim’s client into another account without the victim noticing, enabling phishing attacks.
  • Difficult to Expliot: An active man-in-the-middle attacker and specific encryption modes are prerequisites for the exploit. Intercepting SSH traffic requires a detailed understanding of a target’s environment, limiting real-world applicability.

How do I check?

You may want to explore the vulnerablilty tool published by the Ruhr University Researchers:

For more information, do take look at Vulnerability Scanner. Pre-built binaries for all major platforms and the source code are available on GitHub.

Usage is very simple, after downloading the relevant binary, just use the command

./Terrapin_Scanner_Linux_amd64 -connect XXX.XXX.XXX.XXX

If you are not vulnerable, you may have a output something like this.

If you are vulnerable, the scanner will flag as expected.

References:

  1. Terrapin Attack
  2. Terrapin (CVE-2023-48795): New Attack Impacts the SSH Protocol
  3. SSH protocol flaw – Terrapin Attack CVE-2023-48795: All you need to know

Troubleshooting the PBS Control System and PBS Server

I was having this issue after I submitted a job. This was due to some configuration I had to do to improve security which is similar to Using the Host’s FirewallD as the Main Firewall to Secure Docker

qsub: Budget Manager: License is unverified. AM is not handling requests

To resolve the issue, I took the following Steps. On the PBS-Control Server,

Step 1: Export the Path of the AM Database.

export PATH=/opt/am/postgres/bin:$PATH

Step 2: Check that the Docker Container Services are started in the System. You may want to start the dockers to capture any errors. If the docker is not able to start up, it is likely due to the firewall settings.

# systemctl status firewalld.service.

Step 3: I restarted the PBS Altair Service

# systemctl restart altaircontrol.service

Step 4: I use the Docker Command to return an overview of all running containers

# docker ps 

At the PBS-Server, Restart the AM Control Register is working

# /opt/am/libexec/am_control_register

To Test, Submit an Interactive Job with the correct Project Code, it should work.

Using the Host’s FirewallD as the Main Firewall to Secure Docker

Found a rare article How to Secure a Docker Host Using Firewalld that teaches how to address the issue when that docker bypasses the FirewallD rules.

According to the Article, the goal of the Configuration is to

  • The firewall rules should count for whole host system – so including Docker containers with port mappings
  • A Docker container should be accessible from the internet if and only if the host port used in Docker container port mapping is allowed in the firewall
  • The approach should not break container networking

Do read up and you will be glad that this article was written for Administrators like us. Another Reference you may want to consider reading is Why Docker and Firewall don’t get along with each other!

Automating Security Patch Logs and MS-Team Notifications with Ansible on Rocky Linux 8

If you have read the blog entry Using Ansible to automate Security Patch on Rocky Linux 8, you may want to consider capturing the logs and send notification to MS-Team if you are using that as a Communication Channel. This is a follow-up to that blog.

Please look at Part 1: Using Ansible to automate Security Patch on Rocky Linux 8

Writing logs (Option 1: Ansible Command used if just checking)

Recall that in Option 1: Ansible Command used if just checking, Part 1a & Part 1b, you can consider writing to logs in /var/log/ansible_logs

- name: Create a directory if it does not exist
file:
path: /var/log/ansible_logs
state: directory
mode: '0755'
owner: root
when:
- ansible_os_family == "RedHat"
- ansible_distribution_major_version == "8"



- name: Copy Results to file
ansible.builtin.copy:
content: "{{ register_output_security.results | map(attribute='name') | list }}"
dest: /var/log/ansible_logs/patch-list_{{ansible_date_time.date}}.log
changed_when: false
when:
- ansible_os_family == "RedHat"
- ansible_distribution_major_version == "8"

Notification (Option 1: Ansible Command used if just checking)

You can write to MS Team to provide a short notification to let the Engineers knows that the logs has been written to /var/log/ansible_logs

- name: Send a notification to MS-Teams that Test Run (No Patching) is completed
run_once: true
uri:
url: "https://xxxxxxx.webhook.office.com/webhookb2/xxxxxxxxxxxxxxxxxxxxxxxxx"
method: POST
body_format: json
body:
title: "Test Patch Run on {{ansible_date_time.date}}"
text: "Test Run only. System has not been Patched Yet. Logs saved at: /var/log/ansible_logs/patch-list_{{ansible_date_time.date}}.log"
when:
- register_update_success is defined
- ext_permit_flag == "no"

Writing to MS-Team to capture the success Or failure of the Update (Option 2: Ansible Command used when ready for Patching)

- name: Send a notification to MS-Teams Channel if Upgrade failed
run_once: true
uri:
url: "https://xxxxx.webhook.office.com/webhookb2/xxxxxx"
method: POST
body_format: json
body:
title: "Patch Run on {{ansible_date_time.date}}"
text: "Patch Update has Failed"
when:
- register_update_success is not defined
- ext_permit_flag == "yes"



- name: Send a notification to MS-Teams Channel if Upgrade failed
run_once: true
uri:
url: "https://entuedu.webhook.office.com/webhookb2/xxxxxx"
method: POST
body_format: json
body:
title: "Patch Run on {{ansible_date_time.date}}"
text: "Patch Update is Successful. Logs saved at: /var/log/ansible_logs/patch-list_{{ansible_date_time.date}}.log"
when:
- register_update_success is defined
- ext_permit_flag == "yes"

Automating Linux Patching with Ansible: A Simple Guide

If you intend to use Ansible to patch the Server, you may need to use an external variable to decide whether you wish to take a look at the list or actually patch the OS. It consists of 3 parts.

Option 1: Ansible Command used if just checking

$ ansible-playbook security.yml --extra-vars "ext_permit_flag=no"

Part 1a: Get the List of Packages from DNF to be upgraded ONLY using the External Permit Flag = “no”

- name: Get the list of Packages from DNF to be upgraded (ext_permit_flag == "no")
dnf:
security: yes
bugfix: false
state: latest
update_cache: yes
list: updates
exclude: 'kernel*'
register: register_output_security
when:
- ansible_os_family == "RedHat"
- ansible_distribution_major_version == "8"
- ext_permit_flag == "no"

Part 1b: Report the List of Packages from DNF to be upgraded ONLY using the External Permit Flag = “no”

- name: Report the List of Packages from DNF to be upgraded ( ext_permit_flag == no")
debug:
msg: "{{ register_output_security.results | map(attribute='name') | list }}"
when:
- ansible_os_family == "RedHat"
- ansible_distribution_major_version == "8"
- ext_permit_flag == "no"

Option 2: Ansible Command used when ready for Patching

$ ansible-playbook security.yml --extra-vars "ext_permit_flag=yes"

Part 2: Patch all the packages except Kernel

- name: Patch all the packages except Kernel

dnf:
name: '*'
security: yes
bugfix: false
state: latest
update_cache: yes
update_only: no
exclude: 'kernel*'
register: register_update_success
when:
- ansible_os_family == "RedHat"
- ansible_distribution_major_version == "8"
- ext_permit_flag == "yes"

- name: Print Errors if upgrade failed
debug:
msg: "Patch Update Failed"
when: register_update_success is not defined

Reference:

  1. ansible.builtin.dnf module – Manages packages with the dnf package manager
  2. Automating Linux patching with Ansible

Change Runlevel of Rocky Linux 8 without rebooting

You may use the command “telinit” to change the SysV system runlevel without the need to reboot. You will need root access to run the command

Use 1: Single User Mode

# telinit S

Use 2: To go to Graphical.Target

# telinit 5

Use 3: To go to Multi-User.Target

# telinit 3

Use 4: To Reload daemon configuration. This is equivalent to systemctl daemon-reload.

# telinit q

To verify, you can use the command systemctl get-default

# systemctl get-default
graphical.target