CUDA driver version is insufficient for CUDA runtime version

When you run “/usr/local/cuda-10.1/extras/demo_suite/deviceQuery”, you might get the error shown above:

[root@node1 ~]# /usr/local/cuda-10.1/extras/demo_suite/deviceQuery
/usr/local/cuda-10.1/extras/demo_suite/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

This error can be confusing. In this case it was not the libraries but the power setting in the BIOS. Most servers are configured with a balanced power profile, but for GPGPU work you need to set the power profile to “Maximum Performance”. On an HPE server, for example, choose “Static High Performance Mode”.
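When checking several nodes after changing the BIOS setting, it can help to classify the deviceQuery output in a script. The following is a minimal sketch and not part of the original fix: the function name classify_device_query is hypothetical, and it only matches the strings shown in the output above plus the “Result = PASS” line that deviceQuery prints on success.

```shell
#!/bin/sh
# Hypothetical helper: classify deviceQuery output read from stdin.
# Prints "driver-mismatch" if the runtime reported error 35,
# "pass" if a "Result = PASS" line is present, and "fail" otherwise.
classify_device_query() {
    awk '
        /cudaGetDeviceCount returned 35/ { mismatch = 1 }
        /^Result = PASS/                 { pass = 1 }
        END {
            if (mismatch)  print "driver-mismatch"
            else if (pass) print "pass"
            else           print "fail"
        }'
}

# Usage (path taken from the example above):
#   /usr/local/cuda-10.1/extras/demo_suite/deviceQuery | classify_device_query
```

If the result is still “driver-mismatch” after the power profile change, the installed driver and CUDA runtime versions are the next thing to compare.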

How to unmount NFS mount that fails to unmount with ‘device is busy’

If you have tried to unmount an NFS mount with commands like the following and still get ‘device is busy’:

# mount -t nfs -o remount /mnt/nfs 
# umount /mnt/nfs 
# umount -f /mnt/nfs 
# umount -l /mnt/nfs 
# umount -lf /mnt/nfs

Identify which processes are tied to the mount and need to be killed, using lsof or fuser:

# lsof | grep /mnt/nfs

The lsof command above shows the PIDs of the processes associated with the /mnt/nfs share. Kill any processes holding the stale mount.
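To pull just the PIDs out of the lsof listing, something like the following sketch could be used. The function name pids_on_mount is hypothetical, and the sketch assumes the default lsof column layout (PID in the second column, path in the last column) and paths without spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the unique PIDs of processes whose open
# files are the mount point itself or live underneath it.
# Reads lsof-style output on stdin; $1 is the mount point.
pids_on_mount() {
    awk -v mp="$1" '
        # Skip the header (non-numeric PID) and match the path exactly
        # or as a prefix followed by "/", so /mnt/nfs2 is not matched.
        $2 ~ /^[0-9]+$/ && ($NF == mp || index($NF, mp "/") == 1) { print $2 }
    ' | sort -un
}

# Usage:
#   lsof | pids_on_mount /mnt/nfs
# Alternative using fuser (lists processes using the mounted filesystem):
#   fuser -vm /mnt/nfs
```

Review the PID list before killing anything, since a login shell sitting in the mounted directory will also appear.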

Try to force the unmount again after the processes have been killed:

# umount -lf /mnt/nfs

References:

  1. How to unmount a stale NFS mount that fails to unmount with ‘device is busy’ after network disconnectivity?

How AI Is Reshaping HPC And What This Means For Data Center Architects

In quarterly earnings reports this year, the CEO and founder of NVIDIA (a Liqid partner) noted that its recent advancements in delivering its new compute platform designed with AI in mind and its acquisition of a leading networking company this year are all designed to achieve the central goal of advancing what is increasingly known as data center-scale computing. For providers of high-performance computing solutions, both those built around NVIDIA’s tech and those that are competing with the GPU goliath, this need for data center-scale computing has been defined by and escalated alongside the data performance requirements of artificial intelligence and machine learning (AI+ML), something I discuss further in a recent article.

https://www.forbes.com/sites/forbestechcouncil/2021/01/19/how-ai-is-reshaping-hpc-and-what-this-means-for-data-center-architects/?sh=3dec4e4d7371

How to train a robot (using AI and supercomputers)

From Science Daily

Computer scientists developed a deep learning method to create realistic objects for virtual environments that can be used to train robots. The researchers used TACC’s Maverick2 supercomputer to train the generative adversarial network. The network is the first that can produce colored point clouds with fine details at multiple resolutions.

https://www.sciencedaily.com/releases/2021/01/210119194329.htm