December 1, 2020 by kittycool only

Maximizing Performance for Distributed Machine Learning and Deep Learning with SHARP

The NVIDIA Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) takes advantage of the in-network computing capabilities in the NVIDIA Mellanox Quantum switch, dramatically improving the performance of distributed machine learning workloads.

December 1, 2020 by kittycool only

Listing processes for a specific user

Using top to list the processes of a specific user, which is one of my favourites.

% top -U user1

pstree displays a tree of processes, including parent and child processes, which makes the relationships between them easier to understand.

% pstree -l -a -p -s user1

where
-l : Long lines (do not truncate the output)
-a : Show command line arguments
-p : Show PIDs
-s : Show the parent processes of the selected process

pgrep looks up processes based on name and other attributes.

% pgrep -l -u user1
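
For scripting, a similar listing can be produced without any of the tools above by reading /proc directly. The following is a minimal sketch, assuming a Linux system with GNU stat; list_user_procs is a hypothetical helper name, and the owner of /proc/<PID> normally, though not always, matches the user the process runs as.

```shell
#!/bin/sh
# Print "PID COMMAND" for every process owned by the given user (Linux only).
# list_user_procs is a hypothetical helper name, not a standard tool.
list_user_procs() {
    uid=$(id -u "$1") || return 1    # resolve the user name to a numeric UID
    for dir in /proc/[0-9]*; do
        # The owner of /proc/<PID> is normally the user the process runs as.
        # stat fails if the process has already exited, so skip it in that case.
        owner=$(stat -c %u "$dir" 2>/dev/null) || continue
        [ "$owner" = "$uid" ] || continue
        # /proc/<PID>/comm holds the short command name.
        printf '%s %s\n' "${dir#/proc/}" "$(cat "$dir/comm" 2>/dev/null)"
    done
}

# Example: list the processes of the current user.
list_user_procs "$(id -un)"
```

Replace "$(id -un)" with a user name such as user1 to inspect another account.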

References:

Linux list processes by user names

December 1, 2020 by kittycool only

SC20 TOP500 Birds of a Feather (BoF)

The TOP500 list of supercomputers serves as a “Who’s Who” in the field of high-performance computing (HPC). It started as a list of the most powerful supercomputers in the world and has evolved into a major source of information about trends in HPC. The 56th TOP500 list was published in November 2020, just in time for SC20.

November 30, 2020 by kittycool only

No rule to make target `/usr/include/sgidefs.h', needed by `surf.o'

I was compiling surf, an external program required by VMD, at $VMDHOME/vmd-1.9.4a48/lib/surf

% make depend
make: *** No rule to make target `/usr/include/sgidefs.h', needed by `surf.o'. Stop.

To fix this, install makedepend (on some releases it ships as part of the imake package):

# yum install makedepend

November 28, 2020 by kittycool only

Installing CUDA Python

How to install CUDA Python, followed by a tutorial on how to run a Python example on a GPU.

November 26, 2020 by kittycool only

Digital Scalable multi-node training for AI jobs on NVIDIA DGX, OpenShift and Spectrum Scale

NVIDIA and IBM carried out a complex proof of concept to demonstrate the scaling of AI workloads on NVIDIA DGX, Red Hat OpenShift and IBM Spectrum Scale, using ResNet-50 and the segmentation of images from the Audi A2D2 dataset as examples. The project team published an IBM Redpaper with all the technical details and will present the key learnings and results.

November 25, 2020 by kittycool only

Tutorial on In-Network Computing SHARP Technology for MPI Offloads

In this video from the 2017 HPC Advisory Council Stanford Conference, Devendar Bureddy from Mellanox presents a Tutorial on In-Network Computing SHARP Technology for MPI Offloads.

November 25, 2020 by kittycool only

How to prevent SSH from disconnecting

From Sharcnet HPC

November 23, 2020 by kittycool only

Building the Future Today with HPC

At SC20, Intel’s Trish Damkroger, vice president and general manager of HPC at Intel, shows how Intel and its partners are building the future of HPC today through hardware and software technologies that accelerate the broad deployment of advanced HPC systems. (Credit: Intel Corporation)

November 22, 2020 by kittycool only

NVIDIA SC20 Special Address

The Linux Cluster

Linux Cluster Blog is a collection of how-tos and tutorials for Linux clusters and Enterprise Linux.

Author: kittycool only
