What is Intel Cluster Checker?
Intel® Cluster Checker provides tools to collect data from a cluster, analyze the collected data, and produce a clear report of the analysis. Using Intel® Cluster Checker helps you quickly identify issues and improve resource utilization.
Intel® Cluster Checker verifies the configuration and performance of Linux®-based clusters through analysis of cluster uniformity, performance characteristics, functionality and compliance with Intel® High Performance Computing (HPC) specifications. Data collection tools and analysis provide actionable remedies to identified issues. Intel® Cluster Checker tools and analysis are ideal for use by developers, administrators, architects, and users to easily identify issues within a cluster.
Installing Intel Cluster Checker Using Yum Repository
If you are installing with Yum, take a look at Intel Cluster Checker 2019 Installation.
If not, and you have the tar.gz package, you can untar it instead.
Environment Setup
# source /usr/local/intel/2018u3/bin/compilervars.sh intel64
# source /usr/local/intel/2018u3/mkl/bin/mklvars.sh intel64
# source /usr/local/intel/2018u3/impi/2018.3.222/bin64/mpivars.sh intel64
# source /usr/local/intel/2018u3/parallel_studio_xe_2018/bin/psxevars.sh intel64
# export MPI_ROOT=/usr/local/intel/2018u3/impi/2018.3.222/intel64
# source /usr/local/intel/cc2019/clck/2019.10/bin/clckvars.sh
Create a nodefile and list the hosts in it, one per line
% vim nodefile
node1
node2
node3
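For clusters with sequentially numbered hosts, the nodefile can be generated rather than typed by hand. The following is a minimal sketch, assuming hostnames of the form prefix plus a three-digit number (as in the node003 to node010 names seen in the sample run). The helper name `make_nodefile` is hypothetical and not part of Intel Cluster Checker.

```shell
# Hypothetical helper: print one hostname per line for a numeric range.
# Usage: make_nodefile PREFIX FIRST LAST   (pass numbers without leading zeros)
make_nodefile() {
    prefix=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        # %03d zero-pads the index to three digits, e.g. 3 -> 003
        printf '%s%03d\n' "$prefix" "$i"
        i=$((i + 1))
    done
}

# Example: make_nodefile node 3 10 > nodefile
```

Adjust the padding width in the `printf` format if your hostnames use a different scheme.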
Running Intel Cluster Checker
*Make sure you can SSH to the nodes without a password. See SSH Login without Password.
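One way to confirm passwordless SSH before starting a run is to try a non-interactive login to every host in the nodefile. This is a rough sketch, not part of Intel Cluster Checker; the function name `check_ssh` is made up for illustration, and it assumes one hostname per line with blank lines and `#` comment lines skipped.

```shell
# Hypothetical helper: report which nodes accept passwordless SSH.
# Returns non-zero if at least one node fails.
check_ssh() {
    nodefile=$1
    failed=0
    while read -r host rest; do
        # Skip blank lines and comment lines.
        case $host in ''|'#'*) continue ;; esac
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # -n keeps ssh from consuming the rest of the nodefile on stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done < "$nodefile"
    return "$failed"
}

# Example: check_ssh nodefile && clck -f nodefile
```

Any host reported as FAIL needs its SSH keys fixed before the run.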
% clck -f nodefile
Example of a run:
Running Collect
................................................................................................................................................................................................................
Running Analyze

SUMMARY
Command-line: clck -f nodefile
Tests Run: health_base

**WARNING**: 3 tests failed to run. Information may be incomplete. See clck_execution_warnings.log for more information.

Overall Result: 8 issues found - HARDWARE UNIFORMITY (2), PERFORMANCE (2), SOFTWARE UNIFORMITY (4)
-----------------------------------------------------------------------------------------------------------------------------------------
8 nodes tested: node010, node[003-009]
0 nodes with no issues:
8 nodes with issues: node010, node[003-009]
-----------------------------------------------------------------------------------------------------------------------------------------
FUNCTIONALITY
No issues detected.

HARDWARE UNIFORMITY
The following hardware uniformity issues were detected:
  1. The InfiniBand PCI physical slot for device 'MT27800 Family [ConnectX-5]'

PERFORMANCE
The following performance issues were detected:
  1. Zombie processes detected.
     1 node: node010
  2. Processes using high CPU.
     7 nodes: node010, node[003,005-009]

SOFTWARE UNIFORMITY
The following software uniformity issues were detected:
  1. The OFED version, 'MLNX_OFED_LINUX-4.5-1.0.1.0 (OFED-4.5-1.0.1)', is not uniform.....
     5 nodes: node[003-004,006-007,009]
  2. The OFED version, 'MLNX_OFED_LINUX-4.3-1.0.1.0 (OFED-4.3-1.0.1)', is not uniform.....
     3 nodes: node010, node[005,008]
  3. Environment variables are not uniform across the nodes. .....
  4. Inconsistent Ethernet driver version. .....

See the following files for more information: clck_results.log, clck_execution_warnings.log
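If you want to act on the result in a script, the issue count can be pulled out of the summary text. The sketch below assumes the summary has been saved to a file or piped in, and that the line looks like the "Overall Result: 8 issues found" line in the sample above; other wordings of that line are not handled. The function name `clck_issue_count` is hypothetical.

```shell
# Hypothetical helper: read a clck summary on stdin and print the number
# from a line of the form "Overall Result: N issues found ...".
# Prints nothing if no line of that form is present.
clck_issue_count() {
    sed -n 's/^Overall Result: \([0-9][0-9]*\) issues* found.*/\1/p'
}

# Example, with the summary saved to a file named summary.txt (an assumed name):
#   clck_issue_count < summary.txt
```

An empty result means the line was missing or worded differently, so treat it as "unknown" rather than "no issues".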
Intel MPI Library Troubleshooting
If you are an administrator and want to make sure your cluster is set up to work with the Intel® MPI Library, run the following
% clck -f nodefile -F mpi_prereq_admin
If you are a non-privileged user and want to make sure the cluster is set up to work with the Intel® MPI Library, run the following
% clck -f nodefile -F mpi_prereq_user
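The choice between the two frameworks can be scripted. The sketch below assumes that "admin" simply means running as root (UID 0), which may not match how privileges are set up at your site; the function name `mpi_prereq_framework` is made up for illustration.

```shell
# Hypothetical helper: pick the Intel MPI prerequisite framework name.
# Assumption: administrator == root (UID 0). Pass a UID explicitly to override.
mpi_prereq_framework() {
    uid=${1:-$(id -u)}
    if [ "$uid" -eq 0 ]; then
        echo mpi_prereq_admin
    else
        echo mpi_prereq_user
    fi
}

# Example: clck -f nodefile -F "$(mpi_prereq_framework)"
```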
More Information:
- Using Intel Cluster Checker (Part 1)
- Using Intel Cluster Checker (Part 2)
- Using Intel Cluster Checker (Part 3)