Checking Disk Usage within the Subfolders while Avoiding Mount Points

If you need to check disk usage but wish to skip other mounted file systems, you can use the following command:

[root@hpc-hn /]# du -h -x -d 1
48M     ./etc
552M    ./root
11G     ./var
1.1G    ./tmp
11G     ./usr
0       ./media
0       ./mnt
4.8G    ./opt
0       ./srv
0       ./install
0       ./log
0       ./misc
0       ./net
0       ./server_priv
0       ./ProjectSpace
0       ./media1
0       ./media2
28G     .
  • -h prints sizes in human-readable format
  • -d sets the depth level; by default it is 0, which is the same as summarize
  • -x skips directories on other file systems
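
To rank the subfolders by size, the same command can be piped through sort -h, a small convenience not in the original (2>/dev/null just hides permission-denied noise):

du -h -x -d 1 / 2>/dev/null | sort -h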

Understanding Load Average in Linux

This is taken from the Red Hat article “What is the relation between I/O wait and load average?”; I learned quite a bit from it.

Linux, unlike traditional UNIX operating systems, computes its load average as the average number of runnable or running processes (R state) plus the number of processes in uninterruptible sleep (D state) over the specified interval. On UNIX systems, only the runnable or running processes are taken into account for the load average calculation.

On Linux the load average is a measurement of the amount of “work” being done by the machine (without being specific as to what that work is). This “work” could reflect a CPU intensive application (compiling a program or encrypting a file), or something I/O intensive (copying a file from disk to disk, or doing a database full table scan), or a combination of the two.

The article shows how to determine whether a high load average is the result of processes in the running state or in the uninterruptible state.

I like this script from the knowledge base. It prints the counts of running, blocked, and running+blocked processes:

[user@node1 ~]$ while true; do echo; uptime; ps -efl | awk 'BEGIN {running = 0; blocked = 0} $2 ~ /R/ {running++}; $2 ~ /D/ {blocked++} END {print "Number of running/blocked/running+blocked processes: "running"/"blocked"/"running+blocked}'; sleep 5; done

 23:45:52 up 52 days,  7:06, 22 users,  load average: 1.40, 1.26, 1.02
Number of running/blocked/running+blocked processes: 3/1/4

 23:45:57 up 52 days,  7:06, 22 users,  load average: 1.45, 1.27, 1.02
Number of running/blocked/running+blocked processes: 2/0/2

 23:46:02 up 52 days,  7:06, 22 users,  load average: 1.41, 1.27, 1.02
Number of running/blocked/running+blocked processes: 1/1/2

 23:46:07 up 52 days,  7:07, 22 users,  load average: 1.46, 1.28, 1.03
Number of running/blocked/running+blocked processes: 2/0/2

 23:46:12 up 52 days,  7:07, 22 users,  load average: 1.42, 1.27, 1.03
Number of running/blocked/running+blocked processes: 2/0/2

 23:46:17 up 52 days,  7:07, 22 users,  load average: 1.55, 1.30, 1.04
Number of running/blocked/running+blocked processes: 2/0/2

 23:46:22 up 52 days,  7:07, 22 users,  load average: 1.51, 1.30, 1.04
Number of running/blocked/running+blocked processes: 1/1/2

 23:46:27 up 52 days,  7:07, 22 users,  load average: 1.55, 1.31, 1.05
Number of running/blocked/running+blocked processes: 2/0/2

 23:46:32 up 52 days,  7:07, 22 users,  load average: 1.62, 1.33, 1.06
Number of running/blocked/running+blocked processes: 2/1/3

 23:46:38 up 52 days,  7:07, 22 users,  load average: 1.81, 1.38, 1.07
Number of running/blocked/running+blocked processes: 1/1/2

 23:46:43 up 52 days,  7:07, 22 users,  load average: 1.66, 1.35, 1.07
Number of running/blocked/running+blocked processes: 1/0/1

 23:46:48 up 52 days,  7:07, 22 users,  load average: 1.53, 1.33, 1.06
Number of running/blocked/running+blocked processes: 1/0/1

Another useful approach is to look at typical top output when the load average is high (press i to filter out the idle/sleeping tasks). Here the high load average is caused by many sendmail tasks in D status; they may be waiting either on I/O or on the network.

top - 13:23:21 up 329 days,  8:35,  0 users,  load average: 50.13, 13.22, 6.27
Tasks: 437 total,   1 running, 435 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.1%us,  1.5%sy,  0.0%ni, 93.6%id,  4.5%wa,  0.1%hi,  0.2%si,  0.0%st
Mem:  34970576k total, 24700568k used, 10270008k free,  1166628k buffers
Swap:  2096440k total,        0k used,  2096440k free, 11233868k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
11975 root      15   0 13036 1356  820 R  0.7  0.0   0:00.66 top                
15915 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15918 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15920 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15921 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15922 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15923 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15924 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15926 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15928 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15929 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15930 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15931 root      18   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
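
To list the individual D-state processes directly, without opening an interactive top session, a simple ps filter works (my own sketch, not from the Red Hat article):

ps -eo state,pid,user,cmd | awk '$1 ~ /D/'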

References:

  1. What is the relation between I/O wait and load average?

Finding Top Processes using Highest Memory and CPU Usage in Linux

This comes from the article “Find Top Running Processes by Highest Memory and CPU Usage in Linux”. It is a quick way to view the processes consuming the most RAM and CPU:

ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head
   PID   PPID CMD                         %MEM %CPU
414699 414695 /usr/local/ansys_inc/v201/f 20.4 98.8
 30371      1 /usr/local/pbsworks/pbs_acc  0.2  1.0
 32241      1 /usr/local/pbsworks/pbs_acc  0.2  4.0
 30222      1 /usr/local/pbsworks/pbs_acc  0.2  0.6
  7191      1 /usr/local/pbsworks/dm_exec  0.1  0.8
 30595      1 /usr/local/pbsworks/pbs_acc  0.1  3.1
 30013      1 /usr/local/pbsworks/pbs_acc  0.1  0.3
 29602  29599 nginx: worker process        0.1  0.2
 29601  29599 nginx: worker process        0.1  0.3

The -o flag specifies the output format and -e selects all processes. To sort in descending order, the option should be --sort=-%mem (note the leading minus before %mem).

Interesting.
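
The same one-liner can be sorted by CPU instead of memory (a small variation, not from the article):

ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head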

Getting Useful Information on CPU and Configuration

Point 1: lscpu

To install

yum install util-linux

lscpu – (Prints out information about the CPU and its configuration)

[user1@myheadnode1 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
Stepping: 4
CPU MHz: 3200.000
BogoMIPS: 6400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu .................
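
A quick way to pull out just the topology fields from that output (a small convenience, not from the original):

lscpu | grep -E '^(Socket|Core|Thread)'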

Point 2: hwloc-ls

To install

yum install hwloc

hwloc – (Prints out useful information about the NUMA locality of devices and general hardware locality information)

[user1@myheadnode1 ~]# hwloc-ls
Machine (544GB total)
  NUMANode L#0 (P#0 256GB)
    Package L#0 + L3 L#0 (25MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
.....
.....
.....

Point 3: Check whether Boost is on for AMD

Print out whether CPU boost is on (1) or off (0):

cat /sys/devices/system/cpu/cpufreq/boost
1
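
On systems using the acpi-cpufreq driver, this file is also writable by root, so boost can be toggled; a sketch (an assumption — verify which cpufreq driver your system uses first):

echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost   # 0 disables boost, 1 enables it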

References:

  1. Tuning Guide for AMD EPYC (pdf)

Checking the Limits Imposed on an Application during a Run

If you wish to look at a specific application's limits while it is running, you can do the following:

pgrep fortcom
12345

* I used fortcom here, but it could be any application you wish to inspect.

cat /proc/12345/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            8388608      unlimited    bytes
Max core file size        0            unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             4096         2190327      processes
Max open files            1024         4096         files
Max locked memory         unlimited    unlimited    bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       2190327      2190327      signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us

* Note that Max locked memory and Max file locks are both unlimited.
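
The two steps can also be combined into one line; -n tells pgrep to pick the newest matching process (a convenience sketch, assuming a single instance of interest):

cat /proc/$(pgrep -n fortcom)/limits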

Another way to calculate shared memory swapping

The ipcs utility finds out information on shared memory utilisation, which can be useful for analysing the performance of the system. Let's say you want to measure how much memory has been swapped.

% ipcs -mu
------ Shared Memory Status --------
segments allocated 55
pages allocated 6655333
pages resident  5661034
pages swapped   947522
Swap performance: 0 attempts     0 successes

where

  • -m shows “information about active shared memory segments”
  • -u shows a “status summary”

You would also need the page size:

getconf PAGESIZE
4096

To convert the swapped pages into MB:

echo "$((947522*4096/1024/1024)) MB"
3701 MB
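
Putting it all together, a one-liner can extract the swapped-page count and convert it (a sketch that assumes the English ipcs output shown above):

echo "$(( $(ipcs -mu | awk '/pages swapped/ {print $3}') * $(getconf PAGESIZE) / 1024 / 1024 )) MB"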

Using strace to detect df hanging issues on NFS

strace is a wonderful tool to trace system calls and signals.

I was having hanging issues whenever I ran df, and I was curious which file system was causing them.

% strace df
.....
.....
stat("/run/user/1304561586", {st_mode=S_IFDIR|0700, st_size=40, ...}) = 0
stat("/run/user/17132623", {st_mode=S_IFDIR|0700, st_size=40, ...}) = 0
stat("/run/user/17149581", {st_mode=S_IFDIR|0700, st_size=40, ...}) = 0
stat("/run/user/1304565184", {st_mode=S_IFDIR|0700, st_size=60, ...}) = 0
stat("/scratch",

It is obvious that /scratch is the culprit: the trace hangs immediately after the stat() call on it is issued.
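
Once you suspect a mount, each mounted file system can be probed with a timeout so that a hung NFS mount is reported rather than blocking your shell; a rough sketch (the 5-second limit is arbitrary):

for m in $(awk '{print $2}' /proc/mounts); do
    timeout 5 stat -f "$m" > /dev/null 2>&1 || echo "$m is not responding"
done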

Using Find and Tar Together to Backup and Archive

Point 1: If you wish to find files in a single folder and tar them into a gzip-compressed archive, you can use a one-liner:

% find -maxdepth 1 -name '*.sh' | tar czf script.tgz -T -

“-maxdepth 1” limits find to the current directory, without descending into subdirectories.

“-T -” causes tar to read its list of files from a file rather than from the command line; the “-” means standard input.

You should end up with a file named script.tgz.

Point 2: If you wish to find files in a single folder and tar them into a bzip2-compressed archive:

% find -maxdepth 1 -name '*.sh' | tar cjf script.tbz -T -

(Note the .tbz extension, since this archive is bzip2-compressed rather than gzip-compressed.)
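
If the file names may contain spaces or newlines, a NUL-delimited variant is safer (a sketch using the GNU find -print0 and GNU tar --null options):

% find -maxdepth 1 -name '*.sh' -print0 | tar cjf script.tbz --null -T -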

Checking Disk Usage within the Subfolders

I like this command, which I often use to look into disk usage at the subfolder level and check for large consumers:

% du -h -d 1
1.3M    ./Espresso-BEEF
65M     ./MATLAB
478M    ./Abaqus
10G     ./COMSOL
8.3M    ./Gaussian2
316K    ./scripts
4.9M    ./NB07
647M    ./pytorch-GAN
31M     ./Gaussian
12G     .

where

  • -h prints sizes in human-readable format
  • -d sets the depth level; by default it is 0, which is the same as summarize
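
To drill down from a large subfolder to the individual files responsible, find can list files above a size threshold (my own sketch; adjust the +1G threshold as needed):

% find . -type f -size +1G -exec ls -lh {} +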