rsync write failed with "No space left on device" (28)

If you run an rsync command such as this:

% rsync -lH -rva --no-inc-recursive --progress gromacs remote_server:/usr/local

and you encounter an error like this:

% rsync: write failed on "/usr/local": No space left on device (28)

If, after checking that the source and destination have sufficient space, you are still encountering the issue, you may want to add the "--inplace" parameter. According to the rsync man page: "This option changes how rsync transfers a file when its data needs to be updated: instead of the default method of creating a new copy of the file and moving it into place when it is complete, rsync instead writes the updated data directly to the destination file."

WARNING: you should not use this option to update files that are being accessed by others, so be careful when choosing to use this for a copy. For more information, do take a look at https://download.samba.org/pub/rsync/rsync.html

% rsync -lH -rva --inplace --no-inc-recursive --progress gromacs remote_server:/usr/local
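
One more check that can save time before (or after) switching to --inplace: "No space left on device" can also be caused by the destination filesystem running out of inodes even when free blocks remain, so it is worth checking both on the destination host (the path here is just a placeholder):

% df -h /usr/local
% df -i /usr/local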

UDP Tuning to maximise performance

There is an interesting article on how your UDP traffic can be tuned for maximum performance with a few tweaks. The article is taken from UDP Tuning.

The most important factors mentioned in the article are:

  • Use jumbo frames: performance will be 4-5 times better using 9K MTUs
  • Packet size: best performance is MTU size minus packet header size. For example, for a 9000-byte MTU, use 8972 for IPv4 and 8952 for IPv6.
  • Socket buffer size: for UDP, buffer size is not related to RTT the way it is for TCP, but the defaults are still not large enough. Setting the socket buffer to 4M seems to help a lot in most cases (see the sketch after this list).
  • Core selection: UDP at 10G is typically CPU limited, so it's important to pick the right core. This is particularly true on Sandy/Ivy Bridge motherboards.
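
For the socket buffer point above, note that the kernel caps how large a buffer an application may request, so the caps usually need to be raised as well. A minimal sketch (values are illustrative, and the application or benchmark still has to request the larger buffer, e.g. via setsockopt or a flag such as iperf's -w):

# sysctl -w net.core.rmem_max=4194304
# sysctl -w net.core.wmem_max=4194304

To make the change persistent, put the same two settings in /etc/sysctl.conf and run sysctl -p.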

Do take a look at the article UDP Tuning

Understanding Load Average in Linux

This is taken from the Red Hat article "What is the relation between I/O wait and load average?", and I have learned quite a bit from it.

Linux, unlike traditional UNIX operating systems, computes its load average as the average number of runnable or running processes (R state) plus the number of processes in uninterruptible sleep (D state) over the specified interval. On UNIX systems, only the runnable or running processes are taken into account for the load average calculation.

On Linux the load average is a measurement of the amount of “work” being done by the machine (without being specific as to what that work is). This “work” could reflect a CPU intensive application (compiling a program or encrypting a file), or something I/O intensive (copying a file from disk to disk, or doing a database full table scan), or a combination of the two.

The article also shows how to determine whether a high load average is the result of processes in the running state or in the uninterruptible state.

I like this script that was written in the knowledgebase. It shows the number of running, blocked, and running+blocked processes alongside the load average:

[user@node1 ~]$ while true; do echo; uptime; ps -efl | awk 'BEGIN {running = 0; blocked = 0} $2 ~ /R/ {running++}; $2 ~ /D/ {blocked++} END {print "Number of running/blocked/running+blocked processes: "running"/"blocked"/"running+blocked}'; sleep 5; done

 23:45:52 up 52 days,  7:06, 22 users,  load average: 1.40, 1.26, 1.02
Number of running/blocked/running+blocked processes: 3/1/4

 23:45:57 up 52 days,  7:06, 22 users,  load average: 1.45, 1.27, 1.02
Number of running/blocked/running+blocked processes: 2/0/2

 23:46:02 up 52 days,  7:06, 22 users,  load average: 1.41, 1.27, 1.02
Number of running/blocked/running+blocked processes: 1/1/2

 23:46:07 up 52 days,  7:07, 22 users,  load average: 1.46, 1.28, 1.03
Number of running/blocked/running+blocked processes: 2/0/2

 23:46:12 up 52 days,  7:07, 22 users,  load average: 1.42, 1.27, 1.03
Number of running/blocked/running+blocked processes: 2/0/2

 23:46:17 up 52 days,  7:07, 22 users,  load average: 1.55, 1.30, 1.04
Number of running/blocked/running+blocked processes: 2/0/2

 23:46:22 up 52 days,  7:07, 22 users,  load average: 1.51, 1.30, 1.04
Number of running/blocked/running+blocked processes: 1/1/2

 23:46:27 up 52 days,  7:07, 22 users,  load average: 1.55, 1.31, 1.05
Number of running/blocked/running+blocked processes: 2/0/2

 23:46:32 up 52 days,  7:07, 22 users,  load average: 1.62, 1.33, 1.06
Number of running/blocked/running+blocked processes: 2/1/3

 23:46:38 up 52 days,  7:07, 22 users,  load average: 1.81, 1.38, 1.07
Number of running/blocked/running+blocked processes: 1/1/2

 23:46:43 up 52 days,  7:07, 22 users,  load average: 1.66, 1.35, 1.07
Number of running/blocked/running+blocked processes: 1/0/1

 23:46:48 up 52 days,  7:07, 22 users,  load average: 1.53, 1.33, 1.06
Number of running/blocked/running+blocked processes: 1/0/1

Another useful view is the typical top output when the load average is high (filter out the idle/sleeping tasks by pressing "i"). Here the high load average is caused by lots of sendmail tasks in D status; they may be waiting on either I/O or the network. A quick way to list such D-state processes is shown after the output.

top - 13:23:21 up 329 days,  8:35,  0 users,  load average: 50.13, 13.22, 6.27
Tasks: 437 total,   1 running, 435 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.1%us,  1.5%sy,  0.0%ni, 93.6%id,  4.5%wa,  0.1%hi,  0.2%si,  0.0%st
Mem:  34970576k total, 24700568k used, 10270008k free,  1166628k buffers
Swap:  2096440k total,        0k used,  2096440k free, 11233868k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
11975 root      15   0 13036 1356  820 R  0.7  0.0   0:00.66 top                
15915 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15918 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15920 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15921 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15922 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15923 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15924 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15926 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15928 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15929 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15930 root      17   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
15931 root      18   0  5312  872   80 D  0.0  0.0   0:00.00 sendmail           
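
If you want to see exactly which processes are contributing to the blocked (D) count at a given moment, a one-liner such as this (assuming the procps version of ps) lists them together with the kernel function they are waiting in:

[user@node1 ~]$ ps -eo pid,user,stat,wchan:32,cmd | awk 'NR==1 || $3 ~ /^D/'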

References:

  1. What is the relation between I/O wait and load average?

Error “Too many files open” on CentOS 7

If you encounter error messages such as "Too many open files" during login and the session gets terminated automatically, it is because the number of open files for a user or for the system as a whole has exceeded the default limit, and you may wish to raise it.

@ System Level

To see the system-wide setting for the maximum number of open files:

# cat /proc/sys/fs/file-max
55494980

This value is the maximum number of file handles that all processes running on the system can open in total. By default this number varies according to the amount of RAM in the system; as a rough guideline it will be about 100,000 files per GB of RAM.

To override the system-wide maximum open files, edit /etc/sysctl.conf:

# vim /etc/sysctl.conf
 fs.file-max = 80000000

Activate this change on the live system:

# sysctl -p
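
To see how close the system currently is to this limit, you can read /proc/sys/fs/file-nr; the three values are the number of allocated file handles, the number of allocated-but-unused handles, and the maximum (the same value as fs.file-max):

# cat /proc/sys/fs/file-nr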

@ User Level

To see the setting for maximum open files for a user

# su - user1
$ ulimit -n
1024

To change the setting for a specific user, edit /etc/security/limits.conf:

$ vim /etc/security/limits.conf
user1 - nofile 2048

To change the setting for all users:

* - nofile 2048

This sets the maximum number of open files for ALL users to 2048. These settings take effect on new login sessions (a reboot will also do it).
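
limits.conf also lets you set soft and hard limits separately instead of using "-" (which sets both at once); a sketch for a single user, where the username and values are just examples:

user1 soft nofile 2048
user1 hard nofile 4096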

References:

  1. How to correct the error “Too many files open” on Red Hat Enterprise Linux

Tools to Show your System Configuration

Tool 1: Display Information about CPU Architecture

[user1@node1 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
Stepping: 4
CPU MHz: 3200.000
BogoMIPS: 6400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
.....
.....

Tool 2: List all PCI devices

[user1@node1 ~]# lspci -t -vv
-+-[0000:d7]-+-00.0-[d8]--
| +-01.0-[d9]--
| +-02.0-[da]--
| +-03.0-[db]--
| +-05.0 Intel Corporation Device 2034
| +-05.2 Intel Corporation Sky Lake-E RAS Configuration Registers
| +-05.4 Intel Corporation Device 2036
| +-0e.0 Intel Corporation Device 2058
| +-0e.1 Intel Corporation Device 2059
| +-0f.0 Intel Corporation Device 2058
| +-0f.1 Intel Corporation Device 2059
| +-10.0 Intel Corporation Device 2058
| +-10.1 Intel Corporation Device 2059
| +-12.0 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.1 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.2 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.4 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.5 Intel Corporation Sky Lake-E M3KTI Registers
| +-15.0 Intel Corporation Sky Lake-E M2PCI Registers
| +-16.0 Intel Corporation Sky Lake-E M2PCI Registers
| +-16.4 Intel Corporation Sky Lake-E M2PCI Registers
| \-17.0 Intel Corporation Sky Lake-E M2PCI Registers
.....
.....

Tool 3: List block devices

[user@node1 ~]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda               8:0    0  1.1T  0 disk
├─sda1            8:1    0  200M  0 part /boot/efi
├─sda2            8:2    0    1G  0 part /boot
└─sda3            8:3    0  1.1T  0 part
  ├─centos-root 253:0    0   50G  0 lvm  /
  ├─centos-swap 253:1    0    4G  0 lvm  [SWAP]
  └─centos-home 253:2    0    1T  0 lvm

Tool 4: See flags kernel booted with

[user@node1 ~]$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet LANG=en_US.UTF-8

Tool 5: Display available network interfaces

[root@hpc-gekko1 ~]# ifconfig -a
eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether XX:XX:XX:XX:XX:XX txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
.....
.....
.....

Tool 6: Using dmidecode to find hardware information
See Using dmidecode to find hardware information
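
For example, a few commonly used invocations (run as root; the type keywords below are standard dmidecode types):

# dmidecode -t system
# dmidecode -t memory
# dmidecode -t bios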

Using qperf to measure network bandwidth and latency

qperf is a network bandwidth and latency measurement tool which works over many transports, including TCP/IP, RDMA, UDP and SCTP.

Step 1: Installing qperf on both Server and Client

Server> # yum install qperf
Client> # yum install qperf

Step 2: Open Up the Firewall on the Server

# firewall-cmd --permanent --add-port=19765-19766/tcp
# firewall-cmd --reload

The server listens on TCP port 19765 (the control port) by default. Do note that once qperf makes a connection, it also opens a separate data port for the actual test traffic; here we fix the data port to 19766 (with the -ip option on the client) so that both ports can be allowed through the firewall.

Step 3: Have the Server listen (as the qperf server)

Server> $ qperf

Step 4: Connect to qperf Server with qperf Client and measure bandwidth

Client > $ qperf -ip 19766 -t 60 qperf_server_ip_address tcp_bw
tcp_bw:
    bw  =  2.52 GB/sec

Step 5: Connect to qperf Server with qperf Client and measure latency

Client > $ qperf -vvs qperf_server_ip_address tcp_lat
tcp_lat:
    latency = 20.7 us
    msg_rate = 48.1 K/sec
    loc_send_bytes = 48.2 KB
    loc_recv_bytes = 48.2 KB
    loc_send_msgs = 48,196
    loc_recv_msgs = 48,196
    rem_send_bytes = 48.2 KB
    rem_recv_bytes = 48.2 KB
    rem_send_msgs = 48,197
    rem_recv_msgs = 48,197
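
Since qperf also supports other transports, the same client invocation can be pointed at different tests. For example, a sketch measuring UDP bandwidth and latency in one run (reusing the same data port as above):

Client > $ qperf -ip 19766 -t 60 qperf_server_ip_address udp_bw udp_lat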

References:

  1. How to use qperf to measure network bandwidth and latency performance?

Using FIO to measure IO Performance

FIO is a tool for measuring IO performance. You can use FIO to run a user-defined workload and collect the associated performance data.

A good write-up can be found at Fio Basics. For more information, do take a look at the fio man page.

Step 1: To install fio, make sure you have activated the EPEL repository.

# yum install fio

The parameters used are explained below; a second example follows the list.

% fio --filename=myfio --size=5GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1
  • --filename
    An entire block device or a file name can be specified. For example, if you are testing block devices directly: --filename=/dev/sda:/dev/sdb
  • --size
    Indicates the size of the test file generated. If you are testing a block device and did not specify the size, the entire device will be used.
  • --direct=1
    This means that the page cache is bypassed and I/O goes directly to the device (O_DIRECT).
  • --rw
    Controls whether I/O is accessed sequentially or randomly. There are a few options:
    read – sequential reads
    write – sequential writes
    randread – random reads
    randwrite – random writes
    readwrite, rw – mixed sequential workload
    randrw – mixed random workload
  • --bs
    Block size. By default 4KB is used.
  • --ioengine=libaio
    libaio enables asynchronous access from the application level. This option requires --direct=1.
  • --iodepth
    Number of I/O requests kept in flight (queue depth).
  • --numjobs
    Specifies the number of processes that run the job in parallel.
  • --runtime
    Terminate processing after the specified period of time.
  • --name
    Name of the job.
  • --time_based
    fio will run for the duration of the runtime specified.
  • --group_reporting
    Display statistics for groups of jobs as a whole instead of for each individual job. This is especially useful when numjobs is used.
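
As a second example, here is a sketch of a sequential-read throughput test against the same test file; the block size, queue depth and runtime are illustrative and should be tuned to your storage:

% fio --filename=myfio --size=5GB --direct=1 --rw=read --bs=1M --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=1 --time_based --group_reporting --name=seq-read-test --eta-newline=1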


Listing processes for a specific user

Using top to list the processes of a specific user, which is one of my favourites:

% top -U user1

pstree displays a tree of processes and can include parent and child processes, which makes the relationships easier to understand:

% pstree -l -a -p -s user1


where
-l : Display long lines (do not truncate)
-a : Show command line args
-p : Display Linux PIDs
-s : See parents of the selected process

pgrep looks up or signals processes based on name and other attributes:

% pgrep -l -u user1
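
A plain ps also works if you just want a one-off full-format listing of a user's processes, for example:

% ps -fu user1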

References:

  1. Linux list processes by user names