Altair acquires Univa and Ellexus

Altair (Nasdaq: ALTR), a global technology company providing solutions in data analytics, product development, and high-performance computing (HPC), today announced the acquisition of Univa, a leading innovator in enterprise-grade workload management, scheduling, and optimization solutions for HPC and artificial intelligence (AI), on-premises and in the cloud.

For more information, see Altair Acquires Univa.

How to increase the number of threads created by the NFS daemon for CentOS 7

Taken from How to increase the number of threads created by the NFS daemon in RHEL 4, 5, 6 and 7?

For an NFS server under high load, it may be advisable to increase the number of threads created during nfsd server startup.

Edit the threads line in /etc/nfs.conf:

% vim /etc/nfs.conf
[nfsd]
# debug=0
threads=64
# host=
# port=0
# grace-time=90
# lease-time=90
# udp=y
# tcp=y
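
After saving the change, restart the NFS server so the new thread count takes effect. On a systemd-based system that would be something like:

% systemctl restart nfs-server

(On CentOS 7 hosts whose nfs-utils does not read /etc/nfs.conf, the Red Hat article referenced above describes the equivalent setting, RPCNFSDCOUNT=64 in /etc/sysconfig/nfs.)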

Testing whether it works:

% cat /proc/net/rpc/nfsd

th 64 0 2.610 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

According to Red Hat, “The first number is the total number of NFS server threads started. The second number indicates whether at any time all of the threads were running at once. The remaining numbers are a thread count time histogram.”
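
To pull out just the running thread count for a quick check (the awk field index follows the th line format quoted above):

% grep ^th /proc/net/rpc/nfsd | awk '{ print $2 }'
64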

Find CPU and GPU Performance Headroom using Roofline Analysis

Join Technical Consulting Engineer and HPC programming expert Cedric Andreolli for a session covering:

  • How to perform GPU headroom and GPU cache locality analysis using Advisor Roofline extensions for oneAPI and OpenMP
  • An introduction to a new memory-level Roofline feature that helps pinpoint which specific memory level (L1, L2, L3, or DRAM) is causing the bottleneck
  • A walkthrough of Intel Advisor’s improved user interface

To view the video, see https://techdecoded.intel.io/essentials/find-cpu-gpu-performance-headroom-using-roofline-analysis/#gs.fpbz93

NVIDIA to Acquire Arm for $40 Billion, Creating World’s Premier Computing Company for the Age of AI

NVIDIA and SoftBank Group Corp. (SBG) today announced a definitive agreement under which NVIDIA will acquire Arm Limited from SBG and the SoftBank Vision Fund (together, “SoftBank”) in a transaction valued at $40 billion. The transaction is expected to be immediately accretive to NVIDIA’s non-GAAP gross margin and non-GAAP earnings per share.

The combination brings together NVIDIA’s leading AI computing platform with Arm’s vast ecosystem to create the premier computing company for the age of artificial intelligence, accelerating innovation while expanding into large, high-growth markets. SoftBank will remain committed to Arm’s long-term success through its ownership stake in NVIDIA, expected to be under 10 percent.

For more information, see NVIDIA to Acquire Arm for $40 Billion, Creating World’s Premier Computing Company for the Age of AI.

Checking nproc limits

One of our Linux compute servers was showing the following error when a particular user attempted to log in:

failed to execute /bin/bash: resource temporarily unavailable

We suspected that the nproc limit had been breached by that particular user. I found this write-up, https://blog.dbi-services.com/linux-how-to-monitor-the-nproc-limit-1/, very helpful in pinning down the issue I faced.

Extracting information via ps is not useful unless you use the “-L” flag to show threads, i.e. LWPs (light-weight processes).

% ps h -LA -o user | sort | uniq -c | sort -n
1 chrony
1 dbus
1 libstoragemgmt
1 nobody
1 rpc
1 rpcuser
2 avahi
2 user3
2 postfix
3 colord
3 rtkit
4 user1
4 user2
7 polkitd
23 user4
31 user5
34 user6
361 user7
442 user8
556 gdm
563 user9
922 user10
3319 root
16384 user11

You can see that user11 has 16384 threads!

To dig down into what is happening for a selected user, we will use user2, since it has one of the fewest LWPs:

% ps -o nlwp,pid,lwp,args -u user2 | sort -n
NLWP PID LWP COMMAND
1 272705 272705 sshd: user2@pts/12
1 273054 273054 sshd: user2@notty
1 273216 273216 /usr/libexec/openssh/sftp-server
1 273406 273406 -bash

nlwp – number of LWPs (threads) in the process.
lwp – the ID of the LWP (thread).
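
Since the nproc limit counts all of a user’s threads, it can be handy to sum the NLWP column into a single figure. A small illustrative one-liner:

% ps h -o nlwp -u user2 | awk '{ s += $1 } END { print s }'
4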

To eliminate the offending user11’s thousands of threads, kill all of that user’s processes:

% pkill -KILL -u user11
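
It is also worth checking which nproc limits actually apply, to confirm the breach. A minimal sketch; PID 272705 is taken from the ps listing above, and the limits file is the CentOS 7 default location:

# Soft and hard nproc ("max user processes") limits for the current shell
% ulimit -Su
% ulimit -Hu

# nproc limits applied to a specific running process
% grep "Max processes" /proc/272705/limits

# Distribution default for non-root users on CentOS 7
% cat /etc/security/limits.d/20-nproc.conf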

References

  1. Linux: how to monitor the nproc limit
  2. How is the nproc hard limit calculated and how do we change the value on CentOS 7

Intel Launches 11th Gen Intel Core and Intel Evo (code-named “Tiger Lake”)

Intel released 11th Gen Intel® Core™ mobile processors with Iris® Xe graphics (code-named “Tiger Lake”). The new processors break the boundaries of performance with unmatched capabilities in productivity, collaboration, creation, gaming and entertainment on ultra-thin-and-light laptops. They also power the first class of Intel Evo platforms, made possible by the Project Athena innovation program.

  • Intel launches 11th Gen Intel® Core™ processors with Intel® Iris® Xe graphics, the world’s best processors for thin-and-light laptops, delivering up to 2.7x faster content creation, more than 20% faster office productivity and more than 2x faster gaming plus streaming in real-world workflows over competitive products.
  • Intel® Evo™ platform brand introduced for designs based on 11th Gen Intel Core processors with Intel Iris Xe graphics and verified through the Project Athena innovation program’s second-edition specification and key experience indicators (KEIs).
  • More than 150 designs based on 11th Gen Intel Core processors are expected from Acer, Asus, Dell, Dynabook, HP, Lenovo, LG, MSI, Razer, Samsung and others.

QLC support in Pure Flash Array

I read an interesting article by Pure Storage on QLC support in the Pure FlashArray//C, which challenges, or at least comes close to, hybrid (SSD + spinning disk) storage solutions. The article is titled “Hybrid Arrays – Not Dead Yet, But … QLC Flash Is Here”.

According to the article,

Why QLC?

It all comes down to how many bits of data can be stored in each tiny little cell on a flash chip. Most enterprise flash arrays currently use triple-level cell (TLC) chips that store three bits in each cell. A newer generation, quad-level cell (QLC) can store—you guessed it—four bits per cell. 

Better still, it’s more economical to manufacture QLC flash chips than TLC flash. Sounds great, except for two big problems: 

  • QLC flash has far lower endurance, typically limited to fewer than 1,000 program/erase cycles. This is one-tenth the endurance of TLC flash.
  • QLC flash is less performant, with higher latency and lower throughput than TLC. 

Because of these technical challenges, there are only a few QLC-based storage arrays on the market. And the only way those arrays can attain enterprise-grade performance is by overprovisioning (which decreases the amount of usable storage) or by adding a persistent memory tier (which significantly increases cost).

How did Pure Storage integrate QLC?

So what has Pure done differently? Crucially, the hardware and software engineers who built QLC support into FlashArray//C built on Pure’s unique vertically integrated architecture. Instead of using flash solid-state drive (SSD) modules like other storage vendors, Pure’s proprietary DirectFlash® modules connect raw flash directly to the FlashArray™ storage via NVMe, which reduces latency and increases throughput. And unlike traditional SSDs that use a flash controller or flash translation layer, DirectFlash is primarily raw flash. The flash translation takes place in the software.

This architecture allows the Purity operating environment to schedule and place data on the storage media with extreme precision, overcoming the technical challenges that have constrained other vendors.

For more information, read “Hybrid Arrays – Not Dead Yet, But … QLC Flash Is Here”.

Disk performance

Storage Benchmarking

There are four components of I/O latency that you may want to consider.

I/O Latency
I/O latency is defined simply as the time that it takes to complete a single I/O operation. For a conventional spinning disk, there are four components: command overhead, seek latency, rotational latency and transfer time.

  1. Command Overhead is the time the drive electronics take to process an I/O command before any mechanical movement begins.
  2. Seek Latency is how long it takes for the disk head assembly to travel to the track of the disk where the data will be read or written. The fastest high-end server drives today have a seek time of around 4 ms; the average desktop disk is around 9 ms (taken from Wikipedia).
  3. Rotational Latency is the delay waiting for the rotation of the disk to bring the required sector under the read/write head. For a 7,200 rpm disk, the average rotational latency is around 4.17 ms (taken from Wikipedia; see the worked example after the table below).
  4. Transfer Time is the time it takes to transmit or move data from one place to another. Transfer time equals transfer size divided by data rate.
Typical HDD figures (from Wikipedia):

HDD spindle speed [rpm]    Average rotational latency [ms]
4,200                      7.14
5,400                      5.56
7,200                      4.17
10,000                     3.00
15,000                     2.00
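
Average rotational latency is simply half a revolution: (60 / rpm) / 2 seconds. As a quick sanity check of the 7,200 rpm figure in the table (the awk one-liner is just illustrative):

% awk 'BEGIN { rpm = 7200; printf "%.2f ms\n", 60 / rpm / 2 * 1000 }'
4.17 ms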

So a simplistic calculation of the total latency:

overhead + seek + rotational latency + transfer
0.5 ms + 4 ms + 4.17 ms + 0.8 ms = 9.47 ms

Acceptable I/O

A question frequently asked is: what is acceptable I/O latency? The Kaminario site states:
The Avg. Disk sec/Read performance counter indicates the average time, in seconds, of a read of data from the disk. The average value of the Avg. Disk sec/Read performance counter should be under 10 milliseconds. The maximum value of the Avg. Disk sec/Read performance counter should not exceed 50 milliseconds.
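
To see how your own storage compares with these numbers, you can measure read latency directly. A minimal sketch, assuming the ioping and fio tools are installed (the mount point and test-file path are illustrative):

# Issue 10 latency-probe requests against the filesystem
% ioping -c 10 /mnt/data

# Or measure 4k random-read latency with fio at queue depth 1
% fio --name=randread --rw=randread --bs=4k --size=1G --runtime=30 --time_based --iodepth=1 --direct=1 --filename=/mnt/data/fio.test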

References:

  1. What Is an Acceptable I/O Latency?
  2. Disk Performance
  3. Difference between Seek Time and Rotational Latency in Disk Scheduling