Faulty disks accepting I/O requests without returning any failure to GPFS

We have encountered a situation where a defunct disk kept accepting IO requests but did not return a failure in time. As a result, these IO requests hung until they timed out (10 seconds by default). Normally, when Spectrum Scale/GPFS fails to read or write a disk, the failure is written to the log and IO is shifted to other available disks, which should be quick.

Normally such operations should return in 20 milliseconds or less. When an IO request times out, it has taken 10 seconds / 20 milliseconds = 500 times as long as a normal request. Even if Spectrum Scale/GPFS chooses a fast disk on the second attempt, the operation as a whole is still much slower than normal.
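
The arithmetic above can be checked with a small sketch. It uses only the two figures from the text (a 10 second timeout and a 20 millisecond normal latency). The function names are hypothetical and chosen for illustration, and the retry is assumed to complete at the normal latency.

```python
# Figures taken from the text, expressed in milliseconds.
TIMEOUT_MS = 10_000   # IO timeout (10 seconds)
NORMAL_MS = 20        # normal IO latency (upper bound)


def slowdown_factor(timeout_ms, normal_ms):
    """How many times longer a timed-out request takes than a normal one."""
    return timeout_ms / normal_ms


def total_with_retry_ms(timeout_ms, retry_ms):
    """Total time when the first attempt times out and a retry succeeds.

    Assumption: the retry goes to a healthy disk and takes retry_ms.
    """
    return timeout_ms + retry_ms


if __name__ == "__main__":
    print(slowdown_factor(TIMEOUT_MS, NORMAL_MS))
    print(total_with_retry_ms(TIMEOUT_MS, NORMAL_MS))
```

Under these assumptions the timed-out attempt dominates the total: the retry adds only 20 milliseconds on top of the 10 seconds already lost.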

Because of striping, a bad or slow disk affects IO on many files, far more than it would without striping. IO on a single file involves several disks, and the IO has to wait for the slowest request to return. So a single bad or slow disk can have a considerable influence on Spectrum Scale/GPFS performance.
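
The effect of striping can be illustrated with a simplified model. This is only a sketch under stated assumptions, not a description of how Spectrum Scale/GPFS places data: each file is assumed to be striped over stripe_width distinct disks chosen uniformly from num_disks, and a striped request is assumed to finish only when its slowest disk returns. All names and sample latencies are hypothetical.

```python
def striped_io_latency_ms(per_disk_latencies_ms):
    """A striped request completes only when its slowest disk has returned."""
    return max(per_disk_latencies_ms)


def fraction_of_files_affected(stripe_width, num_disks):
    """Fraction of files that touch one particular bad disk.

    Assumption: each file uses stripe_width distinct disks picked
    uniformly at random, so the chance of hitting a given disk is
    stripe_width / num_disks. stripe_width == 1 models no striping.
    """
    return stripe_width / num_disks


if __name__ == "__main__":
    healthy = [20, 18, 19, 20]       # all disks answer in about 20 ms
    one_bad = [20, 18, 10_000, 20]   # one disk hangs until the 10 s timeout

    print(striped_io_latency_ms(healthy))
    print(striped_io_latency_ms(one_bad))

    # One bad disk out of 8: without striping vs. striping over all disks.
    print(fraction_of_files_affected(1, 8))
    print(fraction_of_files_affected(8, 8))
```

In this model the hanging disk sets the latency of every request that touches it, and the wider the stripe, the larger the share of files that touch it.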