Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.11.0, Lustre 2.10.2
-
None
-
Centos 7.4, various Lustre and ZFS versions tested. Lustre clients are 2.10.2_RC2.
-
3
Description
I'm running an IOR test (IOR-2.10.3) that writes 1GB files to one dataset/directory, then writes 3GB files to another dataset/directory, then reads back the first dataset. This test sequence is run 25 times. My filesystem can sustain 14-16GB/sec writes, and most iterations of this test produce that bandwidth. The problem is that a few of the 25 iterations turn in significantly lower results, often in the 5-10GB/sec range.
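For anyone trying to reproduce this, a minimal sketch of how the slow iterations could be picked out of a run's per-iteration write bandwidths is shown here. The function name `flag_slow_iterations`, the 0.75 cutoff factor, and the sample numbers are all assumptions made for illustration; they are not taken from the actual test output.

```python
from statistics import median


def flag_slow_iterations(bandwidths_gbs, threshold=0.75):
    """Return (iteration, bandwidth) pairs that fall below threshold * median.

    bandwidths_gbs: write bandwidth per iteration in GB/sec, in run order.
    threshold: fraction of the median below which an iteration counts as
               slow (0.75 is an arbitrary choice for this sketch).
    """
    baseline = median(bandwidths_gbs)
    cutoff = threshold * baseline
    # Iterations are numbered from 1 to match how a test log would count them.
    return [
        (iteration, bw)
        for iteration, bw in enumerate(bandwidths_gbs, start=1)
        if bw < cutoff
    ]


# Illustrative placeholder values only, NOT measurements from this report.
sample = [15.2, 14.8, 7.1, 15.9, 14.3, 9.4, 15.0]
print(flag_slow_iterations(sample))
```

The median is used as the baseline rather than the mean so that the slow iterations themselves do not drag the cutoff down.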
I initially suspected hardware issues, but testing the components, including each individual disk drive, showed everything working properly, and nothing in the logs reports a problem while the test above is running. So I started building and testing various combinations of Lustre and ZFS. The hardware, clients and server OS have been constant across all of the tests. Only SPL/ZFS and Lustre on the server have changed from test to test.
The problem appears to have been introduced in the Lustre 2.10.x branch. I have not seen it occur in the Lustre 2.9 builds I've done. I've built Lustre 2.9 with ZFS 0.7.3 and seen no issue. I've built Lustre 2.10.x with ZFS 0.6.5.7 and do observe the issue. Every build I've done with Lustre 2.10.x (several) showed the issue.