Details
Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.4.0
Description
The IO performance seen with sanityn fsx is abysmal:
https://maloo.whamcloud.com/test_sets/9ac72874-392b-11e1-b15b-5254004bbbd3
Running fsx in this test environment was averaging 0.6 IOPS, doing only 2100 operations over 3600 seconds!
Granted, these are virtual machines with a single disk that is likely getting pounded, and fsx is running on 2 separate client nodes, but this performance is going to be a killer. In comparison, ldiskfs completed 2500 operations in 147 seconds (17 IOPS):
https://maloo.whamcloud.com/sub_tests/71c5c4e0-3af1-11e1-8506-5254004bbbd3
I don't think that running fsx on 2 clients should cause the IO operations to be synchronous, since clients can handle async IO recovery today. It may be, however, that fsx is forcing many of the operations to be synchronous by using mmap and/or O_DIRECT.
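A minimal sketch of the two fsx modes in question (not from this ticket; "testfile" and the 4096-byte block size are illustrative assumptions), showing why both paths end up synchronous at the client: an O_DIRECT write bypasses the page cache entirely, and an mmap'd store flushed with msync(MS_SYNC) blocks until writeback completes:

/* Minimal sketch, not from this ticket: "testfile" and the 4096-byte
 * block size are illustrative assumptions. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const size_t blksz = 4096;
        void *buf;
        int fd;

        fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || posix_memalign(&buf, blksz, blksz) != 0) {
                perror("open/posix_memalign");
                return 1;
        }

        /* O_DIRECT: buffer, offset, and length must all be aligned;
         * the write bypasses the page cache, so it cannot complete
         * until the data has been handed off past the cache. */
        memset(buf, 0xab, blksz);
        if (pwrite(fd, buf, blksz, 0) != (ssize_t)blksz)
                perror("O_DIRECT pwrite");

        /* mmap: the store itself is cached, but msync(MS_SYNC)
         * blocks until the dirtied page has been written back. */
        char *map = mmap(NULL, blksz, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
                map[0] = 1;
                msync(map, blksz, MS_SYNC);
                munmap(map, blksz);
        }

        free(buf);
        close(fd);
        return 0;
}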
It may be that we have to relax the O_DIRECT semantics on ZFS to allow cached IO on the OST, instead of waiting for the data to sync to disk, since there isn't really any mechanism for "cacheless" writes on the OST. The big question is whether the O_DIRECT flag implies "uncached" behaviour on the client, "synchronous writes", or both.
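For reference, Linux already separates the two semantics with distinct open(2) flags, which suggests they need not stay coupled on the OST. A hedged sketch (the path and helper names are illustrative, not Lustre code):

/* Hedged sketch, not Lustre code: Linux open(2) flags that separate
 * the two semantics the question distinguishes. */
#define _GNU_SOURCE
#include <fcntl.h>

/* "Uncached": data bypasses the client page cache.  On the ZFS OST
 * today this also forces a wait for the data to sync to disk. */
int open_uncached(const char *path)
{
        return open(path, O_RDWR | O_DIRECT);
}

/* "Synchronous": data is cached, but each write returns only once
 * the data is durable on disk. */
int open_synchronous(const char *path)
{
        return open(path, O_RDWR | O_SYNC);
}

If O_DIRECT is taken to mean only the "uncached" property, the ZFS OST could acknowledge such writes from its own cache, as proposed above.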