Details
- Type: Improvement
- Resolution: Fixed
- Priority: Major
- Labels: None
- Affects Version/s: Lustre 2.9.0
- Environment: ZFS based OSTs; Lustre 2.9.0 or newer, EL 3.0, EL 3.1
Description
With ZFS OSTs the single-stream read performance for files striped over a single OST has dropped and depends very strongly on some tunables. I use OSTs that are capable of reading a file directly from ZFS at 1.8 GB/s and more, thanks to the good performance of the zfetch prefetcher. Through Lustre, the read performance is often in the range of 300-500 MB/s, which is quite low compared to the solid 1 GB/s we see with ldiskfs on a hardware RAID controller storage unit.
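For reference, such a local baseline can be measured on the OSS itself, bypassing Lustre entirely. The sketch below is only illustrative; the pool name, dataset and paths are placeholders and not taken from this ticket.

# On the OSS: create a scratch dataset on the same pool as the OST (placeholder
# names), write a large test file, then read it back sequentially so that
# zfetch sees a clean streaming pattern.
zfs create -o mountpoint=/mnt/scratch ostpool/scratch
dd if=/dev/zero of=/mnt/scratch/testfile bs=1M count=100000
# Note: if compression is enabled on the pool, zeroes compress away and inflate
# the apparent bandwidth; use incompressible data in that case. Also note that
# "echo 3 > /proc/sys/vm/drop_caches" does not empty the ZFS ARC; export and
# re-import the pool (or reboot) for a fully cold read.
dd if=/mnt/scratch/testfile of=/dev/null bs=1M count=100000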
The explanation for the performance problem is that the RPCs in flight (up to 256) that perform read-ahead for the Lustre client are scheduled in more or less random order on the OSS side (by the ll_ost_io* kernel threads) and break the zfetch access pattern. The best performance is obtained with a low max_rpcs_in_flight and a large max_read_ahead_per_file_mb. Unfortunately these settings are bad for workloads with many streams per client.
The effort in LU-8964 actually makes read performance with ZFS OSTs worse, because the additional parallel tasks are again scheduled in random order.
Measurements were taken with "dd bs=1MB count=100000 ..."; between the measurements the caches were dropped on both the OSS and the client.
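The tunables and cache drops mentioned above correspond to standard commands. As a sketch, one measurement iteration could look like the following; the values are just examples and the file path is a placeholder.

# Client: set the two tunables that are swept in the tables below
# (example values).
lctl set_param osc.*.max_rpcs_in_flight=16
lctl set_param llite.*.max_read_ahead_per_file_mb=64

# Drop caches before each run: clear client DLM locks and cached pages, and
# drop the page cache on both client and OSS (the ZFS ARC on the OSS is only
# fully cleared by an export/import of the pool).
lctl set_param ldlm.namespaces.*.lru_size=clear
echo 3 > /proc/sys/vm/drop_caches        # on the client and on the OSS

# Single-stream read of a file striped over one OST (placeholder path).
dd if=/mnt/lustre/testfile of=/dev/null bs=1M count=100000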
single stream dd bs=1M read bandwidth (MB/s), ZFS prefetch enabled
--------------------------------------------
 rpcs in       max_read_ahead_per_file_mb
  flight      1     16     32     64    256
--------------------------------------------
       1    335    597    800    817    910
       2    379    500    657    705    690
       4    335    444    516    558    615
       8    339    396    439    471    546
      16    378    359    385    404    507
      32    333    360    378    379    429
      64    332    346    377    377    398
     128    375    359    379    381    402
     256    339    351    380    378    409
Disabling the ZFS prefetcher completely helps when max_read_ahead_per_file_mb is huge and max_rpcs_in_flight is large, because the many (randomly ordered) read streams create a prefetching effect of their own. Unfortunately, multi-stream workloads then perform very badly, so disabling prefetch is not really a solution.
single stream dd bs=1M read bandwidth (MB/s), ZFS prefetch disabled
--------------------------------------------
 rpcs in       max_read_ahead_per_file_mb
  flight      1     16     32     64    256
--------------------------------------------
       1    155    247    286    288    283
       2    157    292    360    360    358
       4    157    346    461    465    450
       8    155    389    580    602    604
      16    158    384    614    782    791
      32    152    386    600    878    972
      64    158    386    597    858   1100
     128    155    390    603    863    948
     256    160    382    602    859    934
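For completeness: the ZFS prefetcher is presumably toggled via the standard zfs_prefetch_disable module parameter on the OSS; this is a generic ZFS-on-Linux knob, not a setting quoted from the measurements above.

# On the OSS: disable / re-enable zfetch at runtime.
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable    # prefetch off
echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable    # prefetch on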
There are essentially two approaches to the problem:
- make ZFS zfetch smarter, such that it can cope with the pseudo-randomly ordered read requests from Lustre.
- change the Lustre client such that it has only one RPC in flight to a particular OST object. This would present an acceptable pattern to zfetch and lead to ~1GB/s for the single stream read.
This ticket is about implementing the second option. I have prepared patches for tracking the number of read requests per osc_object, but I have difficulties limiting/enforcing them in osc_cache.c. I am hoping for some hints...