Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version: Lustre 2.12.8
- Environment: CentOS 7.9 - Lustre 2.12.7 servers and 2.12.8 clients with kernel 3.10.0-1160.53.1.el7.x86_64
Description
We're re-exporting a Lustre filesystem over NFSv4+krb5 (using nfsd) and I believe we're hitting the problem of poor write performance described in LU-11070 (Partial page write by NFS).
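For context, the re-export is done with knfsd running on a regular Lustre client. A minimal sketch of that kind of /etc/exports entry (path, hostname pattern and fsid below are placeholders, not our actual configuration):

/mnt/oak  *.example.com(rw,sync,sec=krb5:krb5i:krb5p,no_subtree_check,fsid=101)   # Lustre mount point re-exported over NFSv4+krb5
exportfs -ra   # reload the export table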
When I write to a file over NFS with nice 1 MB blocks, I can see many 1-page read rpcs to the OSTs (from the client that re-exports Lustre over NFSv4).
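Roughly what I do to see this (the file name and mount points are just examples):

# on an NFS client: sequential 1 MB writes
dd if=/dev/zero of=/mnt/oak-nfs/testfile bs=1M count=1024 conv=fsync

# meanwhile, on the NFS server / Lustre client: watch the per-OSC RPC stats
watch -n 5 "lctl get_param osc.oak-OST015c*.rpc_stats | head -n 30"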
The writes done by nfsd must be small or misaligned, leading to ll_io_read_page() being called from ll_write_begin():
[1840548.682492] nfsd            S ffff900c6a12a100     0  2408      2 0x00000080
[1840548.683626] Call Trace:
[1840548.684002]  [<ffffffffb05891b9>] schedule+0x29/0x70
[1840548.684810]  [<ffffffffc0ed0815>] cl_sync_io_wait+0x2b5/0x3d0 [obdclass]
[1840548.685843]  [<ffffffffafedadf0>] ? wake_up_state+0x20/0x20
[1840548.686703]  [<ffffffffc130c076>] ll_io_read_page+0x306/0x4c0 [lustre]
[1840548.687715]  [<ffffffffc11c44b2>] ? lov_attr_get+0x22/0x60 [lov]
[1840548.688629]  [<ffffffffc13200b5>] ll_write_begin+0x505/0xc00 [lustre]
[1840548.689611]  [<ffffffffaffbdc7f>] generic_file_buffered_write+0x10f/0x270
[1840548.690628]  [<ffffffffaffc08e2>] __generic_file_aio_write+0x1e2/0x400
[1840548.691639]  [<ffffffffc132dfab>] __generic_file_write_iter+0xcb/0x340 [lustre]
[1840548.692763]  [<ffffffffc1332304>] vvp_io_write_start+0x504/0xa70 [lustre]
[1840548.693844]  [<ffffffffc0ecd485>] ? cl_lock_enqueue+0x65/0x120 [obdclass]
[1840548.694885]  [<ffffffffc0ecf9e8>] cl_io_start+0x68/0x130 [obdclass]
[1840548.695886]  [<ffffffffc0ed1c5c>] cl_io_loop+0xcc/0x1c0 [obdclass]
[1840548.696879]  [<ffffffffc12e73bb>] ll_file_io_generic+0x63b/0xc90 [lustre]
[1840548.697935]  [<ffffffffafe3800f>] ? __kernel_fpu_end+0x1f/0x40
[1840548.698845]  [<ffffffffc12e7ea9>] ll_file_aio_write+0x289/0x660 [lustre]
[1840548.699852]  [<ffffffffb004f5be>] ? rw_copy_check_uvector+0x5e/0x140
[1840548.700826]  [<ffffffffb004db6b>] do_sync_readv_writev+0x7b/0xd0
[1840548.701759]  [<ffffffffb004f76e>] do_readv_writev+0xce/0x260
[1840548.702652]  [<ffffffffc12e7c20>] ? ll_file_splice_read+0x210/0x210 [lustre]
[1840548.703722]  [<ffffffffc12e8280>] ? ll_file_aio_write+0x660/0x660 [lustre]
[1840548.704780]  [<ffffffffafed31f1>] ? __wake_up_common_lock+0x91/0xc0
[1840548.705748]  [<ffffffffafed2c60>] ? task_rq_unlock+0x20/0x20
[1840548.706688]  [<ffffffffb004f995>] vfs_writev+0x35/0x60
[1840548.707528]  [<ffffffffc0887e85>] nfsd_vfs_write+0xb5/0x390 [nfsd]
[1840548.708498]  [<ffffffffc0895b7f>] nfsd4_write+0x1cf/0x240 [nfsd]
[1840548.709456]  [<ffffffffc089774d>] nfsd4_proc_compound+0x3dd/0x780 [nfsd]
[1840548.710538]  [<ffffffffc0882810>] nfsd_dispatch+0xe0/0x290 [nfsd]
[1840548.711463]  [<ffffffffc083d850>] svc_process_common+0x400/0x7d0 [sunrpc]
[1840548.712526]  [<ffffffffc083dd13>] svc_process+0xf3/0x110 [sunrpc]
[1840548.713463]  [<ffffffffc088216f>] nfsd+0xdf/0x150 [nfsd]
[1840548.714300]  [<ffffffffc0882090>] ? nfsd_destroy+0x80/0x80 [nfsd]
[1840548.715244]  [<ffffffffafec5e61>] kthread+0xd1/0xe0
[1840548.715987]  [<ffffffffafec5d90>] ? insert_kthread_work+0x40/0x40
[1840548.716963]  [<ffffffffb0595de4>] ret_from_fork_nospec_begin+0xe/0x21
[1840548.717976]  [<ffffffffafec5d90>] ? insert_kthread_work+0x40/0x40
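(For reference, similar stacks can be pulled from the running nfsd threads at any time on this kernel, e.g.:)

for p in $(pgrep nfsd); do echo "=== nfsd pid $p"; cat /proc/$p/stack; done   # dump the kernel stack of every nfsd thread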
This dramatically affects write performance to Lustre over NFS. The NFS server / Lustre client is busy with many read RPCs in flight to the OST serving the file, and 98% of those read RPCs are single-page:
[root@oak-h03v09 .testsrcc]# lctl get_param osc.oak-OST015c*.rpc_stats
osc.oak-OST015c-osc-ffff900c98c2a000.rpc_stats=
snapshot_time:         1646445606.359753793 (secs.nsecs)
read RPCs in flight:   9
write RPCs in flight:  0
pending write pages:   1270
pending read pages:    1

                        read                    write
pages per rpc       rpcs   %  cum %  |    rpcs   %  cum %
1:                940581  98     98  |     206   4      4
2:                  8652   0     98  |      76   1      5
4:                   810   0     98  |      53   1      6
8:                    80   0     98  |      26   0      7
16:                  100   0     99  |      26   0      7
32:                  618   0     99  |      31   0      8
64:                  656   0     99  |     315   6     14
128:                  67   0     99  |     341   6     21
256:                 231   0     99  |     696  13     35
512:                  55   0     99  |     535  10     45
1024:               7886   0    100  |    2727  54    100

                        read                    write
rpcs in flight      rpcs   %  cum %  |    rpcs   %  cum %
0:                     0   0      0  |       0   0      0
1:                  9849   1      1  |    3854  76     76
2:                 51413   5      6  |     635  12     89
3:                  2920   0      6  |     214   4     93
4:                  1945   0      6  |     106   2     95
5:                  1992   0      7  |      65   1     96
6:                  5397   0      7  |      49   0     97
7:                  9105   0      8  |      28   0     98
8:                 30909   3     11  |      21   0     98
9:                845544  88     99  |      58   1     99
10:                  662   0    100  |       2   0    100

                        read                    write
offset              rpcs   %  cum %  |    rpcs   %  cum %
0:                  1520   0      0  |    1073  21     21
1:                    51   0      0  |       5   0     21
2:                   128   0      0  |       1   0     21
4:                   171   0      0  |       1   0     21
8:                   100   0      0  |       2   0     21
16:                  188   0      0  |       0   0     21
32:                  397   0      0  |       0   0     21
64:                  797   0      0  |       0   0     21
128:                1545   0      0  |       0   0     21
256:                3816   0      0  |       4   0     21
512:                5070   0      1  |       8   0     21
1024:               8483   0      2  |      59   1     22
2048:              10840   1      3  |      78   1     24
4096:              17936   1      5  |     131   2     27
8192:              40789   4      9  |     203   4     31
16384:             32536   3     12  |     253   5     36
32768:             35348   3     16  |     296   5     42
65536:             43505   4     21  |     429   8     50
131072:           209057  21     42  |     568  11     61
262144:           358818  37     80  |     909  18     79
524288:           188641  19    100  |    1012  20    100
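The counters above are cumulative; one way to look at a single test run is to reset them first (as far as I understand, writing to rpc_stats clears the counters):

lctl set_param osc.oak-OST015c*.rpc_stats=0   # reset the counters
# ... run the NFS write test ...
lctl get_param osc.oak-OST015c*.rpc_stats     # stats for this run only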
If I take some Lustre debug logs, I see many small writes:
[root@oak-h03v09 .testsrcc]# grep ll_write_end /tmp/dk5 | grep pos | head
00000080:00200000:11.0:1646438240.275220:0:2401:0:(rw26.c:830:ll_write_end()) pos 84250624, len 3848, copied 3848
00000080:00200000:11.0:1646438240.275335:0:2401:0:(rw26.c:830:ll_write_end()) pos 84254472, len 248, copied 248
00000080:00200000:13.0:1646438240.290720:0:2405:0:(rw26.c:830:ll_write_end()) pos 86269952, len 3848, copied 3848
00000080:00200000:13.0:1646438240.290810:0:2405:0:(rw26.c:830:ll_write_end()) pos 86273800, len 248, copied 248
00000080:00200000:13.0:1646438240.295724:0:2403:0:(rw26.c:830:ll_write_end()) pos 89305088, len 3848, copied 3848
00000080:00200000:13.0:1646438240.295808:0:2403:0:(rw26.c:830:ll_write_end()) pos 89308936, len 248, copied 248
00000080:00200000:11.0:1646438240.325755:0:2401:0:(rw26.c:830:ll_write_end()) pos 84254720, len 3848, copied 3848
00000080:00200000:11.0:1646438240.325846:0:2401:0:(rw26.c:830:ll_write_end()) pos 84258568, len 248, copied 248
00000080:00200000:13.0:1646438240.340791:0:2405:0:(rw26.c:830:ll_write_end()) pos 86274048, len 3848, copied 3848
00000080:00200000:13.0:1646438240.340971:0:2405:0:(rw26.c:830:ll_write_end()) pos 86277896, len 248, copied 248
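Note that 3848 + 248 = 4096, so it looks like each 4096-byte page is written in two pieces, a 3848-byte chunk followed by a 248-byte chunk. A quick way to get the distribution of write sizes out of such a debug log (the sed expression just pulls out the "len" field):

grep 'll_write_end' /tmp/dk5 | sed -n 's/.*len \([0-9]*\),.*/\1/p' | sort -n | uniq -c | sort -rn | head   # histogram of write sizes seen by ll_write_end()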
Also many of these:
/tmp/dk6:00000080:00000001:10.0:1646441821.230362:0:2402:0:(rw26.c:634:ll_prepare_partial_page()) Process leaving
/tmp/dk6:00000080:00000001:10.0:1646441821.230589:0:2402:0:(rw26.c:591:ll_prepare_partial_page()) Process entered
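They can be counted the same way, e.g.:

grep -c 'll_prepare_partial_page.*Process entered' /tmp/dk6   # number of partial-page preparations in the log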
I'm attaching a sample of logs with +ALL and +rpc as client-nfs-server_oak-h03v09.log.gz![]()
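Roughly how such a log is captured (buffer size and output path below are just examples):

lctl set_param debug=-1        # enable all debug flags ("+ALL"); individual flags can be added with e.g. debug=+rpc
lctl set_param debug_mb=1024   # enlarge the kernel debug buffer (example size)
lctl clear                     # empty the debug buffer before the test
# ... run the NFS write workload ...
lctl dk /tmp/dk5               # dump the debug buffer to a file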
On the Lustre side, I disabled readahead and statahead and played with max_dirty_mb and max_rpcs_in_flight, without luck.
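The knobs in question, with example values (not necessarily the exact ones I tried):

lctl set_param llite.*.max_read_ahead_mb=0   # disable readahead
lctl set_param llite.*.statahead_max=0       # disable statahead
lctl set_param osc.*.max_dirty_mb=512        # example value
lctl set_param osc.*.max_rpcs_in_flight=32   # example value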
On the NFS side, I tried sec=krb5/krb5i/krb5p, without luck. We run kernel 3.10.0-1160.53.1.el7.x86_64 on the Lustre client / NFS server.
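The NFS clients mount roughly like this (server name, export path and transfer sizes are placeholders):

mount -t nfs4 -o sec=krb5,rsize=1048576,wsize=1048576 oak-nfs.example.com:/oak /mnt/oak-nfs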
I'm wondering if anyone has some ideas on how to fix this problem. Happy to provide more logs if needed. Thanks!