  Lustre / LU-15620

Poor write performance with nfsd over Lustre

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.8
    • Component/s: None
    • Environment: CentOS 7.9 - Lustre 2.12.7 servers and 2.12.8 clients with kernel 3.10.0-1160.53.1.el7.x86_64
    • Severity: 3

    Description

      We're re-exporting a Lustre filesystem over NFSv4+krb5 (using nfsd) and I believe we're hitting the problem of poor write performance described in LU-11070 (Partial page write by NFS).

      When I write to a file over NFS using nice 1 MB blocks, I can see many 1-page read RPCs going to the OSTs (issued by the client that re-exports Lustre over NFSv4).

      The writes done by nfsd must be small or misaligned, leading to ll_io_read_page() being called from ll_write_begin() (a hypothetical sketch of this write pattern follows the trace below):

      [1840548.682492] nfsd            S ffff900c6a12a100     0  2408      2 0x00000080
      [1840548.683626] Call Trace:
      [1840548.684002]  [<ffffffffb05891b9>] schedule+0x29/0x70
      [1840548.684810]  [<ffffffffc0ed0815>] cl_sync_io_wait+0x2b5/0x3d0 [obdclass]
      [1840548.685843]  [<ffffffffafedadf0>] ? wake_up_state+0x20/0x20
      [1840548.686703]  [<ffffffffc130c076>] ll_io_read_page+0x306/0x4c0 [lustre]
      [1840548.687715]  [<ffffffffc11c44b2>] ? lov_attr_get+0x22/0x60 [lov]
      [1840548.688629]  [<ffffffffc13200b5>] ll_write_begin+0x505/0xc00 [lustre]
      [1840548.689611]  [<ffffffffaffbdc7f>] generic_file_buffered_write+0x10f/0x270
      [1840548.690628]  [<ffffffffaffc08e2>] __generic_file_aio_write+0x1e2/0x400
      [1840548.691639]  [<ffffffffc132dfab>] __generic_file_write_iter+0xcb/0x340 [lustre]
      [1840548.692763]  [<ffffffffc1332304>] vvp_io_write_start+0x504/0xa70 [lustre]
      [1840548.693844]  [<ffffffffc0ecd485>] ? cl_lock_enqueue+0x65/0x120 [obdclass]
      [1840548.694885]  [<ffffffffc0ecf9e8>] cl_io_start+0x68/0x130 [obdclass]
      [1840548.695886]  [<ffffffffc0ed1c5c>] cl_io_loop+0xcc/0x1c0 [obdclass]
      [1840548.696879]  [<ffffffffc12e73bb>] ll_file_io_generic+0x63b/0xc90 [lustre]
      [1840548.697935]  [<ffffffffafe3800f>] ? __kernel_fpu_end+0x1f/0x40
      [1840548.698845]  [<ffffffffc12e7ea9>] ll_file_aio_write+0x289/0x660 [lustre]
      [1840548.699852]  [<ffffffffb004f5be>] ? rw_copy_check_uvector+0x5e/0x140
      [1840548.700826]  [<ffffffffb004db6b>] do_sync_readv_writev+0x7b/0xd0
      [1840548.701759]  [<ffffffffb004f76e>] do_readv_writev+0xce/0x260
      [1840548.702652]  [<ffffffffc12e7c20>] ? ll_file_splice_read+0x210/0x210 [lustre]
      [1840548.703722]  [<ffffffffc12e8280>] ? ll_file_aio_write+0x660/0x660 [lustre]
      [1840548.704780]  [<ffffffffafed31f1>] ? __wake_up_common_lock+0x91/0xc0
      [1840548.705748]  [<ffffffffafed2c60>] ? task_rq_unlock+0x20/0x20
      [1840548.706688]  [<ffffffffb004f995>] vfs_writev+0x35/0x60
      [1840548.707528]  [<ffffffffc0887e85>] nfsd_vfs_write+0xb5/0x390 [nfsd]
      [1840548.708498]  [<ffffffffc0895b7f>] nfsd4_write+0x1cf/0x240 [nfsd]
      [1840548.709456]  [<ffffffffc089774d>] nfsd4_proc_compound+0x3dd/0x780 [nfsd]
      [1840548.710538]  [<ffffffffc0882810>] nfsd_dispatch+0xe0/0x290 [nfsd]
      [1840548.711463]  [<ffffffffc083d850>] svc_process_common+0x400/0x7d0 [sunrpc]
      [1840548.712526]  [<ffffffffc083dd13>] svc_process+0xf3/0x110 [sunrpc]
      [1840548.713463]  [<ffffffffc088216f>] nfsd+0xdf/0x150 [nfsd]
      [1840548.714300]  [<ffffffffc0882090>] ? nfsd_destroy+0x80/0x80 [nfsd]
      [1840548.715244]  [<ffffffffafec5e61>] kthread+0xd1/0xe0
      [1840548.715987]  [<ffffffffafec5d90>] ? insert_kthread_work+0x40/0x40
      [1840548.716963]  [<ffffffffb0595de4>] ret_from_fork_nospec_begin+0xe/0x21
      [1840548.717976]  [<ffffffffafec5d90>] ? insert_kthread_work+0x40/0x40
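
      For reference, here is a hypothetical userspace sketch that generates the same misaligned write pattern (3848 + 248 bytes per 4 KiB page, matching the ll_write_end debug lines further below). The path, file size and loop count are examples only, and whether it actually triggers the read path depends on the page cache state of the target file:

      # Hypothetical reproducer sketch: copy each 4 KiB page in two misaligned
      # pieces (3848 + 248 bytes), mimicking what nfsd appears to be doing.
      # Requires GNU dd with oflag=seek_bytes (available on CentOS 7).
      FILE=/mnt/lustre/partial-write-test      # example path on a Lustre mount
      for page in $(seq 0 255); do             # 256 pages = 1 MiB
          off=$((page * 4096))
          dd if=/dev/zero of=$FILE bs=3848 count=1 conv=notrunc \
             oflag=seek_bytes seek=$off 2>/dev/null
          dd if=/dev/zero of=$FILE bs=248 count=1 conv=notrunc \
             oflag=seek_bytes seek=$((off + 3848)) 2>/dev/null
      done
      # Afterwards, check osc.*.rpc_stats for 1-page read RPCs (as below).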
       

      This dramatically affects write performance to Lustre over NFS. The NFS server / Lustre client is busy with many read RPCs in flight to the OST serving the file (98% of the read RPCs are single-page):

      [root@oak-h03v09 .testsrcc]# lctl get_param osc.oak-OST015c*.rpc_stats
      osc.oak-OST015c-osc-ffff900c98c2a000.rpc_stats=
      snapshot_time:         1646445606.359753793 (secs.nsecs)
      read RPCs in flight:  9
      write RPCs in flight: 0
      pending write pages:  1270
      pending read pages:   1
      
      			read			write
      pages per rpc         rpcs   % cum % |       rpcs   % cum %
      1:		    940581  98  98   |        206   4   4
      2:		      8652   0  98   |         76   1   5
      4:		       810   0  98   |         53   1   6
      8:		        80   0  98   |         26   0   7
      16:		       100   0  99   |         26   0   7
      32:		       618   0  99   |         31   0   8
      64:		       656   0  99   |        315   6  14
      128:		        67   0  99   |        341   6  21
      256:		       231   0  99   |        696  13  35
      512:		        55   0  99   |        535  10  45
      1024:		      7886   0 100   |       2727  54 100
      
      			read			write
      rpcs in flight        rpcs   % cum % |       rpcs   % cum %
      0:		         0   0   0   |          0   0   0
      1:		      9849   1   1   |       3854  76  76
      2:		     51413   5   6   |        635  12  89
      3:		      2920   0   6   |        214   4  93
      4:		      1945   0   6   |        106   2  95
      5:		      1992   0   7   |         65   1  96
      6:		      5397   0   7   |         49   0  97
      7:		      9105   0   8   |         28   0  98
      8:		     30909   3  11   |         21   0  98
      9:		    845544  88  99   |         58   1  99
      10:		       662   0 100   |          2   0 100
      
      			read			write
      offset                rpcs   % cum % |       rpcs   % cum %
      0:		      1520   0   0   |       1073  21  21
      1:		        51   0   0   |          5   0  21
      2:		       128   0   0   |          1   0  21
      4:		       171   0   0   |          1   0  21
      8:		       100   0   0   |          2   0  21
      16:		       188   0   0   |          0   0  21
      32:		       397   0   0   |          0   0  21
      64:		       797   0   0   |          0   0  21
      128:		      1545   0   0   |          0   0  21
      256:		      3816   0   0   |          4   0  21
      512:		      5070   0   1   |          8   0  21
      1024:		      8483   0   2   |         59   1  22
      2048:		     10840   1   3   |         78   1  24
      4096:		     17936   1   5   |        131   2  27
      8192:		     40789   4   9   |        203   4  31
      16384:		     32536   3  12   |        253   5  36
      32768:		     35348   3  16   |        296   5  42
      65536:		     43505   4  21   |        429   8  50
      131072:		    209057  21  42   |        568  11  61
      262144:		    358818  37  80   |        909  18  79
      524288:		    188641  19 100   |       1012  20 100
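
      Note that the histogram above is cumulative since mount (or the last clear); to attribute it to a single test run, the counters can be reset beforehand. A sketch (writing 0 should clear the histogram, if I remember correctly):

      # Reset the RPC histogram, run one NFS write test, then re-sample:
      lctl set_param osc.oak-OST015c*.rpc_stats=0
      # ... run a single large write over NFS ...
      lctl get_param osc.oak-OST015c*.rpc_stats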
      

      If I take some Lustre debug logs, I see many small writes (note that 3848 + 248 = 4096, so each 4 KiB page is being copied in two pieces):

      [root@oak-h03v09 .testsrcc]# grep ll_write_end /tmp/dk5  | grep pos | head
      00000080:00200000:11.0:1646438240.275220:0:2401:0:(rw26.c:830:ll_write_end()) pos 84250624, len 3848, copied 3848
      00000080:00200000:11.0:1646438240.275335:0:2401:0:(rw26.c:830:ll_write_end()) pos 84254472, len 248, copied 248
      00000080:00200000:13.0:1646438240.290720:0:2405:0:(rw26.c:830:ll_write_end()) pos 86269952, len 3848, copied 3848
      00000080:00200000:13.0:1646438240.290810:0:2405:0:(rw26.c:830:ll_write_end()) pos 86273800, len 248, copied 248
      00000080:00200000:13.0:1646438240.295724:0:2403:0:(rw26.c:830:ll_write_end()) pos 89305088, len 3848, copied 3848
      00000080:00200000:13.0:1646438240.295808:0:2403:0:(rw26.c:830:ll_write_end()) pos 89308936, len 248, copied 248
      00000080:00200000:11.0:1646438240.325755:0:2401:0:(rw26.c:830:ll_write_end()) pos 84254720, len 3848, copied 3848
      00000080:00200000:11.0:1646438240.325846:0:2401:0:(rw26.c:830:ll_write_end()) pos 84258568, len 248, copied 248
      00000080:00200000:13.0:1646438240.340791:0:2405:0:(rw26.c:830:ll_write_end()) pos 86274048, len 3848, copied 3848
      00000080:00200000:13.0:1646438240.340971:0:2405:0:(rw26.c:830:ll_write_end()) pos 86277896, len 248, copied 248
      

      I also see many entries into ll_prepare_partial_page(), consistent with partial-page writes:

      /tmp/dk6:00000080:00000001:10.0:1646441821.230362:0:2402:0:(rw26.c:634:ll_prepare_partial_page()) Process leaving
      /tmp/dk6:00000080:00000001:10.0:1646441821.230589:0:2402:0:(rw26.c:591:ll_prepare_partial_page()) Process entered
      

      I'm attaching a sample of the debug logs taken with +ALL and +rpc as client-nfs-server_oak-h03v09.log.gz.
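
      For reference, this kind of debug log can be captured on the re-exporting client roughly as follows (the buffer size is just an example):

      # Enable full Lustre debugging, clear the trace buffer, run the test,
      # then dump the buffer to a file:
      lctl set_param debug=-1          # all debug flags ("+ALL", includes rpc)
      lctl set_param debug_mb=1024     # example trace buffer size
      lctl clear                       # reset the trace buffer
      # ... run the NFS write test ...
      lctl dk > /tmp/dk5               # dump the debug buffer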

      On the Lustre side, I disabled readahead and statahead and played with max_dirty_mb and max_rpcs_in_flight, without luck (see the example commands below).
      On the NFS side, I tried sec=krb5, krb5i and krb5p, also without luck. We run kernel 3.10.0-1160.53.1.el7.x86_64 on the client / NFS server.
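
      For the record, these are the client-side parameters I mean; the values below are only examples of the kind of settings tried:

      # Example client-side tuning attempts (values are illustrative only):
      lctl set_param llite.*.max_read_ahead_mb=0    # disable readahead
      lctl set_param llite.*.statahead_max=0        # disable statahead
      lctl set_param osc.*.max_dirty_mb=512         # example value
      lctl set_param osc.*.max_rpcs_in_flight=32    # example value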

      I'm wondering if anyone has some ideas on how to fix this problem. Happy to provide more logs if needed. Thanks!

      Attachments

          People

            Assignee: Patrick Farrell (paf0186)
            Reporter: Stephane Thiell (sthiell)