
LU-13131: Partial writes on multi-client strided files


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.5
    • Affects Version/s: Lustre 2.10.4, Lustre 2.10.6, Lustre 2.12.2
    • Environment: CentOS 7.6, Lustre/ZFS 2.10.4, 2.10.6, 2.12.2, socklnd and o2ibd
    • Severity: 3
    • 9223372036854775807

    Description

      Issue:

      Client partial writes occur during multi-node N:1 2MB strided writes to striped files. The problem occurs at scale; it begins to trigger at around 50 concurrent files, where each file receives 2MB strided writes from 128 jobs spread across two Lustre client nodes.

      Sampling of client-side errors:

       

      08-Jan 08:36:26 WARNING: partial write   1048576 != 4194304 at offset 543099453440 - skywalker09
      08-Jan 08:36:26 WARNING: write of   4194304 failed at offset 318150541312 Input/output error [retry 0] - skywalker09
      08-Jan 08:37:13 WARNING: partial write   2097152 != 4194304 at offset 144921591808 - skywalker28
      08-Jan 08:37:13 WARNING: write of   4194304 failed at offset 356448731136 Input/output error [retry 0] - skywalker28 
      08-Jan 08:35:48 WARNING: partial write   1048576 != 4194304 at offset 356121575424 - skywalker19
      08-Jan 08:35:48 WARNING: write of   4194304 failed at offset 281496518656 Input/output error [retry 0] - skywalker19 
      08-Jan 08:39:59 WARNING: close failed - Input/output error - skywalker37
      08-Jan 08:34:32 WARNING: partial write   2097152 != 4194304 at offset 464540139520 - skywalker87
      08-Jan 08:34:32 WARNING: write of   4194304 failed at offset 506952941568 Input/output error [retry 0] - skywalker74
      08-Jan 08:38:06 WARNING: write of   2097152 failed at offset 183297376256 Input/output error [retry 0] - skywalker74 
      08-Jan 08:38:06 WARNING: partial write   2097152 != 4194304 at offset 138403643392 - skywalker63 
      08-Jan 08:40:10 WARNING: close failed - Input/output error - skywalker63
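      For context, below is a minimal sketch (not the actual benchmark code; the helper name, transfer size and message wording are assumptions) of the application-side check that would produce WARNING lines like those above. The symptom is simply pwrite() returning fewer bytes than requested, or -1 with EIO:

      /* Hypothetical writer-side check, for illustration only. */
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      /* Issue one transfer and warn on short or failed writes, mirroring
       * the "partial write X != Y" and "write of X failed" messages. */
      static int checked_pwrite(int fd, const void *buf, size_t len, off_t off)
      {
              ssize_t rc = pwrite(fd, buf, len, off);

              if (rc < 0) {
                      int err = errno;

                      fprintf(stderr, "WARNING: write of %zu failed at offset %lld %s\n",
                              len, (long long)off, strerror(err));
                      return -err;
              }
              if ((size_t)rc != len) {
                      fprintf(stderr, "WARNING: partial write %zd != %zu at offset %lld\n",
                              rc, len, (long long)off);
                      return -EIO;
              }
              return 0;
      }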
       
      

      Lustre Servers: 

      Lustre + ZFS: 2.10.4, 2.10.6, 2.12.2

      Lustre Clients:

      2.10.4, 2.10.6, 2.12.2

      Tested fabrics:

      socklnd (bonded 40GbE) and o2iblnd (FDR & EDR), both direct connect and routed o2ib->tcp. 1MB, 4MB, and 16MB RPC sizes were tested.

      References to other JIRA tickets that are or may be related:

      LU-6389

      HP-259

      Details:

      A single 600GB file is written to by 128 jobs in a 2MB strided write format. The 128 jobs are spread across two Lustre client nodes. 

                 Start 1   Start 2   Start 3   Start 4   Start Last
      job0           0MB     256MB     512MB     768MB     556794MB
      job1           2MB     258MB     514MB     770MB     556796MB
      job2           4MB     260MB     518MB     772MB     556798MB
      ...
      job127       254MB     510MB     766MB    1022MB     556800MB

       

      The file is written with Lustre stripe options of 4-8 OSTs and a stripe_size of 1MB.
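      To make the table above concrete, here is a small sketch (constants taken from the description: 128 jobs, 2MB strides, 1MB stripe_size, up to 8 OSTs; everything else is illustrative) of the per-job offsets and of the standard round-robin mapping of a file offset onto a plain striped layout. Note that with a 1MB stripe_size, every 2MB stride spans two consecutive stripe chunks on two different OSTs when stripe_count > 1:

      #include <stdio.h>

      #define NUM_JOBS     128UL
      #define STRIDE       (2UL << 20)        /* 2MB per job per iteration */
      #define STRIPE_SIZE  (1UL << 20)        /* lfs setstripe -S 1M       */
      #define STRIPE_COUNT 8UL                /* lfs setstripe -c 4 .. 8   */

      /* File offset written by a given job on a given iteration: each
       * iteration advances the whole group by NUM_JOBS * STRIDE = 256MB. */
      static unsigned long job_offset(unsigned long job, unsigned long iteration)
      {
              return iteration * NUM_JOBS * STRIDE + job * STRIDE;
      }

      /* Round-robin (RAID0-style) mapping of a file offset to a stripe. */
      static void map_offset(unsigned long off)
      {
              unsigned long chunk   = off / STRIPE_SIZE;          /* 1MB chunk index  */
              unsigned long stripe  = chunk % STRIPE_COUNT;       /* which OST object */
              unsigned long obj_off = (chunk / STRIPE_COUNT) * STRIPE_SIZE
                                      + off % STRIPE_SIZE;        /* offset in object */

              printf("file offset %12lu -> stripe %lu, object offset %lu\n",
                     off, stripe, obj_off);
      }

      int main(void)
      {
              /* First two iterations for job0..job2, matching the table rows. */
              for (unsigned long it = 0; it < 2; it++)
                      for (unsigned long job = 0; job < 3; job++)
                              map_offset(job_offset(job, it));
              return 0;
      }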

      With the total number of concurrent files being written at ≤ 30, the partial writes are not experienced. In the case of 30 files being written, this reflects 60 Lustre clients grouped in pairs, each client running 64 jobs (threads) and each pair writing a different strided file. The same scenario scaled up to ≥ 60 concurrent files being written, which equates to 120 (or more) Lustre clients in pairs performing strided writes to unique files, triggers the partial write scenario.

      As an observation, the combination of the 2MB strided writes and the 1MB OST stripe would result in a significant number of objects per file, which scales up as the number of files and concurrent writers increases. Perhaps there is a trigger threshold there (I'm spitballing).

      Error messages witnessed:

      Randomly occurring client-side error messages have been observed. It is not known whether they are causing the partial write events or occurring as a result of them.

       

      tgt_grant_check lfs-OSTxxxx cli xxxxxxxxx claims X GRANT, real grant Y
      ldlm_cancel from xx.xx.xx.xx@tcp
      (lib-move.c:4183:lnet_parse()) xx.xx.xx.xx@o2ib, src xx.xx.xx.xx@o2ib: Bad dest nid xx.xx.xx.xx@tcp (it's my nid but on a different network)
      

       

      No notable server-side (MDS / OSS) error messages are observed that coincide with the partial write errors. The OSS nodes are in an HA configuration and no failovers have occurred as a result of the issue; if either OSS were becoming overloaded, corosync messages would likely be missed and trigger a failover.

      Additional information:

      With 2MB strided writes and a 1MB OST stripe size, it would seem that certain larger RPC size / brw_size / ZFS recordsize combinations could trigger significant write amplification, perhaps enough to exacerbate the issue.
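      As a back-of-the-envelope illustration of that concern (the numbers and the simple read-modify-write model below are assumptions, not measurements): when an I/O fragment reaching the OST covers only part of a ZFS record, the whole record must be read, modified and rewritten, so the bytes hitting disk can be a multiple of the bytes sent:

      #include <stdio.h>

      /* Crude RMW estimate: assume any fragment touching part of a record
       * forces the full record to be rewritten.  Real ZFS/Lustre behavior
       * is more nuanced (write aggregation, caching, compression). */
      static double rmw_amplification(unsigned long io_bytes, unsigned long offset,
                                      unsigned long recordsize)
      {
              unsigned long first = offset / recordsize;
              unsigned long last  = (offset + io_bytes - 1) / recordsize;
              unsigned long disk  = (last - first + 1) * recordsize;

              return (double)disk / (double)io_bytes;
      }

      int main(void)
      {
              unsigned long rec = 1UL << 20;   /* assumed 1MB ZFS recordsize */

              /* 1MB fragment landing mid-record touches two full records. */
              printf("1MB at 512KB offset: %.1fx\n",
                     rmw_amplification(1UL << 20, 512UL << 10, rec));
              /* Record-aligned 2MB fragment: no RMW amplification. */
              printf("2MB record-aligned:  %.1fx\n",
                     rmw_amplification(2UL << 20, 0, rec));
              return 0;
      }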

      I have tried to trigger the issue on a testbed filesystem and have run through permutations of server-side / client-side versions, socklnd and o2iblnd direct, o2iblnd LNet routed to socklnd, and 1 / 4 / 16MB RPC sizes, and I cannot trigger the partial write events. Sixteen 64-core AMD Rome clients (eight concurrent files written) is not enough to trigger it.

      Attachments

        1. client.txt
          9 kB
          Zhenyu Xu
        2. lustrelog_mds00_158033093x.txt.gz
          18 kB
          Jeff Johnson
        3. lustrelog_oss00_158033093x.txt.gz
          40.32 MB
          Jeff Johnson
        4. lustrelog_oss01_158033093x_A.txt.gz
          1.25 MB
          Jeff Johnson
        5. lustrelog_oss01_158033093x_B.txt.gz
          38.98 MB
          Jeff Johnson
        6. lustrelog_r4i5n12_158033093x.txt.gz
          1.19 MB
          Jeff Johnson
        7. lustrelog_r4i6n3_158033093x.txt.gz
          910 kB
          Jeff Johnson
        8. lustrelog_r4i6n5_158033093x.txt.gz
          556 kB
          Jeff Johnson
        9. messages_oss00_158033093x.txt.gz
          0.4 kB
          Jeff Johnson
        10. messages_oss01_158033093x.txt.gz
          101 kB
          Jeff Johnson
        11. messages_r4i5n12_158033093x.txt.gz
          1 kB
          Jeff Johnson
        12. messages_r4i6n3_158033093x.txt.gz
          2 kB
          Jeff Johnson
        13. messages_r4i6n5_158033093x.txt.gz
          1 kB
          Jeff Johnson
        14. oss.txt
          4 kB
          Zhenyu Xu
        15. oss00_stat.log.gz
          0.9 kB
          Jeff Johnson
        16. oss01_stat.log.gz
          0.9 kB
          Jeff Johnson
        17. palpatine20-lustredbg.txt.gz
          121 kB
          Jeff Johnson
        18. testbed01-oss00-lustredbg.txt.gz
          46 kB
          Jeff Johnson
        19. testbed01-oss00-lustredbg-FULL.txt.gz
          2.66 MB
          Jeff Johnson


            People

              Assignee: Oleg Drokin (green)
              Reporter: Jeff Johnson (aeonjeff)
              Votes: 0
              Watchers: 19
