LU-13131: Partial writes on multi-client strided files

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.5
    • Affects Version/s: Lustre 2.10.4, Lustre 2.10.6, Lustre 2.12.2
    • Environment: CentOS 7.6, Lustre/ZFS 2.10.4, 2.10.6, 2.12.2, socklnd and o2iblnd
    • Severity: 3

    Description

      Issue:

      Clients experience partial writes during multi-node N:1 2MB strided writes to striped files. The problem occurs at scale, starting at around 50 concurrent files, where each file receives 2MB strided writes from 128 jobs spread across two Lustre client nodes.

      A sampling of client-side errors:

       

      08-Jan 08:36:26 WARNING: partial write   1048576 != 4194304 at offset 543099453440 - skywalker09
      08-Jan 08:36:26 WARNING: write of   4194304 failed at offset 318150541312 Input/output error [retry 0] - skywalker09
      08-Jan 08:37:13 WARNING: partial write   2097152 != 4194304 at offset 144921591808 - skywalker28
      08-Jan 08:37:13 WARNING: write of   4194304 failed at offset 356448731136 Input/output error [retry 0] - skywalker28 
      08-Jan 08:35:48 WARNING: partial write   1048576 != 4194304 at offset 356121575424 - skywalker19
      08-Jan 08:35:48 WARNING: write of   4194304 failed at offset 281496518656 Input/output error [retry 0] - skywalker19 
      08-Jan 08:39:59 WARNING: close failed - Input/output error - skywalker37
      08-Jan 08:34:32 WARNING: partial write   2097152 != 4194304 at offset 464540139520 - skywalker87
      08-Jan 08:34:32 WARNING: write of   4194304 failed at offset 506952941568 Input/output error [retry 0] - skywalker74
      08-Jan 08:38:06 WARNING: write of   2097152 failed at offset 183297376256 Input/output error [retry 0] - skywalker74 
      08-Jan 08:38:06 WARNING: partial write   2097152 != 4194304 at offset 138403643392 - skywalker63 
      08-Jan 08:40:10 WARNING: close failed - Input/output error - skywalker63
       
      

      Lustre Servers: 

      Lustre + ZFS: 2.10.4, 2.10.6, 2.12.2

      Lustre Clients:

      2.10.4, 2.10.6, 2.12.2

      Tested fabrics:

      socklnd (bonded 40GbE) and o2iblnd (FDR & EDR), both direct-connect and routed o2ib->tcp. 1MB, 4MB, and 16MB RPC sizes were tested.
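
      (The exact tuning commands are not recorded in this ticket; RPC size is typically varied with parameters like the following, shown here with illustrative values, where 4096 x 4KB pages gives 16MB RPCs:)

      # on the OSS, brw_size is set in MB:
      lctl set_param obdfilter.*.brw_size=16
      # on the clients:
      lctl set_param osc.*.max_pages_per_rpc=4096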

      References to other JIRA tickets that are or may be related:

      LU-6389

      HP-259

      Details:

      A single 600GB file is written to by 128 jobs using a 2MB strided write pattern. The 128 jobs are spread across two Lustre client nodes, and each job's write offsets advance as follows:

                 Start 1    Start 2    Start 3    Start 4    Start Last
      job0       0 MB       256 MB     512 MB     768 MB     556794 MB
      job1       2 MB       258 MB     514 MB     770 MB     556796 MB
      job2       4 MB       260 MB     516 MB     772 MB     556798 MB
      ...
      job127     254 MB     510 MB     766 MB     1022 MB    556800 MB
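
      The reproducer source is not attached to this ticket; the following is only a minimal sketch of the access pattern described above (the file path, job-id handling, and block count are illustrative assumptions, not the actual reproducer):

      /*
       * Minimal sketch of the N:1 strided write pattern (NOT the actual
       * reproducer).  Each of the 128 jobs writes 2MB blocks at a 256MB
       * stride (128 jobs x 2MB), so job J's k-th write lands at offset
       * J*2MB + k*256MB, matching the table above.
       */
      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      #define BLOCK  (2UL << 20)        /* 2MB per strided write */
      #define NJOBS  128UL              /* jobs across two client nodes */
      #define STRIDE (NJOBS * BLOCK)    /* 256MB between a job's writes */

      int main(int argc, char **argv)
      {
          if (argc != 4) {
              fprintf(stderr, "usage: %s <file> <job_id 0-127> <nblocks>\n", argv[0]);
              return 1;
          }
          unsigned long job = strtoul(argv[2], NULL, 0);
          unsigned long nblocks = strtoul(argv[3], NULL, 0);
          char *buf = malloc(BLOCK);
          if (!buf)
              return 1;
          memset(buf, 'x', BLOCK);

          int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
          if (fd < 0) { perror("open"); return 1; }

          for (unsigned long k = 0; k < nblocks; k++) {
              off_t off = (off_t)(job * BLOCK + k * STRIDE);
              ssize_t rc = pwrite(fd, buf, BLOCK, off);
              if (rc != (ssize_t)BLOCK)  /* the "partial write" case above */
                  fprintf(stderr, "partial write %zd != %lu at offset %jd\n",
                          rc, BLOCK, (intmax_t)off);
          }
          if (close(fd) < 0)             /* the "close failed" case above */
              perror("close");
          free(buf);
          return 0;
      }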

       

      The file is written with a Lustre stripe count of 4-8 OSTs and a stripe_size of 1MB.
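
      For reference, a layout like this could be created with, for example (mount point and stripe count are illustrative):

      lfs setstripe -c 8 -S 1M /mnt/lustre/strided_file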

      With a total of ≤ 30 files being written concurrently, the partial writes are not experienced. In the 30-file case this corresponds to 60 Lustre clients grouped in pairs, 64 jobs (threads) per client, with each pair writing a different strided file. Scaling the same scenario up to ≥ 60 concurrent files, which equates to 120 or more Lustre clients in pairs performing strided writes to unique files, triggers the partial write scenario.

      One observation: the combination of the 2MB strided writes and the 1MB OST stripe results in a significant total number of objects being written, which scales up as the number of files and concurrent writers increases. Perhaps there is a trigger in there (I'm spitballing).

      Error messages witnessed:

      There have been randomly occurring client-side error messages. It is not known whether they cause the partial write events or occur as a result of them.

       

      tgt_grant_check lfs-OSTxxxx cli xxxxxxxxx claims X GRANT, real grant Y
      ldlm_cancel from xx.xx.xx.xx@tcp
      (lib-move.c:4183:lnet_parse()) xx.xx.xx.xx@o2ib, src xx.xx.xx.xx@o2ib: Bad dest nid xx.xx.xx.xx@tcp (it's my nid but on a different network)
      

       

      No notable server-side (MDS/OSS) error messages are observed that coincide with the partial write errors. The OSS nodes are in an HA configuration and no failovers have occurred as a result of the issue; if either OSS were becoming overloaded, corosync messages would likely be missed and trigger a failover.

      Additional information:

      With 2MB strided writes and a 1MB OST stripe size, it would seem that certain larger RPC size / brw_size / ZFS recordsize combinations could trigger significant write amplification, perhaps enough to exacerbate the issue.
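
      To illustrate with hypothetical numbers (not measurements from this system): with a 16MB ZFS recordsize, an RPC that dirties only 1MB of a record forces a read-modify-write of the full 16MB record, up to 16x amplification, and a 2MB strided write split across two 1MB stripes on different OSTs can pay that penalty twice.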

      I have tried to trigger the issue on a testbed filesystem and have run through permutations of server-side/client-side versions: socklnd and o2iblnd direct, o2iblnd LNet-routed to socklnd, and 1/4/16MB RPC sizes, and I cannot trigger the partial write events. Sixteen 64-core AMD Rome clients (eight concurrent files written) are not enough to trigger it.

      Attachments

        1. client.txt
          9 kB
          Zhenyu Xu
        2. lustrelog_mds00_158033093x.txt.gz
          18 kB
          Jeff Johnson
        3. lustrelog_oss00_158033093x.txt.gz
          40.32 MB
          Jeff Johnson
        4. lustrelog_oss01_158033093x_A.txt.gz
          1.25 MB
          Jeff Johnson
        5. lustrelog_oss01_158033093x_B.txt.gz
          38.98 MB
          Jeff Johnson
        6. lustrelog_r4i5n12_158033093x.txt.gz
          1.19 MB
          Jeff Johnson
        7. lustrelog_r4i6n3_158033093x.txt.gz
          910 kB
          Jeff Johnson
        8. lustrelog_r4i6n5_158033093x.txt.gz
          556 kB
          Jeff Johnson
        9. messages_oss00_158033093x.txt.gz
          0.4 kB
          Jeff Johnson
        10. messages_oss01_158033093x.txt.gz
          101 kB
          Jeff Johnson
        11. messages_r4i5n12_158033093x.txt.gz
          1 kB
          Jeff Johnson
        12. messages_r4i6n3_158033093x.txt.gz
          2 kB
          Jeff Johnson
        13. messages_r4i6n5_158033093x.txt.gz
          1 kB
          Jeff Johnson
        14. oss.txt
          4 kB
          Zhenyu Xu
        15. oss00_stat.log.gz
          0.9 kB
          Jeff Johnson
        16. oss01_stat.log.gz
          0.9 kB
          Jeff Johnson
        17. palpatine20-lustredbg.txt.gz
          121 kB
          Jeff Johnson
        18. testbed01-oss00-lustredbg.txt.gz
          46 kB
          Jeff Johnson
        19. testbed01-oss00-lustredbg-FULL.txt.gz
          2.66 MB
          Jeff Johnson

          Activity

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38610/
            Subject: LU-13131 osc: Ensure immediate departure of sync write pages
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 0c6454503e1fede795d9b094ee92c91f4290924b

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38672/
            Subject: LU-13131 osc: Do not wait for grants for too long
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 0dda74eec7a29c98c7b6ee9a99e54c7dbefcabca
            pjones Peter Jones added a comment -

            Nice!

            aeonjeff Jeff Johnson added a comment -

            Overnight runs of 300+ client nodes. Never failed. Stick a fork in it, this one's done. #bravo


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37468/
            Subject: LU-13131 osc: Always send all HP RPCs requests
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 05dc36b07fc1ac40988f7994cd37cd239d9a8986
            green Oleg Drokin added a comment -

            Thank you for the update, great to hear that.

            pjones Peter Jones added a comment -

            Thanks for the update, Jeff - this is encouraging news.

            aeonjeff Jeff Johnson added a comment -

            So far the latest patch set has run the reproducer very stably over two days. First runs were at 70 clients, subsequent runs at 200+ clients. I have never seen this level of stability running the reproducer. Gathering more nodes to scale up higher, but this is looking good.

            pjones Peter Jones added a comment -

            All patches landed for 2.14. aeonjeff, we're still keen to get your report back on the effectiveness of the fixes in your testing.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38672
            Subject: LU-13131 osc: Do not wait for grants for too long
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 319b13d947a6c7a5f8d77813804b6a0721e80b27

            People

              Assignee: green Oleg Drokin
              Reporter: aeonjeff Jeff Johnson
              Votes: 0
              Watchers: 19
