Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8573

IOR: niobuf.c:319:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • Lustre 2.9.0, Lustre 2.10.3
    • None
    • Lustre-master 2.8.56_68_gd4a4c07 build 3430, RHEL7, Spirit performance cluster Revision: d4a4c0795b3befb87d47a5bf441adeba3b1c36f8
    • 3
    • 9223372036854775807

    Description

      Attempted IOR test on both sets of OSTs.
      No progress made on test.
      Client and OST logs are full of:
      OST

      LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 3, status -113, desc ffff88104dc53200
      LustreError: 9071:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff880b0c956200
      LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 5, status -113, desc ffff880b13e95200
      LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 3, status -113, desc ffff880b13e95200
      LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 5, status -113, desc ffff881032b5a800
      LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 3, status -113, desc ffff881032b5a800
      

      Client

      [12378.313197] LustreError: 3138:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8807442f9c00
      [12378.313223] LustreError: 3136:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8807442f9800
      [12378.337649] LustreError: 3136:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8810102c4200
      [12378.337671] LustreError: 3136:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880f55547400
      [12378.337677] LustreError: 3138:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880729970000
      [12378.362167] LustreError: 3137:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880804b78800
      [12378.374324] LustreError: 3137:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8807299e5200
      [12378.410886] LustreError: 3137:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880f53ce1c00
      

      Eventually, one client LBUGS

      [12377.942289] LustreError: 3153:0:(niobuf.c:319:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed: 
      [12378.564210] LustreError: 3153:0:(niobuf.c:319:ptlrpc_register_bulk()) LBUG
      [12378.564210] LustreError: 3153:0:(niobuf.c:319:ptlrpc_register_bulk()) LBUG
      [12378.571890] Pid: 3153, comm: ptlrpcd_01_02
      [12378.576468]
      Call Trace:
      [12378.580868]  [<ffffffffa08557d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      [12378.588652]  [<ffffffffa0855d75>] lbug_with_loc+0x45/0xc0 [libcfs]
      [12378.595597]  [<ffffffffa0c3e661>] ptlrpc_register_bulk+0x831/0x9c0 [ptlrpc]
      [12378.603389]  [<ffffffffa08cace2>] ? LNetMDUnlink+0xe2/0x180 [lnet]
      [12378.610322]  [<ffffffffa0c6be76>] ? sptlrpc_import_sec_ref+0x36/0x40 [ptlrpc]
      [12378.618321]  [<ffffffffa0c3f1af>] ptl_send_rpc+0x1ff/0xda0 [ptlrpc]
      [12378.625361]  [<ffffffffa0c39256>] ptlrpc_check_set.part.23+0x1896/0x1dd0 [ptlrpc]
      [12378.633743]  [<ffffffffa0c397eb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      [12378.640976]  [<ffffffffa0c643fb>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
      [12378.648095]  [<ffffffffa0c647ab>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      [12378.654613]  [<ffffffff810b8910>] ? default_wake_function+0x0/0x20
      [12378.661550]  [<ffffffffa0c644f0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
      [12378.668067]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
      [12378.673515]  [<ffffffff810a5a60>] ? kthread+0x0/0xe0
      [12378.679058]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [12378.685086]  [<ffffffff810a5a60>] ? kthread+0x0/0xe0
      [12378.690626]
      [12378.692396] Kernel panic - not syncing: LBUG
      [12378.697164] CPU: 9 PID: 3153 Comm: ptlrpcd_01_02 Tainted: G           OE  ------------   3.10.0-327.28.3.el7.x86_64 #1
      [12378.709098] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
      [12378.720548]  ffffffffa0872def 000000003210ab53 ffff881000c1ba80 ffffffff81636453
      [12378.728852]  ffff881000c1bb00 ffffffff8162fce7 ffffffff00000008 ffff881000c1bb10
      [12378.737165]  ffff881000c1bab0 000000003210ab53 ffffffffa0c90d30 0000000000000246
      [12378.745463] Call Trace:
      [12378.748204]  [<ffffffff81636453>] dump_stack+0x19/0x1b
      [12378.753939]  [<ffffffff8162fce7>] panic+0xd8/0x1e7
      [12378.759295]  [<ffffffffa0855ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
      [12378.766235]  [<ffffffffa0c3e661>] ptlrpc_register_bulk+0x831/0x9c0 [ptlrpc]
      [12378.774009]  [<ffffffffa08cace2>] ? LNetMDUnlink+0xe2/0x180 [lnet]
      [12378.780949]  [<ffffffffa0c6be76>] ? sptlrpc_import_sec_ref+0x36/0x40 [ptlrpc]
      [12378.788950]  [<ffffffffa0c3f1af>] ptl_send_rpc+0x1ff/0xda0 [ptlrpc]
      [12378.795981]  [<ffffffffa0c39256>] ptlrpc_check_set.part.23+0x1896/0x1dd0 [ptlrpc]
      [12378.804368]  [<ffffffffa0c397eb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      [12378.811596]  [<ffffffffa0c643fb>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
      [12378.818725]  [<ffffffffa0c647ab>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      [12378.825237]  [<ffffffff810b8910>] ? wake_up_state+0x20/0x20
      [12378.831492]  [<ffffffffa0c644f0>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc]
      [12378.838778]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
      [12378.844224]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
      [12378.851510]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [12378.857534]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
      

      vmcore is available on Spirit for analysis

      Attachments

        1. client.spirit-28.log.gz
          2.98 MB
        2. lu-8573.console.tar
          0.2 kB
        3. lu-8573-syslog.tar
          0.2 kB
        4. mds.sprit-3.log.gz
          0.2 kB
        5. oss.aeon-1.log.gz
          0.2 kB
        6. oss.aeon-2.log.gz
          2.29 MB
        7. spirit-11.log.gz
          1.24 MB
        8. spirit-14-lustre-log.20160928_2029-4-lu-8573.bz2
          1.01 MB
        9. spirit-15-lustre-log.20160928_2029-4-lu-8573.bz2
          954 kB
        10. spirit-16-lustre-log.20160928_2029-4-lu-8573.bz2
          1.04 MB
        11. spirit-17.log.gz
          1.05 MB
        12. spirit-20.log.gz
          1.23 MB
        13. spirit-33.log-20160904
          15 kB
        14. spirit-3-lustre-log.20160928_2029-4-lu-8573.bz2
          0.3 kB
        15. spirit-4.log.gz
          2.20 MB
        16. spirit-5.callback.txt.gz
          0.2 kB
        17. spirit-5.log.gz
          1.28 MB
        18. spirit-6.callback.lustre.log.gz
          0.3 kB
        19. spirit-6.log.gz
          1.42 MB
        20. spirit-aeon-1-lustre-log.20160928_2029-4-lu-8573.bz2
          0.3 kB
        21. spirit-aeon-2-lustre-log.20160928_2029-4-lu-8573.bz2
          0.3 kB
        22. sprit-11.callbacks.lustre.log.gz
          1.98 MB

        Issue Links

          Activity

            [LU-8573] IOR: niobuf.c:319:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed

            Closing this ticket again, and using LU-11647 for the new patch. This avoids confusing the fix version for the patch.

            adilger Andreas Dilger added a comment - Closing this ticket again, and using LU-11647 for the new patch. This avoids confusing the fix version for the patch.

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: http://review.whamcloud.com/33167
            Subject: LU-8573 ptlrpc: race with reply_in_callback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ac6b543a08033ddb5879706c0b3c941d3cdfbef0

            hongchao.zhang Hongchao Zhang added a comment - Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: http://review.whamcloud.com/33167 Subject: LU-8573 ptlrpc: race with reply_in_callback Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ac6b543a08033ddb5879706c0b3c941d3cdfbef0
            pjones Peter Jones added a comment -

            Landed for 2.9

            pjones Peter Jones added a comment - Landed for 2.9

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23439/
            Subject: LU-8573 lnet: Revert LU-7650 patches
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 59fbfa5f3fa925f676636f1b78c986d05cd295bc

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23439/ Subject: LU-8573 lnet: Revert LU-7650 patches Project: fs/lustre-release Branch: master Current Patch Set: Commit: 59fbfa5f3fa925f676636f1b78c986d05cd295bc

            Received patch from Alex, "Doug thinks that can be introduced by LU-7650.. I just pushed a patch to revert that: http://review.whamcloud.com/23439"
            I was able to successfully complete IOR runs on both sets of servers, will retry ZFS today.

            cliffw Cliff White (Inactive) added a comment - Received patch from Alex, "Doug thinks that can be introduced by LU-7650 .. I just pushed a patch to revert that: http://review.whamcloud.com/23439 " I was able to successfully complete IOR runs on both sets of servers, will retry ZFS today.

            Repeated the fault with ldiskfs, set debug=-1 and dumped lustre logs after the hang occurred.
            MDS = spirit-4 OSS= spirit-5/6, all others are clients.

            cliffw Cliff White (Inactive) added a comment - Repeated the fault with ldiskfs, set debug=-1 and dumped lustre logs after the hang occurred. MDS = spirit-4 OSS= spirit-5/6, all others are clients.

            One note: the IOR test makes no progress due to these errors, it may not be hitting an ASSERT, but the code is completely un-useable.

            cliffw Cliff White (Inactive) added a comment - One note: the IOR test makes no progress due to these errors, it may not be hitting an ASSERT, but the code is completely un-useable.
            green Oleg Drokin added a comment -

            I think the IO errors aside, it's still important to fix the assertion since twe don't want to have the crashes.
            Now if io erors happen to be due to network errors later - then we can dig into this separately, but at least we won't be crashing anything.

            green Oleg Drokin added a comment - I think the IO errors aside, it's still important to fix the assertion since twe don't want to have the crashes. Now if io erors happen to be due to network errors later - then we can dig into this separately, but at least we won't be crashing anything.

            MDS and sample client lustre logs.

            cliffw Cliff White (Inactive) added a comment - MDS and sample client lustre logs.

            Lustre logs from both OSS nodes. backfstype was ZFS for this test, error can be reproduced on either ZFS or ldiskfs

            cliffw Cliff White (Inactive) added a comment - Lustre logs from both OSS nodes. backfstype was ZFS for this test, error can be reproduced on either ZFS or ldiskfs

            People

              hongchao.zhang Hongchao Zhang
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: