Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8573

IOR: niobuf.c:319:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • Lustre 2.9.0, Lustre 2.10.3
    • None
    • Lustre-master 2.8.56_68_gd4a4c07 build 3430, RHEL7, Spirit performance cluster Revision: d4a4c0795b3befb87d47a5bf441adeba3b1c36f8
    • 3
    • 9223372036854775807

    Description

      Attempted IOR test on both sets of OSTs.
      No progress made on test.
      Client and OST logs are full of:
      OST

      LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 3, status -113, desc ffff88104dc53200
      LustreError: 9071:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff880b0c956200
      LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 5, status -113, desc ffff880b13e95200
      LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 3, status -113, desc ffff880b13e95200
      LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 5, status -113, desc ffff881032b5a800
      LustreError: 5245:0:(events.c:449:server_bulk_callback()) event type 3, status -113, desc ffff881032b5a800
      

      Client

      [12378.313197] LustreError: 3138:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8807442f9c00
      [12378.313223] LustreError: 3136:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8807442f9800
      [12378.337649] LustreError: 3136:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8810102c4200
      [12378.337671] LustreError: 3136:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880f55547400
      [12378.337677] LustreError: 3138:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880729970000
      [12378.362167] LustreError: 3137:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880804b78800
      [12378.374324] LustreError: 3137:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8807299e5200
      [12378.410886] LustreError: 3137:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880f53ce1c00
      

      Eventually, one client LBUGS

      [12377.942289] LustreError: 3153:0:(niobuf.c:319:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed: 
      [12378.564210] LustreError: 3153:0:(niobuf.c:319:ptlrpc_register_bulk()) LBUG
      [12378.564210] LustreError: 3153:0:(niobuf.c:319:ptlrpc_register_bulk()) LBUG
      [12378.571890] Pid: 3153, comm: ptlrpcd_01_02
      [12378.576468]
      Call Trace:
      [12378.580868]  [<ffffffffa08557d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      [12378.588652]  [<ffffffffa0855d75>] lbug_with_loc+0x45/0xc0 [libcfs]
      [12378.595597]  [<ffffffffa0c3e661>] ptlrpc_register_bulk+0x831/0x9c0 [ptlrpc]
      [12378.603389]  [<ffffffffa08cace2>] ? LNetMDUnlink+0xe2/0x180 [lnet]
      [12378.610322]  [<ffffffffa0c6be76>] ? sptlrpc_import_sec_ref+0x36/0x40 [ptlrpc]
      [12378.618321]  [<ffffffffa0c3f1af>] ptl_send_rpc+0x1ff/0xda0 [ptlrpc]
      [12378.625361]  [<ffffffffa0c39256>] ptlrpc_check_set.part.23+0x1896/0x1dd0 [ptlrpc]
      [12378.633743]  [<ffffffffa0c397eb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      [12378.640976]  [<ffffffffa0c643fb>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
      [12378.648095]  [<ffffffffa0c647ab>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      [12378.654613]  [<ffffffff810b8910>] ? default_wake_function+0x0/0x20
      [12378.661550]  [<ffffffffa0c644f0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
      [12378.668067]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
      [12378.673515]  [<ffffffff810a5a60>] ? kthread+0x0/0xe0
      [12378.679058]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [12378.685086]  [<ffffffff810a5a60>] ? kthread+0x0/0xe0
      [12378.690626]
      [12378.692396] Kernel panic - not syncing: LBUG
      [12378.697164] CPU: 9 PID: 3153 Comm: ptlrpcd_01_02 Tainted: G           OE  ------------   3.10.0-327.28.3.el7.x86_64 #1
      [12378.709098] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
      [12378.720548]  ffffffffa0872def 000000003210ab53 ffff881000c1ba80 ffffffff81636453
      [12378.728852]  ffff881000c1bb00 ffffffff8162fce7 ffffffff00000008 ffff881000c1bb10
      [12378.737165]  ffff881000c1bab0 000000003210ab53 ffffffffa0c90d30 0000000000000246
      [12378.745463] Call Trace:
      [12378.748204]  [<ffffffff81636453>] dump_stack+0x19/0x1b
      [12378.753939]  [<ffffffff8162fce7>] panic+0xd8/0x1e7
      [12378.759295]  [<ffffffffa0855ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
      [12378.766235]  [<ffffffffa0c3e661>] ptlrpc_register_bulk+0x831/0x9c0 [ptlrpc]
      [12378.774009]  [<ffffffffa08cace2>] ? LNetMDUnlink+0xe2/0x180 [lnet]
      [12378.780949]  [<ffffffffa0c6be76>] ? sptlrpc_import_sec_ref+0x36/0x40 [ptlrpc]
      [12378.788950]  [<ffffffffa0c3f1af>] ptl_send_rpc+0x1ff/0xda0 [ptlrpc]
      [12378.795981]  [<ffffffffa0c39256>] ptlrpc_check_set.part.23+0x1896/0x1dd0 [ptlrpc]
      [12378.804368]  [<ffffffffa0c397eb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      [12378.811596]  [<ffffffffa0c643fb>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
      [12378.818725]  [<ffffffffa0c647ab>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      [12378.825237]  [<ffffffff810b8910>] ? wake_up_state+0x20/0x20
      [12378.831492]  [<ffffffffa0c644f0>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc]
      [12378.838778]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
      [12378.844224]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
      [12378.851510]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [12378.857534]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
      

      vmcore is available on Spirit for analysis

      Attachments

        1. client.spirit-28.log.gz
          2.98 MB
        2. lu-8573.console.tar
          0.2 kB
        3. lu-8573-syslog.tar
          0.2 kB
        4. mds.sprit-3.log.gz
          0.2 kB
        5. oss.aeon-1.log.gz
          0.2 kB
        6. oss.aeon-2.log.gz
          2.29 MB
        7. spirit-11.log.gz
          1.24 MB
        8. spirit-14-lustre-log.20160928_2029-4-lu-8573.bz2
          1.01 MB
        9. spirit-15-lustre-log.20160928_2029-4-lu-8573.bz2
          954 kB
        10. spirit-16-lustre-log.20160928_2029-4-lu-8573.bz2
          1.04 MB
        11. spirit-17.log.gz
          1.05 MB
        12. spirit-20.log.gz
          1.23 MB
        13. spirit-33.log-20160904
          15 kB
        14. spirit-3-lustre-log.20160928_2029-4-lu-8573.bz2
          0.3 kB
        15. spirit-4.log.gz
          2.20 MB
        16. spirit-5.callback.txt.gz
          0.2 kB
        17. spirit-5.log.gz
          1.28 MB
        18. spirit-6.callback.lustre.log.gz
          0.3 kB
        19. spirit-6.log.gz
          1.42 MB
        20. spirit-aeon-1-lustre-log.20160928_2029-4-lu-8573.bz2
          0.3 kB
        21. spirit-aeon-2-lustre-log.20160928_2029-4-lu-8573.bz2
          0.3 kB
        22. sprit-11.callbacks.lustre.log.gz
          1.98 MB

        Issue Links

          Activity

            [LU-8573] IOR: niobuf.c:319:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed

            Closing this ticket again, and using LU-11647 for the new patch. This avoids confusing the fix version for the patch.

            adilger Andreas Dilger added a comment - Closing this ticket again, and using LU-11647 for the new patch. This avoids confusing the fix version for the patch.

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: http://review.whamcloud.com/33167
            Subject: LU-8573 ptlrpc: race with reply_in_callback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ac6b543a08033ddb5879706c0b3c941d3cdfbef0

            hongchao.zhang Hongchao Zhang added a comment - Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: http://review.whamcloud.com/33167 Subject: LU-8573 ptlrpc: race with reply_in_callback Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ac6b543a08033ddb5879706c0b3c941d3cdfbef0
            pjones Peter Jones added a comment -

            Landed for 2.9

            pjones Peter Jones added a comment - Landed for 2.9

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23439/
            Subject: LU-8573 lnet: Revert LU-7650 patches
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 59fbfa5f3fa925f676636f1b78c986d05cd295bc

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23439/ Subject: LU-8573 lnet: Revert LU-7650 patches Project: fs/lustre-release Branch: master Current Patch Set: Commit: 59fbfa5f3fa925f676636f1b78c986d05cd295bc

            Received patch from Alex, "Doug thinks that can be introduced by LU-7650.. I just pushed a patch to revert that: http://review.whamcloud.com/23439"
            I was able to successfully complete IOR runs on both sets of servers, will retry ZFS today.

            cliffw Cliff White (Inactive) added a comment - Received patch from Alex, "Doug thinks that can be introduced by LU-7650 .. I just pushed a patch to revert that: http://review.whamcloud.com/23439 " I was able to successfully complete IOR runs on both sets of servers, will retry ZFS today.

            Repeated the fault with ldiskfs, set debug=-1 and dumped lustre logs after the hang occurred.
            MDS = spirit-4 OSS= spirit-5/6, all others are clients.

            cliffw Cliff White (Inactive) added a comment - Repeated the fault with ldiskfs, set debug=-1 and dumped lustre logs after the hang occurred. MDS = spirit-4 OSS= spirit-5/6, all others are clients.

            One note: the IOR test makes no progress due to these errors, it may not be hitting an ASSERT, but the code is completely un-useable.

            cliffw Cliff White (Inactive) added a comment - One note: the IOR test makes no progress due to these errors, it may not be hitting an ASSERT, but the code is completely un-useable.
            green Oleg Drokin added a comment -

            I think the IO errors aside, it's still important to fix the assertion since twe don't want to have the crashes.
            Now if io erors happen to be due to network errors later - then we can dig into this separately, but at least we won't be crashing anything.

            green Oleg Drokin added a comment - I think the IO errors aside, it's still important to fix the assertion since twe don't want to have the crashes. Now if io erors happen to be due to network errors later - then we can dig into this separately, but at least we won't be crashing anything.

            MDS and sample client lustre logs.

            cliffw Cliff White (Inactive) added a comment - MDS and sample client lustre logs.

            Lustre logs from both OSS nodes. backfstype was ZFS for this test, error can be reproduced on either ZFS or ldiskfs

            cliffw Cliff White (Inactive) added a comment - Lustre logs from both OSS nodes. backfstype was ZFS for this test, error can be reproduced on either ZFS or ldiskfs

            Client-side messages

            Oct 26 18:45:02 spirit-11 kernel: Lustre: DEBUG MARKER: == parallel-scale test iorfpp: iorfpp ================================================================ 18:45:02 (1477507502)
            Oct 26 18:45:02 spirit-11 sshd[21261]: Received disconnect from 10.10.1.11: 11: disconnected by user
            Oct 26 18:45:02 spirit-11 sshd[21261]: pam_unix(sshd:session): session closed for user root
            Oct 26 18:45:02 spirit-11 systemd-logind: Removed session 42.
            Oct 26 18:45:05 spirit-11 kernel: LNetError: 3587:0:(o2iblnd_cb.c:1074:kiblnd_init_rdma()) RDMA is too large for peer 192.168.2.2@o2ib (131072), src size: 1048576 dst size: 1048576
            Oct 26 18:45:05 spirit-11 kernel: LNetError: 3586:0:(o2iblnd_cb.c:1698:kiblnd_reply()) Can't setup rdma for GET from 192.168.2.2@o2ib: -90
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3586:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d076800
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3586:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074e00
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3585:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074e00
            Oct 26 18:45:05 spirit-11 kernel: Lustre: 3616:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1477507505/real 1477507505]  req@ffff88077d8d6600 x1549270577199552/t0(0) o4->zfstest-OST0001-osc-ffff880ff9be7000@192.168.2.2@o2ib:6/4 lens 608/448 e 0 to 1 dl 1477507517 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
            Oct 26 18:45:05 spirit-11 kernel: Lustre: zfstest-OST0001-osc-ffff880ff9be7000: Connection to zfstest-OST0001 (at 192.168.2.2@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Oct 26 18:45:05 spirit-11 kernel: Lustre: zfstest-OST0001-osc-ffff880ff9be7000: Connection restored to 192.168.2.2@o2ib (at 192.168.2.2@o2ib)
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3582:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3582:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3581:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3584:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400
            Oct 26 18:45:05 spirit-11 kernel: LNetError: 3587:0:(o2iblnd_cb.c:1074:kiblnd_init_rdma()) Skipped 7 previous similar messages
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3587:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d076800
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3588:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88003666fc00
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3586:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88003666fc00
            Oct 26 18:45:05 spirit-11 kernel: LustreError: 3585:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88003666fc00
            
            cliffw Cliff White (Inactive) added a comment - Client-side messages Oct 26 18:45:02 spirit-11 kernel: Lustre: DEBUG MARKER: == parallel-scale test iorfpp: iorfpp ================================================================ 18:45:02 (1477507502) Oct 26 18:45:02 spirit-11 sshd[21261]: Received disconnect from 10.10.1.11: 11: disconnected by user Oct 26 18:45:02 spirit-11 sshd[21261]: pam_unix(sshd:session): session closed for user root Oct 26 18:45:02 spirit-11 systemd-logind: Removed session 42. Oct 26 18:45:05 spirit-11 kernel: LNetError: 3587:0:(o2iblnd_cb.c:1074:kiblnd_init_rdma()) RDMA is too large for peer 192.168.2.2@o2ib (131072), src size: 1048576 dst size: 1048576 Oct 26 18:45:05 spirit-11 kernel: LNetError: 3586:0:(o2iblnd_cb.c:1698:kiblnd_reply()) Can't setup rdma for GET from 192.168.2.2@o2ib: -90 Oct 26 18:45:05 spirit-11 kernel: LustreError: 3586:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d076800 Oct 26 18:45:05 spirit-11 kernel: LustreError: 3586:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074e00 Oct 26 18:45:05 spirit-11 kernel: LustreError: 3585:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074e00 Oct 26 18:45:05 spirit-11 kernel: Lustre: 3616:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1477507505/real 1477507505] req@ffff88077d8d6600 x1549270577199552/t0(0) o4->zfstest-OST0001-osc-ffff880ff9be7000@192.168.2.2@o2ib:6/4 lens 608/448 e 0 to 1 dl 1477507517 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1 Oct 26 18:45:05 spirit-11 kernel: Lustre: zfstest-OST0001-osc-ffff880ff9be7000: Connection to zfstest-OST0001 (at 192.168.2.2@o2ib) was lost; in progress operations using this service will wait for recovery to complete Oct 26 18:45:05 spirit-11 kernel: Lustre: zfstest-OST0001-osc-ffff880ff9be7000: Connection restored to 192.168.2.2@o2ib (at 192.168.2.2@o2ib) Oct 26 18:45:05 spirit-11 kernel: LustreError: 3582:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400 Oct 26 18:45:05 spirit-11 kernel: LustreError: 3582:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400 Oct 26 18:45:05 spirit-11 kernel: LustreError: 3581:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400 Oct 26 18:45:05 spirit-11 kernel: LustreError: 3584:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d074400 Oct 26 18:45:05 spirit-11 kernel: LNetError: 3587:0:(o2iblnd_cb.c:1074:kiblnd_init_rdma()) Skipped 7 previous similar messages Oct 26 18:45:05 spirit-11 kernel: LustreError: 3587:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88100d076800 Oct 26 18:45:05 spirit-11 kernel: LustreError: 3588:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88003666fc00 Oct 26 18:45:05 spirit-11 kernel: LustreError: 3586:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88003666fc00 Oct 26 18:45:05 spirit-11 kernel: LustreError: 3585:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88003666fc00

            People

              hongchao.zhang Hongchao Zhang
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: