[LU-11734] LNet crashes with 2.12.0-rc1: lnet_attach_rsp_tracker() ASSERTION(md->md_rspt_ptr == ((void *)0)) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: lustre 2.12.0-rc1
    • Severity: 3

    Description

      In my testing I see the following at bring-up:

      2018-12-05T14:24:14.274614-05:00 ninja84.ccs.ornl.gov kernel: Lustre: Lustre: Build Version: 2.12.0_RC1_dirty

      2018-12-05T14:24:14.671611-05:00 ninja84.ccs.ornl.gov kernel: LNetError: 3759:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) ASSERTION( md->md_rspt_ptr == ((void *)0) ) failed:

      2018-12-05T14:24:14.671682-05:00 ninja84.ccs.ornl.gov kernel: LNetError: 3759:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) LBUG

      2018-12-05T14:24:14.671716-05:00 ninja84.ccs.ornl.gov kernel: Pid: 3759, comm: monitor_thread 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018

      2018-12-05T14:24:14.681179-05:00 ninja84.ccs.ornl.gov kernel: Call Trace:

      2018-12-05T14:24:14.685051-05:00 ninja84.ccs.ornl.gov kernel: Lustre: Echo OBD driver; http://www.lustre.org/

      2018-12-05T14:24:14.695841-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc0c267cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]

      2018-12-05T14:24:14.697263-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc0c2687c>] lbug_with_loc+0x4c/0xa0 [libcfs]

      2018-12-05T14:24:14.703501-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc102c257>] lnet_attach_rsp_tracker.isra.29+0xd7/0xe0 [lnet]

      2018-12-05T14:24:14.711154-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc1033ba1>] LNetGet+0x5d1/0xa90 [lnet]

      2018-12-05T14:24:14.716870-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc103c0b9>] lnet_check_routers+0x399/0xbf0 [lnet]

      2018-12-05T14:24:14.723556-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc10354bf>] lnet_monitor_thread+0x4ff/0x560 [lnet]

      2018-12-05T14:24:14.730325-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffff8dabb161>] kthread+0xd1/0xe0

      2018-12-05T14:24:14.735265-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffff8e120677>] ret_from_fork_nospec_end+0x0/0x39

      2018-12-05T14:24:14.741596-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffffffffff>] 0xffffffffffffffff

          Activity

            ashehata Amir Shehata (Inactive) added a comment:

            I don't think this particular patch is relevant to that ticket, since LU-10669 predates the LNet health feature. But it could be related in the sense that an MD can be used multiple times, which could lead to a corner case triggering the assert:

            lib-msg.c:346:lnet_msg_detach_md()) ASSERTION( md->md_refcount >= 0
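The multi-use point above can be illustrated with a toy model. This is a minimal sketch, not the LNet source and not the contents of the patch on this ticket: `toy_md`, `toy_rspt`, `toy_attach_strict`, `toy_attach_shared`, and `toy_detach` are hypothetical names invented for illustration. The strict variant models the invariant behind the reported assertion (one response tracker per MD), and the shared variant shows one possible way an MD used by several operations could tolerate a tracker that is already attached.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins only; these are NOT the real LNet structures. */
struct toy_rspt {
	int refs;                   /* operations currently sharing this tracker */
};

struct toy_md {
	struct toy_rspt *rspt_ptr;  /* loosely models md->md_rspt_ptr */
};

/*
 * Strict attach: assumes an MD never has more than one tracker.
 * Returns 0 on success, -1 if a tracker is already attached.
 * (In the log on this ticket, the equivalent condition hits an LBUG
 * rather than returning an error.)
 */
int toy_attach_strict(struct toy_md *md, struct toy_rspt *rspt)
{
	assert(md != NULL && rspt != NULL);
	if (md->rspt_ptr != NULL)
		return -1;
	rspt->refs = 1;
	md->rspt_ptr = rspt;
	return 0;
}

/*
 * Tolerant attach (an assumption about one possible approach): if the
 * MD already carries a tracker, share it and bump its count; otherwise
 * adopt `fresh`. Returns the tracker now in use, so the caller can tell
 * whether `fresh` was adopted or can be released.
 */
struct toy_rspt *toy_attach_shared(struct toy_md *md, struct toy_rspt *fresh)
{
	assert(md != NULL && fresh != NULL);
	if (md->rspt_ptr != NULL) {
		md->rspt_ptr->refs++;
		return md->rspt_ptr;
	}
	fresh->refs = 1;
	md->rspt_ptr = fresh;
	return fresh;
}

/* Drop one user; clear the MD's pointer when the last user is gone.
 * Returns the number of users still attached. */
int toy_detach(struct toy_md *md)
{
	assert(md != NULL && md->rspt_ptr != NULL);
	int left = --md->rspt_ptr->refs;
	if (left == 0)
		md->rspt_ptr = NULL;
	return left;
}
```

In this model, calling `toy_attach_strict` twice on the same `toy_md` without an intervening `toy_detach` fails on the second call, which is the situation the assertion guards against. `toy_attach_shared` accepts the second call and keeps the pointer set until both users have detached.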

            adilger Andreas Dilger added a comment:

            This looks related to the issues seen in LU-10669?

            gerrit Gerrit Updater added a comment:

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33794
            Subject: LU-11734 lnet: handle multi-md usage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 14bf25aa1c1723ce4a0a88341e6e62299994cb8a

            ashehata Amir Shehata (Inactive) added a comment:

            Working on a patch. Should have it up soon.

            simmonsja James A Simmons added a comment:

            Basic IB-to-IB setup with mlx4, using the default OFED stack on x86 servers.

            adilger Andreas Dilger added a comment:

            Amir, could you please take a look?

            James, could you please provide some basic info about your network config: LND type, routing, module parameters, and OFED version?

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: simmonsja James A Simmons