[LU-7351] LNet router crash during bring up of InfiniBand interface.

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • Lustre 2.8.0
    • Lustre 2.8.0
    • None
    • Cray routers running Lustre 2.7.61 in an SLES11 SP3 environment.
    • 2
    • 9223372036854775807

    Description

      In testing on our medium-sized Cray system, we encountered the following crash while attempting to bring up LNet on the routers.

      [2015-10-26 15:51:14][c0-0c0s6n0]Lustre: kgnilnd build version: 2.7.61-DNE2-1.0502.0.2.7-jsimmons-Unknown-2015-10-21-11:16
      [2015-10-26 15:51:14][c0-0c0s6n0]LNet: Added LNI 12@gni2 [16/8192/0/0]
      [2015-10-26 15:51:15][c0-0c0s6n0]LNetError: 149:0:(o2iblnd_cb.c:2239:kiblnd_passive_connect()) Can't accept conn from 10.36.226.4@o2ib on NA (ib0:0:10.36.223.1): bad dst nid 10.36.223.1@o2ib
      [2015-10-26 15:51:15][c0-0c0s6n0]BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
      [2015-10-26 15:51:15][c0-0c0s6n0]IP: [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0]PGD 3dc9ae067 PUD 3ddc0c067 PMD 0
      [2015-10-26 15:51:15][c0-0c0s6n0]Oops: 0000 [#1] SMP
      [2015-10-26 15:51:15][c0-0c0s6n0]CPU 5
      [2015-10-26 15:51:15][c0-0c0s6n0]Modules linked in: ko2iblnd kgnilnd lnet crc32c libcfs binfmt_misc rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlx4_en mlx4_ib ib_sa ib_mad ib_core mlx4_core compat nic_compat dm_mod kdreg gpcd_gem ipogif_gem kgni_gem hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_gem cgm craytrace
      [2015-10-26 15:51:15][c0-0c0s6n0]Pid: 149, comm: kworker/5:1 Tainted: P 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
      [2015-10-26 15:51:15][c0-0c0s6n0]RIP: 0010:[<ffffffffa03e3e5b>] [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0]RSP: 0018:ffff8803f1b0fb10 EFLAGS: 00010246
      [2015-10-26 15:51:15][c0-0c0s6n0]RAX: 000000000000003f RBX: ffffffffa03ee513 RCX: ffffffff81368c50
      [2015-10-26 15:51:15][c0-0c0s6n0]RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffff8803e8ac7680
      [2015-10-26 15:51:15][c0-0c0s6n0]RBP: ffff8803f1b0fbd0 R08: 0000000000000005 R09: 0000000000000005
      [2015-10-26 15:51:15][c0-0c0s6n0]R10: 0000000000000003 R11: 00000000ffffffff R12: 0000000000000012
      [2015-10-26 15:51:15][c0-0c0s6n0]R13: 0000000000000000 R14: 0000000000000000 R15: ffff8803c0e58620
      [2015-10-26 15:51:15][c0-0c0s6n0]FS: 00007f9e343b7700(0000) GS:ffff880407d40000(0000) knlGS:0000000000000000
      [2015-10-26 15:51:15][c0-0c0s6n0]CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [2015-10-26 15:51:15][c0-0c0s6n0]CR2: 0000000000000080 CR3: 00000003dc9bd000 CR4: 00000000000007e0
      [2015-10-26 15:51:15][c0-0c0s6n0]DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [2015-10-26 15:51:15][c0-0c0s6n0]DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [2015-10-26 15:51:15][c0-0c0s6n0]LNet: Added LNI 10.36.223.1@o2ib [63/2560/0/180]
      [2015-10-26 15:51:15][c0-0c0s6n0]Process kworker/5:1 (pid: 149, threadinfo ffff8803f1b0c000, task ffff8803f1b09040)
      [2015-10-26 15:51:15][c0-0c0s6n0]Stack:
      [2015-10-26 15:51:15][c0-0c0s6n0] ffff8803c0e58620 ffffffffa02f6080 0000000000000000 ffff8803ea8bcc00
      [2015-10-26 15:51:15][c0-0c0s6n0] ffffffffa02f6080 000500000a24e204 0000000000000001 ffff8803e4740000
      [2015-10-26 15:51:15][c0-0c0s6n0] 000300120be91b91 0000000000000000 000010000000003f ffffffffa017b636
      [2015-10-26 15:51:15][c0-0c0s6n0]Call Trace:
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e59bd>] kiblnd_cm_callback+0x5ad/0x2070 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa024145b>] cma_req_handler+0x1eb/0x550 [rdma_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa016ff57>] cm_process_work+0x27/0x130 [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0171fb0>] cm_req_handler+0x750/0xa00 [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0172385>] cm_work_handler+0x125/0xf4c [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81060953>] process_one_work+0x163/0x440
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81063473>] worker_thread+0x183/0x400
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81067ace>] kthread+0x9e/0xb0
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
      [2015-10-26 15:51:15][c0-0c0s6n0]Code: 0f 84 da 01 00 00 66 3d 00 11 0f 84 d0 01 00 00 66 c7 45 84 12 00 45 31 f6 48 8b 05 38 03 01 00 ba 00 01 00 00 8b 00 66 89 45 90
      [2015-10-26 15:51:15][c0-0c0s6n0] 8b 86 80 00 00 00 85 c0 0f 45 d0 48 8b bd 58 ff ff ff 48 8d
      [2015-10-26 15:51:15][c0-0c0s6n0]RIP [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0] RSP <ffff8803f1b0fb10>
      [2015-10-26 15:51:15][c0-0c0s6n0]CR2: 0000000000000080
      [2015-10-26 15:51:15][c0-0c0s6n0]---[ end trace 311d9fd8dd61b1cf ]---
      [2015-10-26 15:51:15][c0-0c0s6n0]Kernel panic - not syncing: Fatal exception
      [2015-10-26 15:51:15][c0-0c0s6n0]Pid: 149, comm: kworker/5:1 Tainted: P D 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
      [2015-10-26 15:51:15][c0-0c0s6n0]Call Trace:
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81004eb9>] dump_trace+0x89/0x430
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810060bc>] show_trace_log_lvl+0x5c/0x80
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810060f5>] show_trace+0x15/0x20
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148b31c>] dump_stack+0x79/0x84
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148b3bb>] panic+0x94/0x1da
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81005ed8>] oops_end+0xa8/0xe0
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027589>] no_context+0xf9/0x260
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027855>] __bad_area_nosemaphore+0x165/0x1f0
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810278f3>] bad_area_nosemaphore+0x13/0x20
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027e4e>] do_page_fault+0x2fe/0x440
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148e8cf>] page_fault+0x1f/0x30
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e59bd>] kiblnd_cm_callback+0x5ad/0x2070 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa024145b>] cma_req_handler+0x1eb/0x550 [rdma_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa016ff57>] cm_process_work+0x27/0x130 [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0171fb0>] cm_req_handler+0x750/0xa00 [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0172385>] cm_work_handler+0x125/0xf4c [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81060953>] process_one_work+0x163/0x440
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81063473>] worker_thread+0x183/0x400
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81067ace>] kthread+0x9e/0xb0
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
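
      The oops above can be cross-checked by hand. The sketch below is not part of the original report; it assumes that the faulting instruction begins at the first byte of the second Code: line (the usual <..> marker around the faulting byte does not appear in this log), and the helper names parse_registers and decode_mov_disp32 are made up for illustration. It decodes the six bytes 8b 86 80 00 00 00 as a load from a 32-bit displacement off a base register, then compares base + displacement with CR2. It says nothing about which structure or field sits at that offset.

```python
import re

# Three lines copied from the oops above (timestamp and node prefix dropped).
OOPS = """\
BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffff8803e8ac7680
CR2: 0000000000000080 CR3: 00000003dc9bd000 CR4: 00000000000007e0
"""

# Assumption: these six bytes from the Code: line are the faulting instruction.
INSN = bytes.fromhex("8b 86 80 00 00 00")

# x86-64 register numbering for the ModRM byte when no REX prefix is present.
BASE_REGS = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]
DEST_REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair in text."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def decode_mov_disp32(insn):
    """Decode 'mov disp32(base), reg32': opcode 0x8b, ModRM mod=2, no SIB byte."""
    opcode, modrm = insn[0], insn[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x8B or mod != 2 or rm == 4:
        raise ValueError("not a plain 'mov disp32(base), reg32'")
    disp = int.from_bytes(insn[2:6], "little", signed=True)
    return BASE_REGS[rm], DEST_REGS[reg], disp


regs = parse_registers(OOPS)
base, dest, disp = decode_mov_disp32(INSN)
fault_addr = regs[base] + disp

print(f"mov {disp:#x}(%{base.lower()}),%{dest}")
print(f"{base} = {regs[base]:#x}, computed address = {fault_addr:#x}")
print("matches CR2:", fault_addr == regs["CR2"])
```

      Under that assumption the load reads offset 0x80 from RSI, and RSI is zero in the register dump, which agrees with the reported fault address in CR2. Resolving kiblnd_passive_connect+0xfb to a source line still requires the matching ko2iblnd binary.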

          Activity

            pjones Peter Jones added a comment - Great - thanks for confirming James

            simmonsja James A Simmons added a comment - I agree. The patch from LU-3322 resolves this.

            pjones Peter Jones added a comment - Seems like this is a duplicate of LU-3322. James, please speak up if you think otherwise.

            doug Doug Oucharek (Inactive) added a comment - Hi James. I see you successfully tested LU-3322. Does that mean this issue has been resolved?

            simmonsja James A Simmons added a comment - Can we wait until LU-3322 is settled? I noticed that the patch has changed again but is now disliked. I want to wait until an agreed-upon solution is presented before I try to test it again. Is that okay?

            doug Doug Oucharek (Inactive) added a comment - James: Can I close this linked to LU-3322?

            simmonsja James A Simmons added a comment - I added the latest LU-3322 patch and I'm not seeing the crashes anymore. This is at the smallest scale at which I saw this problem before, but we don't know if it resolves this at Titan scale.

            simmonsja James A Simmons added a comment - Nope, the problem is not cmid->context being NULL. I'm going to give it another run to see what it is.

            morrone Christopher Morrone (Inactive) added a comment - This sounds like it is a blocker for the 2.8 release as stated. I marked it as such, but you can correct it if I'm wrong.

            doug Doug Oucharek (Inactive) added a comment - James: we suspect that cmid->context is coming back NULL when we don't expect such a thing to happen. Can you verify the line of the crash with your binary? I'm not sure that the binary Amir used is equivalent to yours. Once we know for sure that a NULL cmid->context is the cause, we can start to figure out how such a thing can happen.

            People

              Assignee: doug Doug Oucharek (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 11
