LU-7351: LNet router crash during bring-up of InfiniBand interface.

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: Cray routers running Lustre 2.7.61 in an SLES11 SP3 environment.
    • Severity: 2

    Description

      While testing on our medium-sized Cray system, we hit the following crash when bringing up LNet on the routers.

      [2015-10-26 15:51:14][c0-0c0s6n0]Lustre: kgnilnd build version: 2.7.61-DNE2-1.0502.0.2.7-jsimmons-Unknown-2015-10-21-11:16
      [2015-10-26 15:51:14][c0-0c0s6n0]LNet: Added LNI 12@gni2 [16/8192/0/0]
      [2015-10-26 15:51:15][c0-0c0s6n0]LNetError: 149:0:(o2iblnd_cb.c:2239:kiblnd_passive_connect()) Can't accept conn from 10.36.226.4@o2ib on NA (ib0:0:10.36.223.1): bad dst nid 10.36.223.1@o2ib
      [2015-10-26 15:51:15][c0-0c0s6n0]BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
      [2015-10-26 15:51:15][c0-0c0s6n0]IP: [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0]PGD 3dc9ae067 PUD 3ddc0c067 PMD 0
      [2015-10-26 15:51:15][c0-0c0s6n0]Oops: 0000 [#1] SMP
      [2015-10-26 15:51:15][c0-0c0s6n0]CPU 5
      [2015-10-26 15:51:15][c0-0c0s6n0]Modules linked in: ko2iblnd kgnilnd lnet crc32c libcfs binfmt_misc rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlx4_en mlx4_ib ib_sa ib_mad ib_core mlx4_core compat nic_compat dm_mod kdreg gpcd_gem ipogif_gem kgni_gem hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_gem cgm craytrace
      [2015-10-26 15:51:15][c0-0c0s6n0]Pid: 149, comm: kworker/5:1 Tainted: P 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
      [2015-10-26 15:51:15][c0-0c0s6n0]RIP: 0010:[<ffffffffa03e3e5b>] [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0]RSP: 0018:ffff8803f1b0fb10 EFLAGS: 00010246
      [2015-10-26 15:51:15][c0-0c0s6n0]RAX: 000000000000003f RBX: ffffffffa03ee513 RCX: ffffffff81368c50
      [2015-10-26 15:51:15][c0-0c0s6n0]RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffff8803e8ac7680
      [2015-10-26 15:51:15][c0-0c0s6n0]RBP: ffff8803f1b0fbd0 R08: 0000000000000005 R09: 0000000000000005
      [2015-10-26 15:51:15][c0-0c0s6n0]R10: 0000000000000003 R11: 00000000ffffffff R12: 0000000000000012
      [2015-10-26 15:51:15][c0-0c0s6n0]R13: 0000000000000000 R14: 0000000000000000 R15: ffff8803c0e58620
      [2015-10-26 15:51:15][c0-0c0s6n0]FS: 00007f9e343b7700(0000) GS:ffff880407d40000(0000) knlGS:0000000000000000
      [2015-10-26 15:51:15][c0-0c0s6n0]CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [2015-10-26 15:51:15][c0-0c0s6n0]CR2: 0000000000000080 CR3: 00000003dc9bd000 CR4: 00000000000007e0
      [2015-10-26 15:51:15][c0-0c0s6n0]DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [2015-10-26 15:51:15][c0-0c0s6n0]DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [2015-10-26 15:51:15][c0-0c0s6n0]LNet: Added LNI 10.36.223.1@o2ib [63/2560/0/180]
      [2015-10-26 15:51:15][c0-0c0s6n0]Process kworker/5:1 (pid: 149, threadinfo ffff8803f1b0c000, task ffff8803f1b09040)
      [2015-10-26 15:51:15][c0-0c0s6n0]Stack:
      [2015-10-26 15:51:15][c0-0c0s6n0] ffff8803c0e58620 ffffffffa02f6080 0000000000000000 ffff8803ea8bcc00
      [2015-10-26 15:51:15][c0-0c0s6n0] ffffffffa02f6080 000500000a24e204 0000000000000001 ffff8803e4740000
      [2015-10-26 15:51:15][c0-0c0s6n0] 000300120be91b91 0000000000000000 000010000000003f ffffffffa017b636
      [2015-10-26 15:51:15][c0-0c0s6n0]Call Trace:
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e59bd>] kiblnd_cm_callback+0x5ad/0x2070 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa024145b>] cma_req_handler+0x1eb/0x550 [rdma_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa016ff57>] cm_process_work+0x27/0x130 [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0171fb0>] cm_req_handler+0x750/0xa00 [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0172385>] cm_work_handler+0x125/0xf4c [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81060953>] process_one_work+0x163/0x440
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81063473>] worker_thread+0x183/0x400
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81067ace>] kthread+0x9e/0xb0
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
      [2015-10-26 15:51:15][c0-0c0s6n0]Code: 0f 84 da 01 00 00 66 3d 00 11 0f 84 d0 01 00 00 66 c7 45 84 12 00 45 31 f6 48 8b 05 38 03 01 00 ba 00 01 00 00 8b 00 66 89 45 90 <8b> 86 80 00 00 00 85 c0 0f 45 d0 48 8b bd 58 ff ff ff 48 8d
      [2015-10-26 15:51:15][c0-0c0s6n0]RIP [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0] RSP <ffff8803f1b0fb10>
      [2015-10-26 15:51:15][c0-0c0s6n0]CR2: 0000000000000080
      [2015-10-26 15:51:15][c0-0c0s6n0]--[ end trace 311d9fd8dd61b1cf ]--
      [2015-10-26 15:51:15][c0-0c0s6n0]Kernel panic - not syncing: Fatal exception
      [2015-10-26 15:51:15][c0-0c0s6n0]Pid: 149, comm: kworker/5:1 Tainted: P D 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
      [2015-10-26 15:51:15][c0-0c0s6n0]Call Trace:
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81004eb9>] dump_trace+0x89/0x430
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810060bc>] show_trace_log_lvl+0x5c/0x80
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810060f5>] show_trace+0x15/0x20
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148b31c>] dump_stack+0x79/0x84
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148b3bb>] panic+0x94/0x1da
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81005ed8>] oops_end+0xa8/0xe0
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027589>] no_context+0xf9/0x260
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027855>] __bad_area_nosemaphore+0x165/0x1f0
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff810278f3>] bad_area_nosemaphore+0x13/0x20
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81027e4e>] do_page_fault+0x2fe/0x440
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff8148e8cf>] page_fault+0x1f/0x30
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa03e59bd>] kiblnd_cm_callback+0x5ad/0x2070 [ko2iblnd]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa024145b>] cma_req_handler+0x1eb/0x550 [rdma_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa016ff57>] cm_process_work+0x27/0x130 [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0171fb0>] cm_req_handler+0x750/0xa00 [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffffa0172385>] cm_work_handler+0x125/0xf4c [ib_cm]
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81060953>] process_one_work+0x163/0x440
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81063473>] worker_thread+0x183/0x400
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81067ace>] kthread+0x9e/0xb0
      [2015-10-26 15:51:15][c0-0c0s6n0] [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
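
      As a reading aid for the trace above, here is a minimal Python sketch that pulls the faulting symbol and two registers out of the console output. The helper name summarize_oops and its regular expressions are illustrative assumptions, not part of any Lustre or kernel tooling, and the sample lines are copied from the log in this ticket. The result shows RSI at 0 and CR2 at 0x80, which is consistent with a read at offset 0x80 through a NULL pointer inside kiblnd_passive_connect. The log alone does not say which structure that pointer was meant to reference.

```python
import re

# A few lines copied verbatim from the console log in this ticket.
LOG = """\
[2015-10-26 15:51:15][c0-0c0s6n0]BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[2015-10-26 15:51:15][c0-0c0s6n0]IP: [<ffffffffa03e3e5b>] kiblnd_passive_connect+0xfb/0x16b0 [ko2iblnd]
[2015-10-26 15:51:15][c0-0c0s6n0]RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffff8803e8ac7680
[2015-10-26 15:51:15][c0-0c0s6n0]CR2: 0000000000000080 CR3: 00000003dc9bd000 CR4: 00000000000007e0
"""

# Leading "[timestamp][node]" prefix added by the console logger.
PREFIX = re.compile(r"^\[[^\]]*\]\[[^\]]*\]")

# "IP: [<addr>] symbol+0xoff/0xsize [module]"
IP_LINE = re.compile(
    r"IP: \[<([0-9a-f]+)>\] (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+) \[(\w+)\]"
)

# Registers of interest for this fault (hypothetical selection).
REGISTER = re.compile(r"\b(RSI|CR2): ([0-9a-f]{16})")


def summarize_oops(text):
    """Collect the faulting symbol, its offset, the module, and RSI/CR2."""
    info = {}
    for raw in text.splitlines():
        line = PREFIX.sub("", raw)
        m = IP_LINE.match(line)
        if m:
            info["ip"] = int(m.group(1), 16)
            info["symbol"] = m.group(2)
            info["offset"] = int(m.group(3), 16)
            info["size"] = int(m.group(4), 16)
            info["module"] = m.group(5)
        for name, value in REGISTER.findall(line):
            info[name.lower()] = int(value, 16)
    return info


summary = summarize_oops(LOG)
print(summary["module"], summary["symbol"], hex(summary["offset"]))
# Distance between the fault address and the base register.
print(hex(summary["cr2"] - summary["rsi"]))
```

      The function start address can be estimated as ip minus offset, which helps when matching the fault against a disassembly of ko2iblnd.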

            People

              Assignee: Doug Oucharek (Inactive)
              Reporter: James A Simmons