Lustre / LU-7569

IB leaf switch caused LNet routers to crash

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3

    Description

      During testing we lost one of the IB leaf switches, which caused all of our Lustre routers to crash with the following error:

      2015-12-11T10:53:29.539273-05:00 c0-0c0s2n3 LNetError: 4675:0:(o2iblnd.c:399:kiblnd_find_peer_locked()) ASSERTION( peer->ibp_connecting > 0 || peer->ibp_accepting > 0 || !list_empty(&peer->ibp_conns) ) failed:
      2015-12-11T10:53:29.539305-05:00 c0-0c0s2n3 LNetError: 4675:0:(o2iblnd.c:399:kiblnd_find_peer_locked()) LBUG
      2015-12-11T10:53:29.539313-05:00 c0-0c0s2n3 Pid: 4675, comm: kgnilnd_sd_02
      2015-12-11T10:53:29.539319-05:00 c0-0c0s2n3 Call Trace:
      2015-12-11T10:53:29.539324-05:00 c0-0c0s2n3 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
      2015-12-11T10:53:29.539332-05:00 c0-0c0s2n3 [<ffffffff81004eb9>] dump_trace+0x89/0x430
      2015-12-11T10:53:29.539339-05:00 c0-0c0s2n3 [<ffffffffa025bac0>] lbug_with_loc+0x90/0x1d0 [libcfs]
      2015-12-11T10:53:29.539348-05:00 c0-0c0s2n3 [<ffffffffa036f32b>] kiblnd_find_peer_locked+0x14b/0x150 [ko2iblnd]
      2015-12-11T10:53:29.539358-05:00 c0-0c0s2n3 [<ffffffffa036f379>] kiblnd_query+0x49/0x1c0 [ko2iblnd]
      2015-12-11T10:53:29.539364-05:00 c0-0c0s2n3 [<ffffffffa02d5aee>] lnet_post_send_locked+0x2ee/0x740 [lnet]
      2015-12-11T10:53:29.539369-05:00 c0-0c0s2n3 [<ffffffffa02d84f0>] lnet_send+0x6a0/0xcf0 [lnet]
      2015-12-11T10:53:29.539375-05:00 c0-0c0s2n3 [<ffffffffa02cbe94>] lnet_finalize+0x424/0x800 [lnet]
      2015-12-11T10:53:29.539380-05:00 c0-0c0s2n3 [<ffffffffa03d256b>] kgnilnd_recv+0x73b/0xdf0 [kgnilnd]
      2015-12-11T10:53:29.539385-05:00 c0-0c0s2n3 [<ffffffffa02d432f>] lnet_ni_recv+0xcf/0x330 [lnet]
      2015-12-11T10:53:29.539389-05:00 c0-0c0s2n3 [<ffffffffa02dac26>] lnet_parse+0x3c6/0xe40 [lnet]
      2015-12-11T10:53:29.539394-05:00 c0-0c0s2n3 [<ffffffffa03d8111>] kgnilnd_check_fma_rx+0x1af1/0x1f50 [kgnilnd]
      2015-12-11T10:53:29.539406-05:00 c0-0c0s2n3 [<ffffffffa03dbbc4>] kgnilnd_process_conns+0x554/0x15d0 [kgnilnd]
      2015-12-11T10:53:29.539411-05:00 c0-0c0s2n3 [<ffffffffa03dcf1e>] kgnilnd_scheduler+0x2de/0x5f0 [kgnilnd]
      2015-12-11T10:53:29.539416-05:00 c0-0c0s2n3 [<ffffffff81067ace>] kthread+0x9e/0xb0
      2015-12-11T10:53:29.539421-05:00 c0-0c0s2n3 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
      2015-12-11T10:53:29.539427-05:00 c0-0c0s2n3 Kernel panic - not syncing: LBUG
      2015-12-11T10:53:29.539432-05:00 c0-0c0s2n3 Pid: 4675, comm: kgnilnd_sd_02 Tainted: P 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
      2015-12-11T10:53:29.539437-05:00 c0-0c0s2n3 Call Trace:
      2015-12-11T10:53:29.539442-05:00 c0-0c0s2n3 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
      2015-12-11T10:53:29.539447-05:00 c0-0c0s2n3 [<ffffffff81004eb9>] dump_trace+0x89/0x430
      2015-12-11T10:53:29.539452-05:00 c0-0c0s2n3 [<ffffffff810060bc>] show_trace_log_lvl+0x5c/0x80
      2015-12-11T10:53:29.539457-05:00 c0-0c0s2n3 [<ffffffff810060f5>] show_trace+0x15/0x20
      2015-12-11T10:53:29.539462-05:00 c0-0c0s2n3 [<ffffffff8148b31c>] dump_stack+0x79/0x84
      2015-12-11T10:53:29.539467-05:00 c0-0c0s2n3 [<ffffffff8148b3bb>] panic+0x94/0x1da
      2015-12-11T10:53:29.539473-05:00 c0-0c0s2n3 [<ffffffffa025bbf1>] lbug_with_loc+0x1c1/0x1d0 [libcfs]
      2015-12-11T10:53:29.539479-05:00 c0-0c0s2n3 [<ffffffffa036f32b>] kiblnd_find_peer_locked+0x14b/0x150 [ko2iblnd]
      2015-12-11T10:53:29.539484-05:00 c0-0c0s2n3 [<ffffffffa036f379>] kiblnd_query+0x49/0x1c0 [ko2iblnd]
      2015-12-11T10:53:29.539489-05:00 c0-0c0s2n3 [<ffffffffa02d5aee>] lnet_post_send_locked+0x2ee/0x740 [lnet]
      2015-12-11T10:53:29.539494-05:00 c0-0c0s2n3 [<ffffffffa02d84f0>] lnet_send+0x6a0/0xcf0 [lnet]
      2015-12-11T10:53:29.539500-05:00 c0-0c0s2n3 [<ffffffffa02cbe94>] lnet_finalize+0x424/0x800 [lnet]
      2015-12-11T10:53:29.539505-05:00 c0-0c0s2n3 [<ffffffffa03d256b>] kgnilnd_recv+0x73b/0xdf0 [kgnilnd]
      2015-12-11T10:53:29.539511-05:00 c0-0c0s2n3 [<ffffffffa02d432f>] lnet_ni_recv+0xcf/0x330 [lnet]
      2015-12-11T10:53:29.539519-05:00 c0-0c0s2n3 [<ffffffffa02dac26>] lnet_parse+0x3c6/0xe40 [lnet]
      2015-12-11T10:53:29.539526-05:00 c0-0c0s2n3 [<ffffffffa03d8111>] kgnilnd_check_fma_rx+0x1af1/0x1f50 [kgnilnd]
      2015-12-11T10:53:29.539532-05:00 c0-0c0s2n3 [<ffffffffa03dbbc4>] kgnilnd_process_conns+0x554/0x15d0 [kgnilnd]
      2015-12-11T10:53:29.539537-05:00 c0-0c0s2n3 [<ffffffffa03dcf1e>] kgnilnd_scheduler+0x2de/0x5f0 [kgnilnd]
      2015-12-11T10:53:29.539544-05:00 c0-0c0s2n3 [<ffffffff81067ace>] kthread+0x9e/0xb0
      2015-12-11T10:53:29.539550-05:00 c0-0c0s2n3 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
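
The trace above is easier to follow when reduced to the sequence of kernel modules involved. The following is a minimal sketch of a log helper, not part of Lustre: the names `FRAME_RE`, `parse_frames`, `module_path`, and `SAMPLE` are made up for illustration, and the sample lines are copied from the first call trace with the timestamp and hostname prefix trimmed. It assumes every frame has the `[<addr>] func+0xoff/0xlen [module]` shape seen in this log.

```python
import re

# Matches one stack frame such as
#   [<ffffffffa036f379>] kiblnd_query+0x49/0x1c0 [ko2iblnd]
# The trailing "[module]" is absent for frames in the core kernel.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<func>[A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (function, module) pairs in log order (innermost frame first)."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("func"), match.group("module") or "kernel"))
    return frames


def module_path(frames):
    """Collapse consecutive frames of one module, outermost caller first."""
    path = []
    for _func, module in reversed(frames):
        if not path or path[-1] != module:
            path.append(module)
    return path


# Sample frames taken from the first call trace in this report.
SAMPLE = """\
[<ffffffffa025bac0>] lbug_with_loc+0x90/0x1d0 [libcfs]
[<ffffffffa036f32b>] kiblnd_find_peer_locked+0x14b/0x150 [ko2iblnd]
[<ffffffffa036f379>] kiblnd_query+0x49/0x1c0 [ko2iblnd]
[<ffffffffa02d5aee>] lnet_post_send_locked+0x2ee/0x740 [lnet]
[<ffffffffa02d84f0>] lnet_send+0x6a0/0xcf0 [lnet]
[<ffffffffa02cbe94>] lnet_finalize+0x424/0x800 [lnet]
[<ffffffffa03d256b>] kgnilnd_recv+0x73b/0xdf0 [kgnilnd]
[<ffffffffa02d432f>] lnet_ni_recv+0xcf/0x330 [lnet]
[<ffffffffa02dac26>] lnet_parse+0x3c6/0xe40 [lnet]
[<ffffffffa03d8111>] kgnilnd_check_fma_rx+0x1af1/0x1f50 [kgnilnd]
[<ffffffffa03dcf1e>] kgnilnd_scheduler+0x2de/0x5f0 [kgnilnd]
[<ffffffff81067ace>] kthread+0x9e/0xb0
"""

if __name__ == "__main__":
    print(" -> ".join(module_path(parse_frames(SAMPLE))))
```

Run on the sample, this prints `kernel -> kgnilnd -> lnet -> kgnilnd -> lnet -> ko2iblnd -> libcfs`. Read that way, the kgnilnd scheduler thread was handling a message received on the Gemini side, LNet went on to send from that path, and the assertion fired in ko2iblnd when `kiblnd_query` looked up the IB peer.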

            People

              doug Doug Oucharek (Inactive)
              simmonsja James A Simmons