IB leaf switch caused LNet routers to crash

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0
    • Lustre 2.8.0
    • None
    • 3

    Description

      During testing we lost one of the IB leaf switches, which caused all of our Lustre routers to crash with the following error:

      2015-12-11T10:53:29.539273-05:00 c0-0c0s2n3 LNetError: 4675:0:(o2iblnd.c:399:kiblnd_find_peer_locked()) ASSERTION( peer->ibp_connecting > 0 || peer->ibp_accepting > 0 || !list_empty(&peer->ibp_conns) ) failed:
      2015-12-11T10:53:29.539305-05:00 c0-0c0s2n3 LNetError: 4675:0:(o2iblnd.c:399:kiblnd_find_peer_locked()) LBUG
      2015-12-11T10:53:29.539313-05:00 c0-0c0s2n3 Pid: 4675, comm: kgnilnd_sd_02
      2015-12-11T10:53:29.539319-05:00 c0-0c0s2n3 Call Trace:
      2015-12-11T10:53:29.539324-05:00 c0-0c0s2n3 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
      2015-12-11T10:53:29.539332-05:00 c0-0c0s2n3 [<ffffffff81004eb9>] dump_trace+0x89/0x430
      2015-12-11T10:53:29.539339-05:00 c0-0c0s2n3 [<ffffffffa025bac0>] lbug_with_loc+0x90/0x1d0 [libcfs]
      2015-12-11T10:53:29.539348-05:00 c0-0c0s2n3 [<ffffffffa036f32b>] kiblnd_find_peer_locked+0x14b/0x150 [ko2iblnd]
      2015-12-11T10:53:29.539358-05:00 c0-0c0s2n3 [<ffffffffa036f379>] kiblnd_query+0x49/0x1c0 [ko2iblnd]
      2015-12-11T10:53:29.539364-05:00 c0-0c0s2n3 [<ffffffffa02d5aee>] lnet_post_send_locked+0x2ee/0x740 [lnet]
      2015-12-11T10:53:29.539369-05:00 c0-0c0s2n3 [<ffffffffa02d84f0>] lnet_send+0x6a0/0xcf0 [lnet]
      2015-12-11T10:53:29.539375-05:00 c0-0c0s2n3 [<ffffffffa02cbe94>] lnet_finalize+0x424/0x800 [lnet]
      2015-12-11T10:53:29.539380-05:00 c0-0c0s2n3 [<ffffffffa03d256b>] kgnilnd_recv+0x73b/0xdf0 [kgnilnd]
      2015-12-11T10:53:29.539385-05:00 c0-0c0s2n3 [<ffffffffa02d432f>] lnet_ni_recv+0xcf/0x330 [lnet]
      2015-12-11T10:53:29.539389-05:00 c0-0c0s2n3 [<ffffffffa02dac26>] lnet_parse+0x3c6/0xe40 [lnet]
      2015-12-11T10:53:29.539394-05:00 c0-0c0s2n3 [<ffffffffa03d8111>] kgnilnd_check_fma_rx+0x1af1/0x1f50 [kgnilnd]
      2015-12-11T10:53:29.539406-05:00 c0-0c0s2n3 [<ffffffffa03dbbc4>] kgnilnd_process_conns+0x554/0x15d0 [kgnilnd]
      2015-12-11T10:53:29.539411-05:00 c0-0c0s2n3 [<ffffffffa03dcf1e>] kgnilnd_scheduler+0x2de/0x5f0 [kgnilnd]
      2015-12-11T10:53:29.539416-05:00 c0-0c0s2n3 [<ffffffff81067ace>] kthread+0x9e/0xb0
      2015-12-11T10:53:29.539421-05:00 c0-0c0s2n3 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
      2015-12-11T10:53:29.539427-05:00 c0-0c0s2n3 Kernel panic - not syncing: LBUG
      2015-12-11T10:53:29.539432-05:00 c0-0c0s2n3 Pid: 4675, comm: kgnilnd_sd_02 Tainted: P 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
      2015-12-11T10:53:29.539437-05:00 c0-0c0s2n3 Call Trace:
      2015-12-11T10:53:29.539442-05:00 c0-0c0s2n3 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
      2015-12-11T10:53:29.539447-05:00 c0-0c0s2n3 [<ffffffff81004eb9>] dump_trace+0x89/0x430
      2015-12-11T10:53:29.539452-05:00 c0-0c0s2n3 [<ffffffff810060bc>] show_trace_log_lvl+0x5c/0x80
      2015-12-11T10:53:29.539457-05:00 c0-0c0s2n3 [<ffffffff810060f5>] show_trace+0x15/0x20
      2015-12-11T10:53:29.539462-05:00 c0-0c0s2n3 [<ffffffff8148b31c>] dump_stack+0x79/0x84
      2015-12-11T10:53:29.539467-05:00 c0-0c0s2n3 [<ffffffff8148b3bb>] panic+0x94/0x1da
      2015-12-11T10:53:29.539473-05:00 c0-0c0s2n3 [<ffffffffa025bbf1>] lbug_with_loc+0x1c1/0x1d0 [libcfs]
      2015-12-11T10:53:29.539479-05:00 c0-0c0s2n3 [<ffffffffa036f32b>] kiblnd_find_peer_locked+0x14b/0x150 [ko2iblnd]
      2015-12-11T10:53:29.539484-05:00 c0-0c0s2n3 [<ffffffffa036f379>] kiblnd_query+0x49/0x1c0 [ko2iblnd]
      2015-12-11T10:53:29.539489-05:00 c0-0c0s2n3 [<ffffffffa02d5aee>] lnet_post_send_locked+0x2ee/0x740 [lnet]
      2015-12-11T10:53:29.539494-05:00 c0-0c0s2n3 [<ffffffffa02d84f0>] lnet_send+0x6a0/0xcf0 [lnet]
      2015-12-11T10:53:29.539500-05:00 c0-0c0s2n3 [<ffffffffa02cbe94>] lnet_finalize+0x424/0x800 [lnet]
      2015-12-11T10:53:29.539505-05:00 c0-0c0s2n3 [<ffffffffa03d256b>] kgnilnd_recv+0x73b/0xdf0 [kgnilnd]
      2015-12-11T10:53:29.539511-05:00 c0-0c0s2n3 [<ffffffffa02d432f>] lnet_ni_recv+0xcf/0x330 [lnet]
      2015-12-11T10:53:29.539519-05:00 c0-0c0s2n3 [<ffffffffa02dac26>] lnet_parse+0x3c6/0xe40 [lnet]
      2015-12-11T10:53:29.539526-05:00 c0-0c0s2n3 [<ffffffffa03d8111>] kgnilnd_check_fma_rx+0x1af1/0x1f50 [kgnilnd]
      2015-12-11T10:53:29.539532-05:00 c0-0c0s2n3 [<ffffffffa03dbbc4>] kgnilnd_process_conns+0x554/0x15d0 [kgnilnd]
      2015-12-11T10:53:29.539537-05:00 c0-0c0s2n3 [<ffffffffa03dcf1e>] kgnilnd_scheduler+0x2de/0x5f0 [kgnilnd]
      2015-12-11T10:53:29.539544-05:00 c0-0c0s2n3 [<ffffffff81067ace>] kthread+0x9e/0xb0
      2015-12-11T10:53:29.539550-05:00 c0-0c0s2n3 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
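
      The condition in the ASSERTION line above can be read as a statement about peer state: any peer returned by kiblnd_find_peer_locked() is expected to have a connection attempt in flight or at least one established connection. The sketch below restates that condition on a toy structure. It is an illustration only, not the ko2iblnd source: struct toy_peer, toy_peer_is_active() and the ibp_nconns counter are hypothetical names, and the counter stands in for the kernel list ibp_conns.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the o2iblnd peer state.
 * Only the three fields named in the assertion text are modeled.
 * ibp_nconns is an assumption of this sketch: it replaces the kernel
 * list ibp_conns, so "ibp_nconns > 0" plays the role of
 * "!list_empty(&peer->ibp_conns)". */
struct toy_peer {
    int ibp_connecting; /* connection attempts this node started */
    int ibp_accepting;  /* connection attempts this node is accepting */
    int ibp_nconns;     /* established connections (toy counter) */
};

/* Returns true when the condition from the ASSERTION line holds. */
bool toy_peer_is_active(const struct toy_peer *peer)
{
    return peer->ibp_connecting > 0 ||
           peer->ibp_accepting > 0 ||
           peer->ibp_nconns > 0;
}
```

      In these terms, the LBUG in the trace corresponds to a lookup that found a peer for which all three values were zero, which is the one state this check rejects.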

          Activity

            [LU-7569] IB leaf switch caused LNet routers to crash

            jaylan Jay Lan (Inactive) added a comment:

            Doug, although the patch at http://review.whamcloud.com/#/c/18051/
            says "b2_7_fe" under "branch", there is a red remark saying "Cannot Merge" under "Strategy".

            Was the patch perhaps not generated against b2_7_fe? I encountered non-trivial conflicts in:
            both modified: lnet/klnds/o2iblnd/o2iblnd.h
            both modified: lnet/klnds/o2iblnd/o2iblnd_cb.c

            doug Doug Oucharek (Inactive) added a comment:

            Jay, it has been ported to 2.7 FE as: http://review.whamcloud.com/#/c/18051/

            dmiter Dmitry Eremin (Inactive) added a comment:

            Jay, this patch is under review now, so it will soon be landed to b2_7_fe as well.
            pjones Peter Jones added a comment:

            Landed for 2.8

            gerrit Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17892/
            Subject: LU-7569 o2iblnd: avoid intensive reconnecting
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9ab698e4d99103b2fecf19b0fd3f90d28723e9d1

            simmonsja James A Simmons added a comment:

            Hmmm. Doesn't appear to be LNet related.

            2016-01-13T15:56:22.464787-05:00 c0-0c0s7n1 LustreError: 167-0: sultan-OST000e-osc-ffff880405221400: This client was evicted by sultan-OST000e; in progress operations using this service will fail.
            2016-01-13T15:56:22.464804-05:00 c0-0c0s7n1 LustreError: Skipped 11 previous similar messages
            2016-01-13T15:56:22.514072-05:00 c0-0c0s7n1 Lustre: 2541:0:(llite_lib.c:2628:ll_dirty_page_discard_warn()) sultan: dirty page discard: 10.37.248.67@o2ib1:/sultan/fid: [0x20000a898:0x21:0x0]//stf008/scratch/jsimmons/test_ior/testfile.out may get corrupted (rc -108)
            2016-01-13T15:56:22.756634-05:00 c0-0c0s7n1 Lustre: sultan-OST0025-osc-ffff880405221400: Connection restored to 10.37.248.70@o2ib1 (at 10.37.248.70@o2ib1)

            liang Liang Zhen (Inactive) added a comment:

            What does the crash look like after reverting 14600? Is it an OOM or another assertion?
            simmonsja James A Simmons added a comment (edited):

            I just tested this patch with 14600 reverted on our Cray system and the routers crashed again. So this is still a serious problem. If we lose an IB leaf switch, we lose the entire file system. IMHO this should be a blocker.

            doug Doug Oucharek (Inactive) added a comment:

            Is this still true with 14600 reverted?

            simmonsja James A Simmons added a comment:

            I'm of the opinion that this should be a blocker. Currently, without this work, an IB leaf switch reboot can take down all the LNet routers. Because of this, it needs to be slated for landing in 2.8.

            People

              doug Doug Oucharek (Inactive)
              simmonsja James A Simmons