LU-5568: kernel crash when network initialization failed

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.7.0
    • Lustre 2.7.0
    • 3
    • 15529

    Description

      [ 105.976884] Lustre: Lustre: Build Version: v2_6_51_0-g16c7568-CHANGED-3.10.1.el7_lustre
      [ 105.990490] LNetError: 2145:0:(socklnd.c:2660:ksocknal_enumerate_interfaces()) Can't find any usable interfaces
      [ 106.990120] LNetError: 105-4: Error -100 starting up LNI tcp
      [ 106.992703] LNetError: 2145:0:(api-ni.c:823:lnet_unprepare()) ASSERTION( list_empty(&the_lnet.ln_nis) ) failed:
      [ 106.994560] LNetError: 2145:0:(api-ni.c:823:lnet_unprepare()) LBUG
      [ 106.994561] Pid: 2145, comm: modprobe
      [ 106.994561] \x0aCall Trace:
      [ 106.994574] [<ffffffffa044f853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      [ 106.994578] [<ffffffffa044fdf5>] lbug_with_loc+0x45/0xc0 [libcfs]
      [ 106.994585] [<ffffffffa04f3267>] lnet_unprepare+0x297/0x340 [lnet]
      [ 106.994587] [<ffffffffa04f3b5c>] LNetNIInit+0x25c/0x3e0 [lnet]
      [ 106.994592] [<ffffffff81061bc6>] ? put_online_cpus+0x56/0x80
      [ 106.994631] [<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
      [ 106.994658] [<ffffffffa081310c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
      [ 106.994679] [<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
      [ 106.994703] [<ffffffffa0813291>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
      [ 106.994722] [<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
      [ 106.994739] [<ffffffffa09831c4>] init_module+0x1c4/0x1000 [ptlrpc]
      [ 106.994742] [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
      [ 106.994744] [<ffffffff810ca7fb>] load_module+0x129b/0x1a90
      [ 106.994745] [<ffffffff812da590>] ? ddebug_dyndbg_module_param_cb+0x0/0x60
      [ 106.994747] [<ffffffff810c7133>] ? copy_module_from_fd.isra.43+0x53/0x150
      [ 106.994748] [<ffffffff810cb1a6>] SyS_finit_module+0xa6/0xd0
      [ 106.994750] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
      [ 106.994750]
      [ 106.995032] Kernel panic - not syncing: LBUG
      [ 106.995034] CPU: 3 PID: 2145 Comm: modprobe Tainted: GF O-------------- 3.10.1.el7_lustre #1
      [ 106.995034] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
      [ 106.995036] ffffffffa0474d0d 00000000d711e588 ffff880036601bf0 ffffffff815e19ba
      [ 106.995037] ffff880036601c70 ffffffff815db549 ffffffff00000008 ffff880036601c80
      [ 106.995037] ffff880036601c20 00000000d711e588 ffffffffa051574f 0000000000000000
      [ 106.995038] Call Trace:
      [ 106.995052] [<ffffffff815e19ba>] dump_stack+0x19/0x1b
      [ 106.995055] [<ffffffff815db549>] panic+0xd8/0x1e7
      [ 106.995062] [<ffffffffa044fe5b>] lbug_with_loc+0xab/0xc0 [libcfs]
      [ 106.995067] [<ffffffffa04f3267>] lnet_unprepare+0x297/0x340 [lnet]
      [ 106.995070] [<ffffffffa04f3b5c>] LNetNIInit+0x25c/0x3e0 [lnet]
      [ 106.995072] [<ffffffff81061bc6>] ? put_online_cpus+0x56/0x80
      [ 106.995088] [<ffffffffa0983000>] ? 0xffffffffa0982fff
      [ 106.995113] [<ffffffffa081310c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
      [ 106.995117] [<ffffffffa0983000>] ? 0xffffffffa0982fff
      [ 106.995138] [<ffffffffa0813291>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
      [ 106.995141] [<ffffffffa0983000>] ? 0xffffffffa0982fff
      [ 106.995179] [<ffffffffa09831c4>] init_module+0x1c4/0x1000 [ptlrpc]
      [ 106.995181] [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
      [ 106.995182] [<ffffffff810ca7fb>] load_module+0x129b/0x1a90
      [ 106.995185] [<ffffffff812da590>] ? ddebug_proc_write+0xf0/0xf0
      [ 106.995186] [<ffffffff810c7133>] ? copy_module_from_fd.isra.43+0x53/0x150
      [ 106.995187] [<ffffffff810cb1a6>] SyS_finit_module+0xa6/0xd0
      [ 106.995189] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
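
      Reading the log: LNI startup for tcp fails with -100 (ENETDOWN on Linux), and the cleanup in lnet_unprepare() then trips ASSERTION( list_empty(&the_lnet.ln_nis) ), so the list was evidently not empty when cleanup ran. The sketch below is a small userspace model of that invariant, not the Lustre code. All model_* names are hypothetical, and it assumes an NI was still linked on the list when startup failed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model for illustration; these are not the LNet definitions. */
#define MODEL_ENETDOWN 100      /* Linux ENETDOWN, matching "Error -100" */

struct model_ni {
    struct model_ni *next;
};

static struct model_ni *model_nis;      /* stands in for the_lnet.ln_nis */

static void model_add_ni(struct model_ni *ni)
{
    assert(ni != NULL);
    ni->next = model_nis;
    model_nis = ni;
}

/* Unlink every NI, as a shutdown pass over the list would. */
static void model_shutdown_nis(void)
{
    while (model_nis != NULL)
        model_nis = model_nis->next;
}

/* LND startup that always fails, like socklnd with no usable interface. */
static int model_start_lnd(void)
{
    return -MODEL_ENETDOWN;
}

/*
 * Returns 1 if the NI list is empty at the point where teardown would
 * assert on it, 0 if the assertion would fire.
 */
static int teardown_safe_after_failed_init(int drain_on_error)
{
    static struct model_ni tcp_ni;
    int rc;

    model_nis = NULL;
    model_add_ni(&tcp_ni);

    rc = model_start_lnd();
    if (rc != 0 && drain_on_error)
        model_shutdown_nis();

    return model_nis == NULL;
}
```

      In this model, teardown_safe_after_failed_init(0) reports that the emptiness check would fail, while teardown_safe_after_failed_init(1) drains the list first and passes it. Whether the real fix takes this shape is not stated in the ticket.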

          Activity

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12512/
            Subject: LU-5568 lnet: fix kernel crash when network failed to start
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 66e9055b23433bd0aa8da5e49f3b665fb1b95532

            pjones Peter Jones added a comment -

            Patch reverted. Amir, could you please look into this issue? Thanks.

            pjones Peter Jones added a comment -

            Landed for 2.7

            wangshilong Wang Shilong (Inactive) added a comment - edited

            Hi Isaac Huang,

            Sorry to bother you. Could you please review the new version
            and give me some feedback? Thank you very much!

            Best regards,
            Wang Shilong

            isaac Isaac Huang (Inactive) added a comment -

            I think the fix should be considered together with LU-5734 - the two are closely related.

            isaac Isaac Huang (Inactive) added a comment -

            lnet_shutdown_lndnis() will not free the ni if the failure happened early, i.e. the ni hasn't been added to any global lists. For example, if it failed in the first few conditional checks in lnet_startup_lndni(), lnet_shutdown_lndnis() will not be called at all.
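
            The ownership point in this comment can be illustrated with a minimal userspace sketch. It is not the LNet implementation: struct ni, ni_list, startup_ni() and shutdown_listed_nis() are hypothetical stand-ins, and the only behaviour modelled is that a list-driven shutdown can free an object only if it was linked on the list first.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the real LNet structures. */
struct ni {
    struct ni *next;
};

static struct ni *ni_list;      /* models a global NI list */
static int live_nis;            /* NIs allocated and not yet freed */

static struct ni *ni_alloc(void)
{
    struct ni *ni = calloc(1, sizeof(*ni));

    if (ni != NULL)
        live_nis++;
    return ni;
}

static void ni_free(struct ni *ni)
{
    assert(ni != NULL);
    free(ni);
    live_nis--;
}

/* Frees only what is linked on the list, like a list-driven shutdown. */
static void shutdown_listed_nis(void)
{
    while (ni_list != NULL) {
        struct ni *ni = ni_list;

        ni_list = ni->next;
        ni_free(ni);
    }
}

/*
 * fail_early models a failure in the first checks, before the NI is
 * linked on the list. free_on_early_failure selects whether the error
 * path releases the NI itself.
 */
static int startup_ni(int fail_early, int free_on_early_failure)
{
    struct ni *ni = ni_alloc();

    if (ni == NULL)
        return -1;
    if (fail_early) {
        if (free_on_early_failure)
            ni_free(ni);
        return -1;
    }
    ni->next = ni_list;
    ni_list = ni;
    return 0;
}

/* Number of NIs still allocated after one startup/shutdown cycle. */
static int leaked_after_cycle(int fail_early, int free_on_early_failure)
{
    live_nis = 0;
    startup_ni(fail_early, free_on_early_failure);
    shutdown_listed_nis();
    return live_nis;
}
```

            In this model, an early failure leaks the object unless the error path frees it directly: leaked_after_cycle(1, 0) leaves one NI allocated, while leaked_after_cycle(1, 1) and the success case leaked_after_cycle(0, 0) leave none.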

            wangshilong Wang Shilong (Inactive) added a comment -

            Hi Isaac Huang,

            Thanks very much for your comments. Could you take a look and respond to my reply
            to your last comment:

            "When we goto failed here, the ni is no longer on the nilist, then where is the ni going to be freed?"

            My reply: couldn't lnet_shutdown_lndnis() do that?

            By the way, I am not sure why, when I reply to comments on redmine, no email or other notification is sent...

            wangshilong Wang Shilong (Inactive) added a comment -

            Hi Oleg Drokin,

            Yes, I did address Isaac's comment!

            People

              ashehata Amir Shehata (Inactive)
              wangshilong Wang Shilong (Inactive)
              Votes: 0
              Watchers: 8