kernel crash when network initialization failed

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.7.0
    • Lustre 2.7.0
    • 3
    • 15529

    Description

      [ 105.976884] Lustre: Lustre: Build Version: v2_6_51_0-g16c7568-CHANGED-3.10.1.el7_lustre
      [ 105.990490] LNetError: 2145:0:(socklnd.c:2660:ksocknal_enumerate_interfaces()) Can't find any usable interfaces
      [ 106.990120] LNetError: 105-4: Error -100 starting up LNI tcp
      [ 106.992703] LNetError: 2145:0:(api-ni.c:823:lnet_unprepare()) ASSERTION( list_empty(&the_lnet.ln_nis) ) failed:
      [ 106.994560] LNetError: 2145:0:(api-ni.c:823:lnet_unprepare()) LBUG
      [ 106.994561] Pid: 2145, comm: modprobe
      [ 106.994561] \x0aCall Trace:
      [ 106.994574] [<ffffffffa044f853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      [ 106.994578] [<ffffffffa044fdf5>] lbug_with_loc+0x45/0xc0 [libcfs]
      [ 106.994585] [<ffffffffa04f3267>] lnet_unprepare+0x297/0x340 [lnet]
      [ 106.994587] [<ffffffffa04f3b5c>] LNetNIInit+0x25c/0x3e0 [lnet]
      [ 106.994592] [<ffffffff81061bc6>] ? put_online_cpus+0x56/0x80
      [ 106.994631] [<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
      [ 106.994658] [<ffffffffa081310c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
      [ 106.994679] [<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
      [ 106.994703] [<ffffffffa0813291>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
      [ 106.994722] [<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
      [ 106.994739] [<ffffffffa09831c4>] init_module+0x1c4/0x1000 [ptlrpc]
      [ 106.994742] [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
      [ 106.994744] [<ffffffff810ca7fb>] load_module+0x129b/0x1a90
      [ 106.994745] [<ffffffff812da590>] ? ddebug_dyndbg_module_param_cb+0x0/0x60
      [ 106.994747] [<ffffffff810c7133>] ? copy_module_from_fd.isra.43+0x53/0x150
      [ 106.994748] [<ffffffff810cb1a6>] SyS_finit_module+0xa6/0xd0
      [ 106.994750] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
      [ 106.994750]
      [ 106.995032] Kernel panic - not syncing: LBUG
      [ 106.995034] CPU: 3 PID: 2145 Comm: modprobe Tainted: GF O-------------- 3.10.1.el7_lustre #1
      [ 106.995034] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
      [ 106.995036] ffffffffa0474d0d 00000000d711e588 ffff880036601bf0 ffffffff815e19ba
      [ 106.995037] ffff880036601c70 ffffffff815db549 ffffffff00000008 ffff880036601c80
      [ 106.995037] ffff880036601c20 00000000d711e588 ffffffffa051574f 0000000000000000
      [ 106.995038] Call Trace:
      [ 106.995052] [<ffffffff815e19ba>] dump_stack+0x19/0x1b
      [ 106.995055] [<ffffffff815db549>] panic+0xd8/0x1e7
      [ 106.995062] [<ffffffffa044fe5b>] lbug_with_loc+0xab/0xc0 [libcfs]
      [ 106.995067] [<ffffffffa04f3267>] lnet_unprepare+0x297/0x340 [lnet]
      [ 106.995070] [<ffffffffa04f3b5c>] LNetNIInit+0x25c/0x3e0 [lnet]
      [ 106.995072] [<ffffffff81061bc6>] ? put_online_cpus+0x56/0x80
      [ 106.995088] [<ffffffffa0983000>] ? 0xffffffffa0982fff
      [ 106.995113] [<ffffffffa081310c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
      [ 106.995117] [<ffffffffa0983000>] ? 0xffffffffa0982fff
      [ 106.995138] [<ffffffffa0813291>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
      [ 106.995141] [<ffffffffa0983000>] ? 0xffffffffa0982fff
      [ 106.995179] [<ffffffffa09831c4>] init_module+0x1c4/0x1000 [ptlrpc]
      [ 106.995181] [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
      [ 106.995182] [<ffffffff810ca7fb>] load_module+0x129b/0x1a90
      [ 106.995185] [<ffffffff812da590>] ? ddebug_proc_write+0xf0/0xf0
      [ 106.995186] [<ffffffff810c7133>] ? copy_module_from_fd.isra.43+0x53/0x150
      [ 106.995187] [<ffffffff810cb1a6>] SyS_finit_module+0xa6/0xd0
      [ 106.995189] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b

          Activity

            [LU-5568] kernel crash when network initialization failed

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12512/
            Subject: LU-5568 lnet: fix kernel crash when network failed to start
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 66e9055b23433bd0aa8da5e49f3b665fb1b95532

            pjones Peter Jones added a comment -

            Patch reverted. Amir could you please look into this issue? Thanks

            pjones Peter Jones added a comment -

            Landed for 2.7

            wangshilong Wang Shilong (Inactive) added a comment - - edited

            Hi Isaac Huang,

            I am sorry to bother you. Could you please help review the new version
            and give me some feedback? Thank you very much!

            Best regards,
            Wang Shilong

            isaac Isaac Huang (Inactive) added a comment -

            I think the fix should be considered together with LU-5734 - the two are closely related.

            isaac Isaac Huang (Inactive) added a comment -

            lnet_shutdown_lndnis() will not free the ni if the failure happened early, i.e. the ni hasn't been added to any global lists. For example, if it failed in the first few conditional checks in lnet_startup_lndni(), lnet_shutdown_lndnis() will not be called at all.
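
The early-failure case described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the LNet code: `struct ni`, `ni_list`, `startup_one()`, `shutdown_all()` and `leaked_after_early_failure()` are hypothetical stand-ins, and the real `lnet_startup_lndni()` and `lnet_shutdown_lndnis()` are more involved. It shows that a shutdown routine which only walks a global list cannot free an ni that never reached that list.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a network interface; not the real struct lnet_ni. */
struct ni {
	struct ni *next;
	int id;
};

static struct ni *ni_list;	/* models a global list of started interfaces */
static int live_nis;		/* allocated-but-not-yet-freed counter */

static struct ni *ni_alloc(int id)
{
	struct ni *ni = calloc(1, sizeof(*ni));

	if (ni == NULL)
		return NULL;
	ni->id = id;
	live_nis++;
	return ni;
}

static void ni_free(struct ni *ni)
{
	assert(live_nis > 0);
	free(ni);
	live_nis--;
}

/* Frees only what is reachable from the global list. */
static void shutdown_all(void)
{
	while (ni_list != NULL) {
		struct ni *ni = ni_list;

		ni_list = ni->next;
		ni_free(ni);
	}
}

/*
 * Start one interface. fail_early simulates a failed check that happens
 * before the ni is linked into the global list. free_on_early_fail selects
 * whether the startup path releases the ni itself in that case.
 */
static int startup_one(struct ni *ni, int fail_early, int free_on_early_fail)
{
	if (fail_early) {
		if (free_on_early_fail)
			ni_free(ni);
		return -1;
	}
	ni->next = ni_list;
	ni_list = ni;
	return 0;
}

/*
 * Returns how many nis are still allocated after an early startup failure
 * followed by a list-based shutdown (or -1 if allocation failed).
 * With free_on_early_fail == 0 the ni is deliberately leaked.
 */
int leaked_after_early_failure(int free_on_early_fail)
{
	struct ni *ni;

	ni_list = NULL;
	live_nis = 0;

	ni = ni_alloc(1);
	if (ni == NULL)
		return -1;
	if (startup_one(ni, 1, free_on_early_fail) != 0)
		shutdown_all();
	return live_nis;
}
```

In this model, relying on the list-based shutdown alone leaves one ni allocated after an early failure, while freeing it on the early-failure path leaves none.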

            wangshilong Wang Shilong (Inactive) added a comment -

            Hi Isaac Huang,

            Thanks very much for your comments. Could you take a look and respond to my reply
            to your last comment:

            "When we goto failed here, the ni is no longer on the nilist, then where is the ni going to be freed?"

            My reply: lnet_shutdown_lndnis() could do that?

            btw, I am not sure why, when I reply to comments at redmine, it does not send an email or something...

            wangshilong Wang Shilong (Inactive) added a comment -

            Hi Oleg Drokin,

            Yeah, I did address Isaac's comment!

            green Oleg Drokin added a comment -

            Wang, do you have plans to address Isaac's points in your patch?

            isaac Isaac Huang (Inactive) added a comment -

            The bug seems to have been introduced by commit 92c51841c50cc4061c20b277d3f7c4468f2a80cc. While the proposed patch fixes the symptom, it leaves the underlying API inconsistency unfixed: the lnet internal initialization APIs are somewhat transaction-like, in that if an init function fails it cleans up after itself, so the caller doesn't have to call the corresponding fini function, e.g. if lnet_prepare() fails there's no need to call lnet_unprepare().

            But with 92c51841c50cc4061c20b277d3f7c4468f2a80cc, lnet_shutdown_lndnis() was removed from lnet_startup_lndnis(), so now the callers of lnet_startup_lndnis() are responsible for cleaning up if lnet_startup_lndnis() has failed, which breaks the convention that init() functions clean up after themselves on failure. This kind of inconsistency will cause us trouble in the future. I'd suggest:
            1. Move lnet_shutdown_lndnis() back into lnet_startup_lndnis() so that lnet_startup_lndnis() cleans up after itself.
            2. Move the code in lnet_startup_lndnis() that starts a single NI into a new function, e.g. startup_a_single_ni().
            3. Make lnet_dyn_add_ni() call startup_a_single_ni() instead of lnet_startup_lndnis().
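
The three suggestions above can be sketched as a small userspace model. Everything here is an assumption made for illustration: `startup_single_ni()`, `startup_all_nis()`, `dyn_add_ni()`, `running_nis()` and the `started[]` array are hypothetical names, not the functions that eventually landed in lnet. The point is only the shape of the convention: the bulk init undoes its own partial work on failure, while the dynamic-add path uses the single-interface helper so that a failure there does not tear down interfaces that are already running.

```c
#include <assert.h>

#define MAX_NIS 4

/* Hypothetical model: started[i] is 1 while interface i is up. */
static int started[MAX_NIS];
static int nstarted;

/*
 * Start exactly one interface (suggestion 2). It either fully succeeds
 * or leaves no state behind. should_fail simulates a startup error.
 */
static int startup_single_ni(int idx, int should_fail)
{
	if (idx < 0 || idx >= MAX_NIS || started[idx])
		return -1;
	if (should_fail)
		return -1;
	assert(nstarted < MAX_NIS);
	started[idx] = 1;
	nstarted++;
	return 0;
}

/* Stop every interface that is currently up. */
static void shutdown_all_nis(void)
{
	for (int i = 0; i < MAX_NIS; i++)
		started[i] = 0;
	nstarted = 0;
}

/*
 * Bulk startup (suggestion 1): on any failure it shuts down whatever it
 * already started, so the caller never has to call a fini function.
 * fail_at is the index whose startup fails, or -1 for no failure.
 */
int startup_all_nis(int count, int fail_at)
{
	for (int i = 0; i < count; i++) {
		if (startup_single_ni(i, i == fail_at) != 0) {
			shutdown_all_nis();
			return -1;
		}
	}
	return 0;
}

/*
 * Dynamic add (suggestion 3): goes through the single-interface helper,
 * so a failed add leaves the running interfaces untouched.
 */
int dyn_add_ni(int idx, int should_fail)
{
	return startup_single_ni(idx, should_fail);
}

/* Number of interfaces currently up. */
int running_nis(void)
{
	return nstarted;
}
```

With this shape, a teardown-time check equivalent to "no interfaces remain" holds after a failed bulk startup without any extra work by the caller, which is the property the LBUG in the description found violated.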

            People

              ashehata Amir Shehata (Inactive)
              wangshilong Wang Shilong (Inactive)