Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0, Lustre 2.4.3, Lustre 2.5.3
    • Labels: None
    • Environment: Git repo can be found at https://github.com/jlan/lustre-nas
      Server: centos 6.4 2.6.32-358.23.2.el6, lustre 2.4.3-12nasS
      Client: sles11sp3 3.0.101-0.31.1, lustre 2.4.3-11nasC
    • Severity: 3
    • Rank (Obsolete): 17274

    Description

      We experienced a network problem yesterday. As a result, a number of clients stalled, and at least four of them hung. We captured a vmcore on one of the systems.

      Console logs showed one of the CPUs was detected to stall:
      "INFO: rcu_sched_state detected stall on CPU 9."

      On r305i7n2, every CPU except CPU 9 was running the migration process,
      and CPU 9 (the one flagged by rcu_sched_state) was running obd_zombid.
      The console logs of the other three systems confirm that their stalled
      CPUs were also running obd_zombid, but without a vmcore I cannot say for
      sure that their other CPUs were running 'migration' as on r305i7n2.

      The stack trace is:

      PID: 5070 TASK: ffff88046f086300 CPU: 9 COMMAND: "obd_zombid"
      #0 [ffff88087fc27e40] crash_nmi_callback at ffffffff810245af
      #1 [ffff88087fc27e50] notifier_call_chain at ffffffff81475847
      #2 [ffff88087fc27e80] __atomic_notifier_call_chain at ffffffff8147588d
      #3 [ffff88087fc27e90] notify_die at ffffffff814758dd
      #4 [ffff88087fc27ec0] default_do_nmi at ffffffff81472d37
      #5 [ffff88087fc27ee0] do_nmi at ffffffff81472f68
      #6 [ffff88087fc27ef0] restart_nmi at ffffffff814724b1
      [exception RIP: native_halt+1]
      RIP: ffffffff810300b1 RSP: ffff88087fc23de0 RFLAGS: 00000082
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000080f
      RDX: 0000000000000000 RSI: 00000000000000ff RDI: 000000000000080f
      RBP: ffff88046d96fd78 R8: 0000000000000150 R9: ffffe8ffffc20738
      R10: 0000000000000006 R11: ffffffff8102b430 R12: 0000000000000000
      R13: 0000000000000006 R14: 0000000000000006 R15: 00000000fffffffb
      ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
      --- <NMI exception stack> ---
      #7 [ffff88087fc23de0] native_halt at ffffffff810300b1
      #8 [ffff88087fc23de0] halt_current_cpu at ffffffff81024959
      #9 [ffff88087fc23df0] lkdb_main_loop at ffffffff812548ec
      #10 [ffff88087fc23ef0] kdba_main_loop at ffffffff8139bef2
      #11 [ffff88087fc23f20] kdb at ffffffff8125199f
      #12 [ffff88087fc23f80] kdb_ipi at ffffffff8124ea07
      #13 [ffff88087fc23f90] smp_kdb_interrupt at ffffffff8139b656
      #14 [ffff88087fc23fb0] kdb_interrupt at ffffffff8147aca3
      --- <IRQ stack> ---
      #15 [ffff88046d96fd78] kdb_interrupt at ffffffff8147aca3
      [exception RIP: _raw_spin_lock+24]
      RIP: ffffffff81471a88 RSP: ffff88046d96fe28 RFLAGS: 00000206
      RAX: 0000000000001700 RBX: ffff880867d28810 RCX: ffff880856c3be00
      RDX: 0000000000008000 RSI: ffff880856c3be00 RDI: ffff880430b100f8
      RBP: ffff880864634078 R8: 0000000000000002 R9: 0000000000000000
      R10: 0000000010000008 R11: 0000000000000000 R12: ffffffff8147ac9e
      R13: ffffffff811458be R14: ffff880867d28810 R15: 0000000000000206
      ORIG_RAX: ffffffffffffff01 CS: 0010 SS: 0018
      #16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
      #17 [ffff88046d96fe38] class_decref at ffffffffa076eed4 [obdclass]
      #18 [ffff88046d96fea8] class_export_destroy at ffffffffa074c1de [obdclass]
      #19 [ffff88046d96fec8] obd_zombie_impexp_cull at ffffffffa074c61d [obdclass]
      #20 [ffff88046d96fee8] obd_zombie_impexp_thread at ffffffffa074c7bd [obdclass]
      #21 [ffff88046d96ff48] kernel_thread_helper at ffffffff8147aae4

          Activity

            [LU-6173] CPU stalled with obd_zombid running
            pjones Peter Jones added a comment -

            Yes this is being worked on

            jaylan Jay Lan (Inactive) added a comment -

            Could you provide a 2.5 backport? Thanks!

            pjones Peter Jones added a comment -

            Landed for 2.8

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13746/
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 302c5bfebe61e988dbd27063becc4ef90befc6df

            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13746
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 570a48915a6935b8d180dafded4befaa2447b585

            emoly.liu Emoly Liu added a comment -

            Peter, yes, both master and b2_5 need the patch. I will create one for master later.

            pjones Peter Jones added a comment -

            Emoly

            Is this patch also required for master/b2_5?

            Peter

            emoly.liu Emoly Liu added a comment -

            Thanks to Niu and Oleg for their help! I pushed a patch for b2_4 for review.

            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13727
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: b2_4
            Current Patch Set: 1
            Commit: ae23e1e99d072c3865ca2da538705eb61fc6c7c2

            green Oleg Drokin added a comment -

            Niu: It's right in the __ptlrpc_request_alloc():

                            request->rq_import = class_import_get(imp);
            

            and the import stays put until all requests are drained, which might take a while if the requests are stuck on the network.
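
The reference-holding behaviour described above can be illustrated with a small stand-alone sketch. This is not Lustre code: the `demo_*` names and structures are hypothetical, and the counter is a plain `int` (single-threaded) rather than whatever atomic type the real import uses. It only shows the general pattern, in which each request takes a reference on the import at allocation time, so the import cannot be released until the last outstanding request finishes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration only; not the Lustre structures. */
struct demo_import {
    int   refcount;  /* owner's reference plus one per in-flight request */
    bool *freed;     /* set to true on release so callers can observe it */
};

struct demo_request {
    struct demo_import *rq_import;  /* pinned for the request's lifetime */
};

/* Create an import holding a single reference owned by the caller. */
static struct demo_import *demo_import_new(bool *freed)
{
    struct demo_import *imp = malloc(sizeof(*imp));
    if (imp == NULL)
        return NULL;
    imp->refcount = 1;
    imp->freed = freed;
    *freed = false;
    return imp;
}

/* Take an extra reference and hand the same pointer back. */
static struct demo_import *demo_import_get(struct demo_import *imp)
{
    imp->refcount++;
    return imp;
}

/* Drop one reference; the import is released only when none remain. */
static void demo_import_put(struct demo_import *imp)
{
    assert(imp->refcount > 0);
    if (--imp->refcount == 0) {
        *imp->freed = true;
        free(imp);
    }
}

/* Allocating a request pins the import, mirroring the line quoted above. */
static struct demo_request *demo_request_alloc(struct demo_import *imp)
{
    struct demo_request *req = malloc(sizeof(*req));
    if (req == NULL)
        return NULL;
    req->rq_import = demo_import_get(imp);
    return req;
}

/* Finishing a request drops its reference on the import. */
static void demo_request_finish(struct demo_request *req)
{
    demo_import_put(req->rq_import);
    free(req);
}
```

In this sketch, if the owner calls `demo_import_put()` while two requests are still outstanding, the import survives until both have gone through `demo_request_finish()`. A request stuck on the network would delay the release in the same way.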

            People

              Assignee: emoly.liu Emoly Liu
              Reporter: jaylan Jay Lan (Inactive)
              Votes: 0
              Watchers: 7
