Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0
    • Lustre 2.7.0, Lustre 2.4.3, Lustre 2.5.3
    • None
    • Git repo can be found at https://github.com/jlan/lustre-nas
      Server: centos 6.4 2.6.32-358.23.2.el6, lustre 2.4.3-12nasS
      Client: sles11sp3 3.0.101-0.31.1, lustre 2.4.3-11nasC
    • 3
    • 17274

    Description

      Yesterday we experienced a network problem. As a result, a number of clients stalled, and at least four of them hung. We captured a vmcore on one of the systems.

      Console logs showed one of the CPUs was detected to stall:
      "INFO: rcu_sched_state detected stall on CPU 9."

      On r305i7n2, all CPUs except CPU 9 were running the migration process, and
      the CPU flagged by rcu_sched_state (CPU 9) was running obd_zombid.
      The console logs of the other three systems confirmed that their stalled CPUs
      were also running obd_zombid, but without a vmcore I cannot say for sure that
      their other CPUs were running 'migration' as on r305i7n2.
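When several clients stall at once, it can help to pull the stalled CPU numbers out of the console logs automatically. The following is a minimal sketch, assuming the stall message has exactly the wording quoted above; other kernel versions may phrase RCU stall warnings differently. The function name `find_rcu_stalls` is hypothetical and not part of any Lustre or kernel tool.

```python
import re

# Pattern based only on the console message quoted in this report
# ("INFO: rcu_sched_state detected stall on CPU 9."); this is an assumption,
# not a general RCU stall parser.
STALL_RE = re.compile(r"INFO: (\w+) detected stall on CPU (\d+)")


def find_rcu_stalls(console_text):
    """Return a list of (rcu_state_name, cpu_number) for each stall message."""
    return [(m.group(1), int(m.group(2))) for m in STALL_RE.finditer(console_text)]


if __name__ == "__main__":
    sample = "INFO: rcu_sched_state detected stall on CPU 9.\n"
    print(find_rcu_stalls(sample))
```

Running this over the console log of each hung client gives a quick list of which CPUs to look at first in a vmcore or sysrq dump.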

      The stack trace is:

      PID: 5070 TASK: ffff88046f086300 CPU: 9 COMMAND: "obd_zombid"
      #0 [ffff88087fc27e40] crash_nmi_callback at ffffffff810245af
      #1 [ffff88087fc27e50] notifier_call_chain at ffffffff81475847
      #2 [ffff88087fc27e80] __atomic_notifier_call_chain at ffffffff8147588d
      #3 [ffff88087fc27e90] notify_die at ffffffff814758dd
      #4 [ffff88087fc27ec0] default_do_nmi at ffffffff81472d37
      #5 [ffff88087fc27ee0] do_nmi at ffffffff81472f68
      #6 [ffff88087fc27ef0] restart_nmi at ffffffff814724b1
      [exception RIP: native_halt+1]
      RIP: ffffffff810300b1 RSP: ffff88087fc23de0 RFLAGS: 00000082
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000080f
      RDX: 0000000000000000 RSI: 00000000000000ff RDI: 000000000000080f
      RBP: ffff88046d96fd78 R8: 0000000000000150 R9: ffffe8ffffc20738
      R10: 0000000000000006 R11: ffffffff8102b430 R12: 0000000000000000
      R13: 0000000000000006 R14: 0000000000000006 R15: 00000000fffffffb
      ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
      --- <NMI exception stack> ---
      #7 [ffff88087fc23de0] native_halt at ffffffff810300b1
      #8 [ffff88087fc23de0] halt_current_cpu at ffffffff81024959
      #9 [ffff88087fc23df0] lkdb_main_loop at ffffffff812548ec
      #10 [ffff88087fc23ef0] kdba_main_loop at ffffffff8139bef2
      #11 [ffff88087fc23f20] kdb at ffffffff8125199f
      #12 [ffff88087fc23f80] kdb_ipi at ffffffff8124ea07
      #13 [ffff88087fc23f90] smp_kdb_interrupt at ffffffff8139b656
      #14 [ffff88087fc23fb0] kdb_interrupt at ffffffff8147aca3
      --- <IRQ stack> ---
      #15 [ffff88046d96fd78] kdb_interrupt at ffffffff8147aca3
      [exception RIP: _raw_spin_lock+24]
      RIP: ffffffff81471a88 RSP: ffff88046d96fe28 RFLAGS: 00000206
      RAX: 0000000000001700 RBX: ffff880867d28810 RCX: ffff880856c3be00
      RDX: 0000000000008000 RSI: ffff880856c3be00 RDI: ffff880430b100f8
      RBP: ffff880864634078 R8: 0000000000000002 R9: 0000000000000000
      R10: 0000000010000008 R11: 0000000000000000 R12: ffffffff8147ac9e
      R13: ffffffff811458be R14: ffff880867d28810 R15: 0000000000000206
      ORIG_RAX: ffffffffffffff01 CS: 0010 SS: 0018
      #16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
      #17 [ffff88046d96fe38] class_decref at ffffffffa076eed4 [obdclass]
      #18 [ffff88046d96fea8] class_export_destroy at ffffffffa074c1de [obdclass]
      #19 [ffff88046d96fec8] obd_zombie_impexp_cull at ffffffffa074c61d [obdclass]
      #20 [ffff88046d96fee8] obd_zombie_impexp_thread at ffffffffa074c7bd [obdclass]
      #21 [ffff88046d96ff48] kernel_thread_helper at ffffffff8147aae4
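The interesting part of the trace is the second exception record: the thread was interrupted while spinning in _raw_spin_lock, called from osc_cleanup. As a rough sketch, that pairing can be extracted from a saved backtrace as shown below. The regular expressions are assumptions based only on the frame and exception lines shown in this report, and `exception_sites` is a hypothetical helper name.

```python
import re

# Frame lines in this report look like:
#   #16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
# The trailing "[module]" is optional (absent for core kernel symbols).
FRAME_RE = re.compile(
    r"#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)(?:\s+\[(\w+)\])?"
)
# Exception lines in this report look like:
#   [exception RIP: _raw_spin_lock+24]
EXC_RE = re.compile(r"\[exception RIP: ([\w.]+)\+(\d+)\]")


def exception_sites(backtrace_text):
    """Pair each exception RIP symbol with the next stack frame listed.

    Returns a list of (interrupted_symbol, next_frame_symbol, module_or_None).
    """
    sites = []
    pending = None
    for line in backtrace_text.splitlines():
        exc = EXC_RE.search(line)
        if exc:
            pending = exc.group(1)
            continue
        frame = FRAME_RE.search(line)
        if frame and pending is not None:
            sites.append((pending, frame.group(3), frame.group(5)))
            pending = None
    return sites


if __name__ == "__main__":
    sample = """\
#15 [ffff88046d96fd78] kdb_interrupt at ffffffff8147aca3
[exception RIP: _raw_spin_lock+24]
RIP: ffffffff81471a88 RSP: ffff88046d96fe28 RFLAGS: 00000206
#16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
#17 [ffff88046d96fe38] class_decref at ffffffffa076eed4 [obdclass]
"""
    print(exception_sites(sample))
    # prints [('_raw_spin_lock', 'osc_cleanup', 'osc')]
```

Comparing this output across vmcores or sysrq dumps from several clients is one way to check whether they are all stuck at the same lock acquisition.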


          Activity

            [LU-6173] CPU stalled with obd_zombid running
            pjones Peter Jones added a comment -

            Yes this is being worked on


            jaylan Jay Lan (Inactive) added a comment -

            Could you provide a 2.5 back port? Thanks!
            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13746/
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 302c5bfebe61e988dbd27063becc4ef90befc6df

            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13746
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 570a48915a6935b8d180dafded4befaa2447b585
            emoly.liu Emoly Liu added a comment -

            Peter, yes both master and b2_5 need the patch. I will create one for master later.

            pjones Peter Jones added a comment -

            Emoly

            Is this patch also required for master/b2_5?

            Peter


            People

              emoly.liu Emoly Liu
              jaylan Jay Lan (Inactive)