Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0, Lustre 2.4.3, Lustre 2.5.3
    • Labels: None
    • Environment: Git repo can be found at https://github.com/jlan/lustre-nas
      Server: centos 6.4 2.6.32-358.23.2.el6, lustre 2.4.3-12nasS
      Client: sles11sp3 3.0.101-0.31.1, lustre 2.4.3-11nasC
    • Severity: 3
    • Rank (Obsolete): 17274

    Description

      We experienced a network problem yesterday. As a result, a number of clients stalled, and at least four of them hung. We captured a vmcore on one of the systems.

      Console logs showed one of the CPUs was detected to stall:
      "INFO: rcu_sched_state detected stall on CPU 9."

      On r305i7n2, every CPU except CPU 9 was running a migration process, and
      the CPU flagged by rcu_sched_state (CPU 9) was running obd_zombid.
      The console logs of the other three systems confirmed that their stalled
      CPUs were also running obd_zombid, but without a vmcore I cannot say for
      sure that their other CPUs were running 'migration' as on r305i7n2.

      The stack trace is:

      PID: 5070 TASK: ffff88046f086300 CPU: 9 COMMAND: "obd_zombid"
      #0 [ffff88087fc27e40] crash_nmi_callback at ffffffff810245af
      #1 [ffff88087fc27e50] notifier_call_chain at ffffffff81475847
      #2 [ffff88087fc27e80] __atomic_notifier_call_chain at ffffffff8147588d
      #3 [ffff88087fc27e90] notify_die at ffffffff814758dd
      #4 [ffff88087fc27ec0] default_do_nmi at ffffffff81472d37
      #5 [ffff88087fc27ee0] do_nmi at ffffffff81472f68
      #6 [ffff88087fc27ef0] restart_nmi at ffffffff814724b1
      [exception RIP: native_halt+1]
      RIP: ffffffff810300b1 RSP: ffff88087fc23de0 RFLAGS: 00000082
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000080f
      RDX: 0000000000000000 RSI: 00000000000000ff RDI: 000000000000080f
      RBP: ffff88046d96fd78 R8: 0000000000000150 R9: ffffe8ffffc20738
      R10: 0000000000000006 R11: ffffffff8102b430 R12: 0000000000000000
      R13: 0000000000000006 R14: 0000000000000006 R15: 00000000fffffffb
      ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
      --- <NMI exception stack> ---
      #7 [ffff88087fc23de0] native_halt at ffffffff810300b1
      #8 [ffff88087fc23de0] halt_current_cpu at ffffffff81024959
      #9 [ffff88087fc23df0] lkdb_main_loop at ffffffff812548ec
      #10 [ffff88087fc23ef0] kdba_main_loop at ffffffff8139bef2
      #11 [ffff88087fc23f20] kdb at ffffffff8125199f
      #12 [ffff88087fc23f80] kdb_ipi at ffffffff8124ea07
      #13 [ffff88087fc23f90] smp_kdb_interrupt at ffffffff8139b656
      #14 [ffff88087fc23fb0] kdb_interrupt at ffffffff8147aca3
      --- <IRQ stack> ---
      #15 [ffff88046d96fd78] kdb_interrupt at ffffffff8147aca3
      [exception RIP: _raw_spin_lock+24]
      RIP: ffffffff81471a88 RSP: ffff88046d96fe28 RFLAGS: 00000206
      RAX: 0000000000001700 RBX: ffff880867d28810 RCX: ffff880856c3be00
      RDX: 0000000000008000 RSI: ffff880856c3be00 RDI: ffff880430b100f8
      RBP: ffff880864634078 R8: 0000000000000002 R9: 0000000000000000
      R10: 0000000010000008 R11: 0000000000000000 R12: ffffffff8147ac9e
      R13: ffffffff811458be R14: ffff880867d28810 R15: 0000000000000206
      ORIG_RAX: ffffffffffffff01 CS: 0010 SS: 0018
      #16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
      #17 [ffff88046d96fe38] class_decref at ffffffffa076eed4 [obdclass]
      #18 [ffff88046d96fea8] class_export_destroy at ffffffffa074c1de [obdclass]
      #19 [ffff88046d96fec8] obd_zombie_impexp_cull at ffffffffa074c61d [obdclass]
      #20 [ffff88046d96fee8] obd_zombie_impexp_thread at ffffffffa074c7bd [obdclass]
      #21 [ffff88046d96ff48] kernel_thread_helper at ffffffff8147aae4
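
      In the trace above, the last exception RIP is _raw_spin_lock+24 and the next frame is osc_cleanup, which suggests obd_zombid was spinning on a lock taken from osc_cleanup. When comparing traces from several nodes, a small script can pull out that pairing. The following is a minimal sketch, assuming the trace has been saved as plain text from crash's `bt` output. The regular expressions and the `summarize` helper are illustrative and are not part of crash or Lustre.

```python
import re

# Matches frame lines such as:
#   #16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
FRAME_RE = re.compile(
    r"#(?P<num>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<func>\S+)\s+at\s+(?P<addr>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)
# Matches lines such as: [exception RIP: _raw_spin_lock+24]
EXC_RE = re.compile(r"\[exception RIP:\s+(?P<func>[^\]+\s]+)(?:\+(?P<off>\d+))?\]")


def summarize(trace):
    """Return (frames, exceptions).

    frames is a list of dicts (num, func, module); exceptions is a list of
    (interrupted_function, next_frame_function) pairs, in trace order.
    """
    frames = []
    exceptions = []
    pending = None  # exception RIP still waiting for the next frame line
    for line in trace.splitlines():
        exc = EXC_RE.search(line)
        if exc:
            pending = exc.group("func")
            continue
        m = FRAME_RE.search(line)
        if m:
            frame = {
                "num": int(m.group("num")),
                "func": m.group("func"),
                "module": m.group("module"),  # None for core kernel symbols
            }
            frames.append(frame)
            if pending is not None:
                exceptions.append((pending, frame["func"]))
                pending = None
    return frames, exceptions


# Excerpt copied from the trace in this report.
SAMPLE = """\
#15 [ffff88046d96fd78] kdb_interrupt at ffffffff8147aca3
[exception RIP: _raw_spin_lock+24]
RIP: ffffffff81471a88 RSP: ffff88046d96fe28 RFLAGS: 00000206
#16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
#17 [ffff88046d96fe38] class_decref at ffffffffa076eed4 [obdclass]
"""

frames, exceptions = summarize(SAMPLE)
for interrupted, caller in exceptions:
    print(f"interrupted in {interrupted}, next frame: {caller}")
```

      Running the same helper over the console logs from the other stalled clients would show whether they stopped at the same place, although a console log may not use exactly the same line format as crash.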

            People

              emoly.liu Emoly Liu
              jaylan Jay Lan (Inactive)