Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0, Lustre 2.4.3, Lustre 2.5.3
    • Labels: None
    • Environment: Git repo can be found at https://github.com/jlan/lustre-nas
      Server: CentOS 6.4, kernel 2.6.32-358.23.2.el6, Lustre 2.4.3-12nasS
      Client: SLES 11 SP3, kernel 3.0.101-0.31.1, Lustre 2.4.3-11nasC
    • Severity: 3
    • 17274

    Description

      Yesterday we experienced a network problem, and consequently a number of clients stalled; at least four hung in this situation. We captured a vmcore on one of the systems.

      Console logs showed one of the CPUs was detected to stall:
      "INFO: rcu_sched_state detected stall on CPU 9."

      All CPUs on r305i7n2 except CPU 9 were running the migration process, and
      the CPU flagged by rcu_sched_state was running obd_zombid.
      The console logs of the other three systems confirmed that the stalled CPU
      was also running obd_zombid, but without a vmcore I cannot say for sure
      that the other CPUs were running 'migration' as on r305i7n2.

      The stack trace is:

      PID: 5070 TASK: ffff88046f086300 CPU: 9 COMMAND: "obd_zombid"
      #0 [ffff88087fc27e40] crash_nmi_callback at ffffffff810245af
      #1 [ffff88087fc27e50] notifier_call_chain at ffffffff81475847
      #2 [ffff88087fc27e80] __atomic_notifier_call_chain at ffffffff8147588d
      #3 [ffff88087fc27e90] notify_die at ffffffff814758dd
      #4 [ffff88087fc27ec0] default_do_nmi at ffffffff81472d37
      #5 [ffff88087fc27ee0] do_nmi at ffffffff81472f68
      #6 [ffff88087fc27ef0] restart_nmi at ffffffff814724b1
      [exception RIP: native_halt+1]
      RIP: ffffffff810300b1 RSP: ffff88087fc23de0 RFLAGS: 00000082
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000080f
      RDX: 0000000000000000 RSI: 00000000000000ff RDI: 000000000000080f
      RBP: ffff88046d96fd78 R8: 0000000000000150 R9: ffffe8ffffc20738
      R10: 0000000000000006 R11: ffffffff8102b430 R12: 0000000000000000
      R13: 0000000000000006 R14: 0000000000000006 R15: 00000000fffffffb
      ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
      --- <NMI exception stack> ---
      #7 [ffff88087fc23de0] native_halt at ffffffff810300b1
      #8 [ffff88087fc23de0] halt_current_cpu at ffffffff81024959
      #9 [ffff88087fc23df0] lkdb_main_loop at ffffffff812548ec
      #10 [ffff88087fc23ef0] kdba_main_loop at ffffffff8139bef2
      #11 [ffff88087fc23f20] kdb at ffffffff8125199f
      #12 [ffff88087fc23f80] kdb_ipi at ffffffff8124ea07
      #13 [ffff88087fc23f90] smp_kdb_interrupt at ffffffff8139b656
      #14 [ffff88087fc23fb0] kdb_interrupt at ffffffff8147aca3
      --- <IRQ stack> ---
      #15 [ffff88046d96fd78] kdb_interrupt at ffffffff8147aca3
      [exception RIP: _raw_spin_lock+24]
      RIP: ffffffff81471a88 RSP: ffff88046d96fe28 RFLAGS: 00000206
      RAX: 0000000000001700 RBX: ffff880867d28810 RCX: ffff880856c3be00
      RDX: 0000000000008000 RSI: ffff880856c3be00 RDI: ffff880430b100f8
      RBP: ffff880864634078 R8: 0000000000000002 R9: 0000000000000000
      R10: 0000000010000008 R11: 0000000000000000 R12: ffffffff8147ac9e
      R13: ffffffff811458be R14: ffff880867d28810 R15: 0000000000000206
      ORIG_RAX: ffffffffffffff01 CS: 0010 SS: 0018
      #16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
      #17 [ffff88046d96fe38] class_decref at ffffffffa076eed4 [obdclass]
      #18 [ffff88046d96fea8] class_export_destroy at ffffffffa074c1de [obdclass]
      #19 [ffff88046d96fec8] obd_zombie_impexp_cull at ffffffffa074c61d [obdclass]
      #20 [ffff88046d96fee8] obd_zombie_impexp_thread at ffffffffa074c7bd [obdclass]
      #21 [ffff88046d96ff48] kernel_thread_helper at ffffffff8147aae4

    Attachments

    Issue Links

    Activity
            [LU-6173] CPU stalled with obd_zombid running
            pjones Peter Jones added a comment -

            Landed for 2.8

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13746/
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 302c5bfebe61e988dbd27063becc4ef90befc6df

            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13746
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 570a48915a6935b8d180dafded4befaa2447b585

            emoly.liu Emoly Liu added a comment -

            Peter, yes both master and b2_5 need the patch. I will create one for master later.

            pjones Peter Jones added a comment -

            Emoly

            Is this patch also required for master/b2_5?

            Peter

            emoly.liu Emoly Liu added a comment -

            Thanks for Niu's and Oleg's help! I pushed a patch for b2_4 for review.

            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13727
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: b2_4
            Current Patch Set: 1
            Commit: ae23e1e99d072c3865ca2da538705eb61fc6c7c2

            green Oleg Drokin added a comment -

            Niu: It's right in the __ptlrpc_request_alloc():

                            request->rq_import = class_import_get(imp);
            

            and the import stays put until all requests are drained, which might take a while if the requests are stuck on the network.
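
            For illustration, here is a minimal standalone sketch of that refcounting pattern; struct import, import_get()/import_put() and zombie_import_add() are hypothetical stand-ins for Lustre's obd_import, class_import_get()/class_import_put() and obd_zombie_import_add(), not the actual kernel code:

                #include <stdatomic.h>
                #include <stdio.h>
                #include <stdlib.h>

                /* Hypothetical stand-in for struct obd_import. */
                struct import {
                        atomic_int refcount;
                };

                struct request {
                        struct import *rq_import;
                };

                static struct import *import_get(struct import *imp)
                {
                        atomic_fetch_add(&imp->refcount, 1);
                        return imp;
                }

                static void zombie_import_add(struct import *imp)
                {
                        /* In Lustre this queues the import for obd_zombid
                         * to reap; here we simply free it. */
                        printf("last reference dropped, queueing zombie cleanup\n");
                        free(imp);
                }

                static void import_put(struct import *imp)
                {
                        if (atomic_fetch_sub(&imp->refcount, 1) == 1)
                                zombie_import_add(imp);
                }

                /* Mirrors the quoted line from __ptlrpc_request_alloc():
                 * every request takes its own reference on the import. */
                static struct request *request_alloc(struct import *imp)
                {
                        struct request *req = malloc(sizeof(*req));
                        req->rq_import = import_get(imp);
                        return req;
                }

                static void request_free(struct request *req)
                {
                        import_put(req->rq_import);
                        free(req);
                }

                int main(void)
                {
                        struct import *imp = malloc(sizeof(*imp));
                        atomic_init(&imp->refcount, 1);  /* owner's reference */

                        struct request *req = request_alloc(imp);

                        import_put(imp);   /* "disconnect": owner drops its ref... */
                        request_free(req); /* ...but the import is only reaped when
                                            * the last in-flight request is freed */
                        return 0;
                }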

            niu Niu Yawei (Inactive) added a comment -

            So, examining the disconnect code, it looks like client_common_put_super assumes the mere call to obd_disconnect(sbi->ll_dt_exp); just marks the import disconnected, but if there are any requests in flight (highly likely if you have a broken connection and requests take seconds to timeout), then the actual final import put would not happen until this last request is finished (every request holds an import reference), and only then the final class_import_put() would happen that would call obd_zombie_import_add() increasing the zombie task list count and would stall obd_zombie_barrier().
            So the "fix" for LU-2543 really failed to consider this scenario of inflight requests for all imports.

            Will an inflight RPC hold the OSC export refcount as well? I was thinking that obd_disconnect() in client_common_put_super() would put the last refcount of the OSC export and make the umount wait in obd_zombie_barrier().
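
            As a rough sketch of the barrier semantics in question (illustrative pthread code under assumed names, not the actual obd_zombie_* implementation), obd_zombie_barrier() can only wait for cleanups that have already been queued:

                #include <pthread.h>

                static pthread_mutex_t zombie_lock = PTHREAD_MUTEX_INITIALIZER;
                static pthread_cond_t  zombie_done = PTHREAD_COND_INITIALIZER;
                static int zombie_count;

                static void zombie_add(void)   /* cf. obd_zombie_import_add() */
                {
                        pthread_mutex_lock(&zombie_lock);
                        zombie_count++;
                        pthread_mutex_unlock(&zombie_lock);
                }

                static void zombie_cull(void)  /* cf. obd_zombie_impexp_cull() */
                {
                        pthread_mutex_lock(&zombie_lock);
                        if (--zombie_count == 0)
                                pthread_cond_broadcast(&zombie_done);
                        pthread_mutex_unlock(&zombie_lock);
                }

                static void zombie_barrier(void)  /* cf. obd_zombie_barrier() */
                {
                        pthread_mutex_lock(&zombie_lock);
                        while (zombie_count != 0)
                                pthread_cond_wait(&zombie_done, &zombie_lock);
                        pthread_mutex_unlock(&zombie_lock);
                }

                int main(void)
                {
                        zombie_add();      /* an import queued for cleanup... */
                        zombie_cull();     /* ...and reaped by the zombie thread */
                        zombie_barrier();  /* returns: the count is back to zero */

                        /* The failure mode discussed above: if the final import
                         * put (and hence zombie_add()) is deferred until
                         * in-flight RPCs drain, the barrier sees count == 0 and
                         * returns at once, so umount can free the sbi before
                         * the OSC cleanup ever runs. */
                        return 0;
                }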

            green Oleg Drokin added a comment -

            So, poking around in the crashdump, it looks like it is indeed something very similar to LU-2543.
            What we see in the log is that the filesystem named nbp6 was being unmounted while there were communication problems with its OSTs (probably the network hiccup mentioned above).
            By the time the crash happened, nbp6 was already unmounted and the sbi structure freed, but OSC cleanups were still ongoing, and those do access the contents of the sbi struct (its ll_cache member). Since it contains garbage, the attempt to take a spinlock fails.
            This is evident since the only two Lustre filesystems still mounted are nbp5 and nbp9.

            So, examining the disconnect code, it looks like client_common_put_super assumes the mere call to obd_disconnect(sbi->ll_dt_exp); just marks the import disconnected, but if there are any requests in flight (highly likely if you have a broken connection and requests take seconds to timeout), then the actual final import put would not happen until this last request is finished (every request holds an import reference), and only then the final class_import_put() would happen that would call obd_zombie_import_add() increasing the zombie task list count and would stall obd_zombie_barrier().

            So the "fix" for LU-2543 really failed to consider this scenario of inflight requests for all imports.
            I see that ll_cache itself has a refcounter inside it, and perhaps that would be a much better proxy for determining when it is safe to free the sbi struct. Niu?

            Actually, I guess that would lead to the unmount hanging until all requests finish processing, which might not be ideal either in the face of a broken connection, so potentially the sbi freeing could be made asynchronous too.
            This bug exists in master too, btw.
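
            A minimal sketch of that asynchronous direction, assuming a hypothetical client_cache type with cache_get()/cache_put() helpers (ccc_users is borrowed here only as an illustrative field name, and none of these are the actual Lustre symbols): umount just drops its reference, and whoever drops the last reference frees the cache:

                #include <stdatomic.h>
                #include <stdio.h>
                #include <stdlib.h>

                /* Illustrative stand-in for the refcounted client cache. */
                struct client_cache {
                        atomic_int ccc_users;
                        /* ... LRU lists, locks, statistics ... */
                };

                static struct client_cache *cache_get(struct client_cache *c)
                {
                        atomic_fetch_add(&c->ccc_users, 1);
                        return c;
                }

                static void cache_put(struct client_cache *c)
                {
                        /* The last user frees the cache, so it can never be
                         * freed while an OSC cleanup is still touching it. */
                        if (atomic_fetch_sub(&c->ccc_users, 1) == 1) {
                                printf("last user gone, freeing client cache\n");
                                free(c);
                        }
                }

                int main(void)
                {
                        struct client_cache *c = malloc(sizeof(*c));
                        atomic_init(&c->ccc_users, 1);           /* superblock ref */

                        struct client_cache *osc = cache_get(c); /* OSC setup */

                        cache_put(c);   /* umount drops its ref and returns
                                         * without waiting for OSC cleanup */
                        cache_put(osc); /* later: osc_cleanup() drops the last
                                         * ref and only now is the cache freed */
                        return 0;
                }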

            jaylan Jay Lan (Inactive) added a comment -

            Hi Oleg,

            The Lustre client debuginfo rpm has been uploaded to ftp.whamcloud.com. I appended ".LU-6173" to the end of the rpm file name.

    People

      Assignee: emoly.liu Emoly Liu
      Reporter: jaylan Jay Lan (Inactive)
      Votes: 0
      Watchers: 7

    Dates

      Created:
      Updated:
      Resolved: