Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0, Lustre 2.4.3, Lustre 2.5.3
    • Labels: None
    • Environment: Git repo can be found at https://github.com/jlan/lustre-nas
      Server: CentOS 6.4, kernel 2.6.32-358.23.2.el6, Lustre 2.4.3-12nasS
      Client: SLES 11 SP3, kernel 3.0.101-0.31.1, Lustre 2.4.3-11nasC
    • Severity: 3
    • 17274

    Description

      Yesterday we experienced a network problem, and consequently a number of clients stalled; at least four hung in this situation. We captured a vmcore on one of the systems.

      Console logs showed one of the CPUs was detected to stall:
      "INFO: rcu_sched_state detected stall on CPU 9."

      All CPUs on r305i7n2 except CPU 9 were running the migration process, and
      the CPU flagged by rcu_sched_state was running obd_zombid.
      The console logs of the other three systems confirmed that the stalled CPU
      was also running obd_zombid, but without a vmcore I cannot say for sure
      that the other CPUs were running 'migration' as on r305i7n2.

      The stack trace is:

      PID: 5070 TASK: ffff88046f086300 CPU: 9 COMMAND: "obd_zombid"
      #0 [ffff88087fc27e40] crash_nmi_callback at ffffffff810245af
      #1 [ffff88087fc27e50] notifier_call_chain at ffffffff81475847
      #2 [ffff88087fc27e80] __atomic_notifier_call_chain at ffffffff8147588d
      #3 [ffff88087fc27e90] notify_die at ffffffff814758dd
      #4 [ffff88087fc27ec0] default_do_nmi at ffffffff81472d37
      #5 [ffff88087fc27ee0] do_nmi at ffffffff81472f68
      #6 [ffff88087fc27ef0] restart_nmi at ffffffff814724b1
      [exception RIP: native_halt+1]
      RIP: ffffffff810300b1 RSP: ffff88087fc23de0 RFLAGS: 00000082
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000080f
      RDX: 0000000000000000 RSI: 00000000000000ff RDI: 000000000000080f
      RBP: ffff88046d96fd78 R8: 0000000000000150 R9: ffffe8ffffc20738
      R10: 0000000000000006 R11: ffffffff8102b430 R12: 0000000000000000
      R13: 0000000000000006 R14: 0000000000000006 R15: 00000000fffffffb
      ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
      --- <NMI exception stack> ---
      #7 [ffff88087fc23de0] native_halt at ffffffff810300b1
      #8 [ffff88087fc23de0] halt_current_cpu at ffffffff81024959
      #9 [ffff88087fc23df0] lkdb_main_loop at ffffffff812548ec
      #10 [ffff88087fc23ef0] kdba_main_loop at ffffffff8139bef2
      #11 [ffff88087fc23f20] kdb at ffffffff8125199f
      #12 [ffff88087fc23f80] kdb_ipi at ffffffff8124ea07
      #13 [ffff88087fc23f90] smp_kdb_interrupt at ffffffff8139b656
      #14 [ffff88087fc23fb0] kdb_interrupt at ffffffff8147aca3
      --- <IRQ stack> ---
      #15 [ffff88046d96fd78] kdb_interrupt at ffffffff8147aca3
      [exception RIP: _raw_spin_lock+24]
      RIP: ffffffff81471a88 RSP: ffff88046d96fe28 RFLAGS: 00000206
      RAX: 0000000000001700 RBX: ffff880867d28810 RCX: ffff880856c3be00
      RDX: 0000000000008000 RSI: ffff880856c3be00 RDI: ffff880430b100f8
      RBP: ffff880864634078 R8: 0000000000000002 R9: 0000000000000000
      R10: 0000000010000008 R11: 0000000000000000 R12: ffffffff8147ac9e
      R13: ffffffff811458be R14: ffff880867d28810 R15: 0000000000000206
      ORIG_RAX: ffffffffffffff01 CS: 0010 SS: 0018
      #16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
      #17 [ffff88046d96fe38] class_decref at ffffffffa076eed4 [obdclass]
      #18 [ffff88046d96fea8] class_export_destroy at ffffffffa074c1de [obdclass]
      #19 [ffff88046d96fec8] obd_zombie_impexp_cull at ffffffffa074c61d [obdclass]
      #20 [ffff88046d96fee8] obd_zombie_impexp_thread at ffffffffa074c7bd [obdclass]
      #21 [ffff88046d96ff48] kernel_thread_helper at ffffffff8147aae4

    Attachments

    Issue Links

    Activity
            [LU-6173] CPU stalled with obd_zombid running
            pjones Peter Jones added a comment -

            Landed for 2.8

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13746/
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 302c5bfebe61e988dbd27063becc4ef90befc6df

            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13746
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 570a48915a6935b8d180dafded4befaa2447b585

            emoly.liu Emoly Liu added a comment -

            Peter, yes both master and b2_5 need the patch. I will create one for master later.

            pjones Peter Jones added a comment -

            Emoly

            Is this patch also required for master/b2_5?

            Peter

            emoly.liu Emoly Liu added a comment -

            Thanks for Niu's and Oleg's help! I pushed a patch for b2_4 for review.

            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13727
            Subject: LU-6173 llite: allocate and free client cache asynchronously
            Project: fs/lustre-release
            Branch: b2_4
            Current Patch Set: 1
            Commit: ae23e1e99d072c3865ca2da538705eb61fc6c7c2

            green Oleg Drokin added a comment -

            Niu: It's right in the __ptlrpc_request_alloc():

                            request->rq_import = class_import_get(imp);
            

            and the import stays put until all requests are drained, which might take a while if the requests are stuck on the network.
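
            For illustration, here is a minimal standalone sketch of that refcounting pattern; struct import, import_get()/import_put() and zombie_import_add() are hypothetical stand-ins for Lustre's obd_import, class_import_get()/class_import_put() and obd_zombie_import_add(), not the actual kernel code:

                #include <stdatomic.h>
                #include <stdio.h>
                #include <stdlib.h>

                /* Hypothetical stand-in for struct obd_import. */
                struct import {
                        atomic_int refcount;
                };

                struct request {
                        struct import *rq_import;
                };

                static struct import *import_get(struct import *imp)
                {
                        atomic_fetch_add(&imp->refcount, 1);
                        return imp;
                }

                static void zombie_import_add(struct import *imp)
                {
                        /* In Lustre this queues the import for obd_zombid
                         * to reap; here we simply free it. */
                        printf("last reference dropped, queueing zombie cleanup\n");
                        free(imp);
                }

                static void import_put(struct import *imp)
                {
                        if (atomic_fetch_sub(&imp->refcount, 1) == 1)
                                zombie_import_add(imp);
                }

                /* Mirrors the quoted line from __ptlrpc_request_alloc():
                 * every request takes its own reference on the import. */
                static struct request *request_alloc(struct import *imp)
                {
                        struct request *req = malloc(sizeof(*req));
                        req->rq_import = import_get(imp);
                        return req;
                }

                static void request_free(struct request *req)
                {
                        import_put(req->rq_import);
                        free(req);
                }

                int main(void)
                {
                        struct import *imp = malloc(sizeof(*imp));
                        atomic_init(&imp->refcount, 1);  /* owner's reference */

                        struct request *req = request_alloc(imp);

                        import_put(imp);   /* "disconnect": owner drops its ref... */
                        request_free(req); /* ...but the import is only reaped when
                                            * the last in-flight request is freed */
                        return 0;
                }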

            niu Niu Yawei (Inactive) added a comment -

            So, examining the disconnect code, it looks like client_common_put_super assumes the mere call to obd_disconnect(sbi->ll_dt_exp); just marks the import disconnected, but if there are any requests in flight (highly likely if you have a broken connection and requests take seconds to timeout), then the actual final import put would not happen until this last request is finished (every request holds an import reference), and only then the final class_import_put() would happen that would call obd_zombie_import_add() increasing the zombie task list count and would stall obd_zombie_barrier().
            So the "fix" for LU-2543 really failed to consider this scenario of inflight requests for all imports.

            Will an inflight RPC hold the OSC export refcount as well? I was thinking that obd_disconnect() in client_common_put_super() would put the last refcount of the OSC export and make the umount wait in obd_zombie_barrier().
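
            As a rough sketch of the barrier semantics in question (illustrative pthread code under assumed names, not the actual obd_zombie_* implementation), obd_zombie_barrier() can only wait for cleanups that have already been queued:

                #include <pthread.h>

                static pthread_mutex_t zombie_lock = PTHREAD_MUTEX_INITIALIZER;
                static pthread_cond_t  zombie_done = PTHREAD_COND_INITIALIZER;
                static int zombie_count;

                static void zombie_add(void)   /* cf. obd_zombie_import_add() */
                {
                        pthread_mutex_lock(&zombie_lock);
                        zombie_count++;
                        pthread_mutex_unlock(&zombie_lock);
                }

                static void zombie_cull(void)  /* cf. obd_zombie_impexp_cull() */
                {
                        pthread_mutex_lock(&zombie_lock);
                        if (--zombie_count == 0)
                                pthread_cond_broadcast(&zombie_done);
                        pthread_mutex_unlock(&zombie_lock);
                }

                static void zombie_barrier(void)  /* cf. obd_zombie_barrier() */
                {
                        pthread_mutex_lock(&zombie_lock);
                        while (zombie_count != 0)
                                pthread_cond_wait(&zombie_done, &zombie_lock);
                        pthread_mutex_unlock(&zombie_lock);
                }

                int main(void)
                {
                        zombie_add();      /* an import queued for cleanup... */
                        zombie_cull();     /* ...and reaped by the zombie thread */
                        zombie_barrier();  /* returns: the count is back to zero */

                        /* The failure mode discussed above: if the final import
                         * put (and hence zombie_add()) is deferred until
                         * in-flight RPCs drain, the barrier sees count == 0 and
                         * returns at once, so umount can free the sbi before
                         * the OSC cleanup ever runs. */
                        return 0;
                }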

            green Oleg Drokin added a comment -

            So, poking around in the crashdump, it looks like it is indeed something very similar to LU-2543.
            What we see in the log is that the filesystem named nbp6 was being unmounted while there were communication problems with its OSTs (probably the network hiccup mentioned above).
            By the time the crash happened, nbp6 was already unmounted and the sbi structure freed, but OSC cleanups were still ongoing, and those do access the contents of the sbi struct (its ll_cache member). Since it contains garbage, the attempt to take a spinlock fails.
            This is evident since the only two Lustre filesystems still mounted are nbp5 and nbp9.

            So, examining the disconnect code, it looks like client_common_put_super assumes the mere call to obd_disconnect(sbi->ll_dt_exp); just marks the import disconnected, but if there are any requests in flight (highly likely if you have a broken connection and requests take seconds to timeout), then the actual final import put would not happen until this last request is finished (every request holds an import reference), and only then the final class_import_put() would happen that would call obd_zombie_import_add() increasing the zombie task list count and would stall obd_zombie_barrier().

            So the "fix" for LU-2543 really failed to consider this scenario of inflight requests for all imports.
            I see that ll_cache itself has a refcounter inside it, and perhaps that would be a much better proxy for determining when it is safe to free the sbi struct. Niu?

            Actually, I guess that would lead to the unmount hanging until all requests finish processing, which might not be ideal either in the face of a broken connection, so potentially the sbi freeing could be made asynchronous too.
            This bug exists in master too, btw.
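
            A minimal sketch of that asynchronous direction, assuming a hypothetical client_cache type with cache_get()/cache_put() helpers (ccc_users is borrowed here only as an illustrative field name, and none of these are the actual Lustre symbols): umount just drops its reference, and whoever drops the last reference frees the cache:

                #include <stdatomic.h>
                #include <stdio.h>
                #include <stdlib.h>

                /* Illustrative stand-in for the refcounted client cache. */
                struct client_cache {
                        atomic_int ccc_users;
                        /* ... LRU lists, locks, statistics ... */
                };

                static struct client_cache *cache_get(struct client_cache *c)
                {
                        atomic_fetch_add(&c->ccc_users, 1);
                        return c;
                }

                static void cache_put(struct client_cache *c)
                {
                        /* The last user frees the cache, so it can never be
                         * freed while an OSC cleanup is still touching it. */
                        if (atomic_fetch_sub(&c->ccc_users, 1) == 1) {
                                printf("last user gone, freeing client cache\n");
                                free(c);
                        }
                }

                int main(void)
                {
                        struct client_cache *c = malloc(sizeof(*c));
                        atomic_init(&c->ccc_users, 1);           /* superblock ref */

                        struct client_cache *osc = cache_get(c); /* OSC setup */

                        cache_put(c);   /* umount drops its ref and returns
                                         * without waiting for OSC cleanup */
                        cache_put(osc); /* later: osc_cleanup() drops the last
                                         * ref and only now is the cache freed */
                        return 0;
                }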

            jaylan Jay Lan (Inactive) added a comment -

            Hi Oleg,

            The Lustre client debuginfo rpm has been uploaded to ftp.whamcloud.com. I appended ".LU-6173" to the end of the rpm file name.

    People

      Assignee: emoly.liu Emoly Liu
      Reporter: jaylan Jay Lan (Inactive)
      Votes: 0
      Watchers: 7

    Dates

      Created:
      Updated:
      Resolved: