[LU-6173] CPU stalled with obd_zombid running Created: 28/Jan/15  Updated: 14/Jun/18  Resolved: 25/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.4.3, Lustre 2.5.3
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Jay Lan (Inactive) Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Git repo can be found at https://github.com/jlan/lustre-nas
Server: centos 6.4 2.6.32-358.23.2.el6, lustre 2.4.3-12nasS
Client: sles11sp3 3.0.101-0.31.1, lustre 2.4.3-11nasC


Attachments: File LU6173.crash-analysis.tgz     File r305i7n2-20150128.bz2    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 17274

 Description   

Yesterday we experienced a network problem. Consequently, a number of clients stalled; at least four hung in this situation. We captured a vmcore on one of the systems.

Console logs showed one of the CPUs was detected to stall:
"INFO: rcu_sched_state detected stall on CPU 9."

All CPUs on r305i7n2 except CPU 9 were running the migration process, and the CPU flagged by rcu_sched_state was running obd_zombid. The console logs of the other three systems confirmed that their stalled CPUs were also running obd_zombid, but without a vmcore I cannot say for sure that their other CPUs were running 'migration' as on r305i7n2.

The stack trace is:

PID: 5070 TASK: ffff88046f086300 CPU: 9 COMMAND: "obd_zombid"
#0 [ffff88087fc27e40] crash_nmi_callback at ffffffff810245af
#1 [ffff88087fc27e50] notifier_call_chain at ffffffff81475847
#2 [ffff88087fc27e80] __atomic_notifier_call_chain at ffffffff8147588d
#3 [ffff88087fc27e90] notify_die at ffffffff814758dd
#4 [ffff88087fc27ec0] default_do_nmi at ffffffff81472d37
#5 [ffff88087fc27ee0] do_nmi at ffffffff81472f68
#6 [ffff88087fc27ef0] restart_nmi at ffffffff814724b1
[exception RIP: native_halt+1]
RIP: ffffffff810300b1 RSP: ffff88087fc23de0 RFLAGS: 00000082
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000080f
RDX: 0000000000000000 RSI: 00000000000000ff RDI: 000000000000080f
RBP: ffff88046d96fd78 R8: 0000000000000150 R9: ffffe8ffffc20738
R10: 0000000000000006 R11: ffffffff8102b430 R12: 0000000000000000
R13: 0000000000000006 R14: 0000000000000006 R15: 00000000fffffffb
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#7 [ffff88087fc23de0] native_halt at ffffffff810300b1
#8 [ffff88087fc23de0] halt_current_cpu at ffffffff81024959
#9 [ffff88087fc23df0] lkdb_main_loop at ffffffff812548ec
#10 [ffff88087fc23ef0] kdba_main_loop at ffffffff8139bef2
#11 [ffff88087fc23f20] kdb at ffffffff8125199f
#12 [ffff88087fc23f80] kdb_ipi at ffffffff8124ea07
#13 [ffff88087fc23f90] smp_kdb_interrupt at ffffffff8139b656
#14 [ffff88087fc23fb0] kdb_interrupt at ffffffff8147aca3
--- <IRQ stack> ---
#15 [ffff88046d96fd78] kdb_interrupt at ffffffff8147aca3
[exception RIP: _raw_spin_lock+24]
RIP: ffffffff81471a88 RSP: ffff88046d96fe28 RFLAGS: 00000206
RAX: 0000000000001700 RBX: ffff880867d28810 RCX: ffff880856c3be00
RDX: 0000000000008000 RSI: ffff880856c3be00 RDI: ffff880430b100f8
RBP: ffff880864634078 R8: 0000000000000002 R9: 0000000000000000
R10: 0000000010000008 R11: 0000000000000000 R12: ffffffff8147ac9e
R13: ffffffff811458be R14: ffff880867d28810 R15: 0000000000000206
ORIG_RAX: ffffffffffffff01 CS: 0010 SS: 0018
#16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
#17 [ffff88046d96fe38] class_decref at ffffffffa076eed4 [obdclass]
#18 [ffff88046d96fea8] class_export_destroy at ffffffffa074c1de [obdclass]
#19 [ffff88046d96fec8] obd_zombie_impexp_cull at ffffffffa074c61d [obdclass]
#20 [ffff88046d96fee8] obd_zombie_impexp_thread at ffffffffa074c7bd [obdclass]
#21 [ffff88046d96ff48] kernel_thread_helper at ffffffff8147aae4



 Comments   
Comment by Peter Jones [ 29/Jan/15 ]

Emoly

Could you please advise?

Peter

Comment by Emoly Liu [ 03/Feb/15 ]

Jay, could you please upload the full vmcore file for further investigation? Thanks.

Comment by Emoly Liu [ 03/Feb/15 ]

BTW, do you have the dmesg log? I want to know what happened during the network problem.

Comment by Jay Lan (Inactive) [ 03/Feb/15 ]

File r305i7n2-20150128.bz2 contains the console log from when the system got into trouble. Since stack traces of all CPUs were printed every 10 minutes, the network error messages were flushed out of the dmesg buffer, so I attached the console log here instead.

User IDs in the log have been replaced with 'xxx'.

The vmcore can only be seen by US citizens. I can send crash analysis information to you if that is OK. Otherwise, I will consult with others on how to send you an encrypted vmcore. Please advise.

Comment by Emoly Liu [ 05/Feb/15 ]

Jay, I am not a US citizen. You can send the crash analysis information to me first. If necessary, I will ask one of my US citizen colleagues to help.

Comment by Jay Lan (Inactive) [ 05/Feb/15 ]

This tarball contains output of 'bt -a', 'ps -a', 'kmem -i' and 'kmem -s'.

Let me know if you want me to provide output from other crash commands.

Comment by Oleg Drokin [ 05/Feb/15 ]

Hm, this really reminds me of LU-2543, which I was hitting in pre-2.4 times, but we had a pretty effective patch for that, which you do have in your tree.
I wonder if there was another corner case left that was closed later.

Comment by Jay Lan (Inactive) [ 05/Feb/15 ]

Slightly different. LU-2543 resulted in a kernel panic, while we experienced a system hang in this case.

Comment by Oleg Drokin [ 06/Feb/15 ]

Well, you see, I hit a kernel panic in LU-2543 because I was running a special kernel with a lot of extra debugging turned on. So every time there is an access to freed memory, it'll crash; if there is an invalid spinlock magic, it'll crash; and so on.
Those options are very heavyweight, so nobody runs with them in production.
So in your kernel, if you encounter such a spinlock with random garbage in memory (there is not even a magic value in the non-debug code, so that it can be faster), it'll just take that garbage at face value and wait until the other (imagined) holder goes away. Since there is nobody actually holding that spinlock, the code will wait there forever; this is what we see in your backtrace.

And that's why I think it's something very similar.
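
For context, here is a minimal userspace sketch (not the kernel's actual ticket-lock code) of why a non-debug spinlock hangs forever on freed memory: there is no magic value to validate, so stale garbage simply looks like a held lock.

    #include <stdatomic.h>
    #include <sched.h>

    /* A non-debug spinlock is just a word; nothing validates it. */
    struct spinlock {
            atomic_uint locked;
    };

    static void spin_lock(struct spinlock *lk)
    {
            unsigned int expected = 0;

            /* If lk points into freed memory that now holds nonzero
             * garbage, the lock never looks free, so this loop spins
             * forever waiting for a holder that does not exist --
             * the state the NMI backtrace shows in _raw_spin_lock. */
            while (!atomic_compare_exchange_weak(&lk->locked, &expected, 1)) {
                    expected = 0;
                    sched_yield();
            }
    }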

Comment by Emoly Liu [ 06/Feb/15 ]

I suspected LU-2543 too, but its patch has already been included. This issue needs more investigation.

Comment by Oleg Drokin [ 06/Feb/15 ]

Yes, it does need more investigation, no question about that.
It does look like another case of use-after-free to me.
I need the Lustre kernel modules with debug info from your build to get some more information, please.

Comment by Jay Lan (Inactive) [ 06/Feb/15 ]

Hi Oleg,

The Lustre client debuginfo rpm has been uploaded to ftp.whamcloud.com. I appended ".LU-6173" to the end of the rpm file name.

Comment by Oleg Drokin [ 10/Feb/15 ]

So, poking around in the crashdump, it looks like it is indeed something very similar to LU-2543.
What we are seeing in the log is that the filesystem named nbp6 was being unmounted while there were communication problems with its OSTs (probably the network hiccup mentioned).
By the time the crash happened, nbp6 had already been unmounted and the sbi structure freed, but OSC cleanups were still ongoing, and those access the contents of the sbi struct (its ll_cache member). Since that memory now contains garbage, the attempt to get a spinlock fails.
This is evident since the only two Lustre filesystems left mounted are nbp5 and nbp9.

So, examining the disconnect code, it looks like client_common_put_super assumes that the mere call to obd_disconnect(sbi->ll_dt_exp) marks the import disconnected. But if there are any requests in flight (highly likely if you have a broken connection and requests take seconds to time out), the actual final import put does not happen until that last request is finished (every request holds an import reference). Only then does the final class_import_put() happen, which calls obd_zombie_import_add(), increasing the zombie task list count and stalling obd_zombie_barrier().

So the "fix" for LU-2543 really failed to consider this scenario of in-flight requests for all imports.
I see that ll_cache itself has a refcounter inside it, and perhaps that might be a much better proxy for determining when it is safe to free the sbi struct. Niu?

Actually, I guess that would lead to the unmount hanging until all requests finish processing, which might not be ideal either in the face of a broken connection, so potentially the sbi freeing could be made asynchronous too.
This bug exists in master too, btw.
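
To make the lifetime issue concrete, here is a minimal sketch (hypothetical names, not the actual patch) of the refcounted-cache idea: allocate the client cache separately from ll_sb_info and free it only when the last user drops its reference, so a late OSC cleanup never dereferences freed sbi memory.

    #include <linux/slab.h>
    #include <linux/atomic.h>
    #include <linux/spinlock.h>

    /* Separately allocated client cache; one reference per user
     * (the sbi plus each OSC that uses the cache). */
    struct client_cache {
            atomic_t        cc_users;
            spinlock_t      cc_lock;
            /* ... LRU state ... */
    };

    static struct client_cache *client_cache_get(struct client_cache *cache)
    {
            atomic_inc(&cache->cc_users);
            return cache;
    }

    static void client_cache_put(struct client_cache *cache)
    {
            /* The last put frees the cache, so umount can return without
             * waiting for in-flight requests, and a later OSC cleanup
             * still holds valid memory instead of garbage. */
            if (atomic_dec_and_test(&cache->cc_users))
                    kfree(cache);
    }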

Comment by Niu Yawei (Inactive) [ 10/Feb/15 ]

So, examining the disconnect code, it looks like client_common_put_super assumes that the mere call to obd_disconnect(sbi->ll_dt_exp) marks the import disconnected. But if there are any requests in flight (highly likely if you have a broken connection and requests take seconds to time out), the actual final import put does not happen until that last request is finished (every request holds an import reference). Only then does the final class_import_put() happen, which calls obd_zombie_import_add(), increasing the zombie task list count and stalling obd_zombie_barrier().
So the "fix" for LU-2543 really failed to consider this scenario of in-flight requests for all imports.

Will the inflight RPC hold the OSC export refcount as well? I was thinking that obd_disconnect() in client_common_put_super() would put the last refcount of the OSC export and make the umount wait in obd_zombie_barrier().

Comment by Oleg Drokin [ 10/Feb/15 ]

Niu: It's right there in __ptlrpc_request_alloc():

                request->rq_import = class_import_get(imp);

and the import stays put until all requests are drained, which might take a while if the requests are stuck on the network.
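
A simplified sketch of the reference flow described above (req_alloc/req_free and their helpers are hypothetical stand-ins; class_import_get/put and obd_zombie_import_add are the real names from this thread):

    /* Each request pins the import for its whole lifetime. */
    struct ptlrpc_request *req_alloc(struct obd_import *imp)
    {
            struct ptlrpc_request *req = alloc_request();   /* hypothetical */

            req->rq_import = class_import_get(imp);         /* +1 import ref */
            return req;
    }

    void req_free(struct ptlrpc_request *req)
    {
            /* When this drops the final reference, class_import_put()
             * calls obd_zombie_import_add(); with requests stuck on a
             * dead network, that can happen long after obd_disconnect()
             * has already returned to umount. */
            class_import_put(req->rq_import);
            free_request(req);                              /* hypothetical */
    }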

Comment by Gerrit Updater [ 11/Feb/15 ]

Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13727
Subject: LU-6173 llite: allocate and free client cache asynchronously
Project: fs/lustre-release
Branch: b2_4
Current Patch Set: 1
Commit: ae23e1e99d072c3865ca2da538705eb61fc6c7c2

Comment by Emoly Liu [ 11/Feb/15 ]

Thanks to Niu and Oleg for their help! I pushed a patch to b2_4 for review.

Comment by Peter Jones [ 11/Feb/15 ]

Emoly

Is this patch also required for master/b2_5?

Peter

Comment by Emoly Liu [ 12/Feb/15 ]

Peter, yes, both master and b2_5 need the patch. I will create one for master later.

Comment by Gerrit Updater [ 12/Feb/15 ]

Emoly Liu (emoly.liu@intel.com) uploaded a new patch: http://review.whamcloud.com/13746
Subject: LU-6173 llite: allocate and free client cache asynchronously
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 570a48915a6935b8d180dafded4befaa2447b585

Comment by Gerrit Updater [ 03/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13746/
Subject: LU-6173 llite: allocate and free client cache asynchronously
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 302c5bfebe61e988dbd27063becc4ef90befc6df

Comment by Peter Jones [ 25/May/15 ]

Landed for 2.8

Comment by Jay Lan (Inactive) [ 26/May/15 ]

Could you provide a 2.5 backport? Thanks!

Comment by Peter Jones [ 27/May/15 ]

Yes this is being worked on
