[LU-2327] Clients stuck in mount Created: 14/Nov/12  Updated: 15/Apr/20  Resolved: 15/Apr/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: Alex Zhuravlev
Resolution: Cannot Reproduce Votes: 0
Labels: llnl, ptr
Environment:

Sequoia, 2.3.54-2chaos, github.com/chaos/lustre


Attachments: Text File lustre-log.1357681275.3261.txt     Text File seqio162_console.txt     Text File seqio162_lustre_log.txt     File seqio542_console_sysrq.log     File seqio542_lustre.log     File seqio652_lustre.log     File seqio652_lustre2.log    
Issue Links:
Related
is related to LU-2572 Test LU-2263 and LU-2327 patches on S... Resolved
Sub-Tasks:
Key Summary Type Status Assignee
LU-2572 Test LU-2263 and LU-2327 patches on S... Technical task Resolved Prakash Surya  
Severity: 3
Rank (Obsolete): 5558

 Description   

We have another instance of clients getting stuck at mount time. Out of 768 clients, 766 mounted successfully and 2 appear to be stuck; so far it has been several minutes. Both had the same soft lockup messages on the console:

LustreError: 20578:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
BUG: soft lockup - CPU#33 stuck for 68s! [ll_ping:3244]
Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
NIP: c00000000042e190 LR: 8000000003a9ff2c CTR: c00000000042e160
REGS: c0000003ca60fba0 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
MSR: 0000000080029000 <EE,ME,CE>  CR: 84228424  XER: 20000000
TASK = c0000003ca51e600[3244] 'll_ping' THREAD: c0000003ca60c000 CPU: 33
GPR00: 0000000080000021 c0000003ca60fe20 c0000000006de510 c0000003c00c0a78
GPR04: 2222222222222222 0000000000000000 c0000003ca60fd38 0000000000000000
GPR08: 0000000000000000 0000000080000021 0000000000000000 c00000000042e160
GPR12: 8000000003aebda8 c00000000075d200
NIP [c00000000042e190] ._spin_lock+0x30/0x44
LR [8000000003a9ff2c] .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
Call Trace:
[c0000003ca60fe20] [8000000003a9fea0] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable)
[c0000003ca60ff90] [c00000000001a9e0] .kernel_thread+0x54/0x70
Instruction dump:
38000000 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0
4c00012c 2fa00000 4dfe0020 7c210b78 <80030000> 2fa00000 40defff4 7c421378
BUG: soft lockup - CPU#49 stuck for 68s! [ptlrpcd_rcv:3175]
Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
NIP: c00000000042e198 LR: 8000000003a66af4 CTR: c00000000042e160
REGS: c0000003e9bf3900 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
MSR: 0000000080029000 <EE,ME,CE>  CR: 84228444  XER: 20000000
TASK = c0000003ea1f0740[3175] 'ptlrpcd_rcv' THREAD: c0000003e9bf0000 CPU: 49
GPR00: 0000000080000021 c0000003e9bf3b80 c0000000006de510 c0000003c00c0a78
GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
GPR08: c0000003c00c0a60 0000000080000031 c00000036a060900 c00000000042e160
GPR12: 8000000003aebda8 c00000000076a200
NIP [c00000000042e198] ._spin_lock+0x38/0x44
LR [8000000003a66af4] .ptlrpc_check_set+0x4f4/0x4e80 [ptlrpc]
Call Trace:
[c0000003e9bf3b80] [8000000003a66964] .ptlrpc_check_set+0x364/0x4e80 [ptlrpc] (unreliable)
[c0000003e9bf3d20] [8000000003abd1cc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
[c0000003e9bf3e40] [8000000003abd6b8] .ptlrpcd+0x2b8/0x510 [ptlrpc]
[c0000003e9bf3f90] [c00000000001a9e0] .kernel_thread+0x54/0x70
Instruction dump:
812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c 2fa00000
4dfe0020 7c210b78 80030000 2fa00000 <40defff4> 7c421378 4bffffc8 7c0802a6

Both nodes are pretty much idle. I'll see what other information I can gather.



 Comments   
Comment by Christopher Morrone [ 14/Nov/12 ]

Actually, there is a pretty common low level of activity from ldlm_poold, and ptlrpcds appear now and then in top.

Comment by Christopher Morrone [ 14/Nov/12 ]

Attaching log files. The following files:

seqio652_lustre.log
seqio542_lustre.log

were just an "lctl dk" on the nodes that were hung in mount.

This file:

seqio652_lustre2.log

Is "lctl dk" after enabling full debugging and waiting a few seconds.

This file:

seqio542_console_sysrq.log

contains the console output captured after issuing both the "l" and "t" sysrqs.

Comment by Prakash Surya (Inactive) [ 14/Nov/12 ]

At first glance, this looks like a duplicate of LU-2263.

Comment by Christopher Morrone [ 14/Nov/12 ]

Perhaps related, but in this instance there is no significant lock contention when I log in. CPU usage is very low. My understanding from LU-2263 is that the node has very high load due to lock contention.

Comment by Prakash Surya (Inactive) [ 14/Nov/12 ]

But there definitely was significant lock contention at some point, 68 seconds worth according to the message. Perhaps you caught it and logged in just as it cleared up? Or I may have mis-characterized the "locking up" part in the description of LU-2263.

Either way, it appears to be contending on an (the same?) import lock as in LU-2263:

(gdb) l *ptlrpc_pinger_main+0x19c
0x7ff2c is in ptlrpc_pinger_main (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/pinger.c:230).
225     /builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/pinger.c: No such file or directory.
        in /builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/pinger.c

lustre/ptlrpc/pinger.c:

 225 static void ptlrpc_pinger_process_import(struct obd_import *imp,                
 226                                          unsigned long this_ping)               
 227 {                                                                               
 228         int force, level;                                                       
 229                                                                                 
 230         cfs_spin_lock(&imp->imp_lock);                                          
 231         level = imp->imp_state;                                                 
 232         force = imp->imp_force_verify;                                          
 233         if (force)                                                              
 234                 imp->imp_force_verify = 0;                                      
 235         cfs_spin_unlock(&imp->imp_lock);
Comment by Prakash Surya (Inactive) [ 14/Nov/12 ]

If this is deemed a different issue than LU-2263, I probably resolved LU-2141 incorrectly.

Comment by Peter Jones [ 15/Nov/12 ]

Alex will triage this one

Comment by Prakash Surya (Inactive) [ 20/Nov/12 ]

I'm looking at the seqio542_console_sysrq.log file..

The mount.lustre task is sleeping:

2012-11-14 14:32:30.029575 {DefaultControlEventListener} [mmcs]{549}.0.0: mount.lustre  D 00000fff7bd0903c     0 20578  20577 0x00000000
2012-11-14 14:32:30.029627 {DefaultControlEventListener} [mmcs]{549}.0.0: Call Trace:
2012-11-14 14:32:30.029679 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db2cc0] [c000000000687f48] svc_rdma_ops+0xa5a0/0x198c0 (unreliable)
2012-11-14 14:32:30.029731 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db2e90] [c000000000008de0] .__switch_to+0xc4/0x100
2012-11-14 14:32:30.029784 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db2f20] [c00000000042b0e0] .schedule+0x858/0x9c0
2012-11-14 14:32:30.029836 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db31d0] [c00000000042c16c] .__mutex_lock_slowpath+0x208/0x390
2012-11-14 14:32:30.029888 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db32d0] [c00000000042c30c] .mutex_lock+0x18/0x34
2012-11-14 14:32:30.029940 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3350] [8000000003a9ea34] .ptlrpc_pinger_add_import+0x94/0x3c0 [ptlrpc]
2012-11-14 14:32:30.029993 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3410] [8000000003a35974] .client_connect_import+0x3b4/0x5d0 [ptlrpc]
2012-11-14 14:32:30.030045 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db34f0] [8000000005143ce0] .lov_connect_obd+0x860/0x1590 [lov]
2012-11-14 14:32:30.030097 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3610] [8000000005144f44] .lov_connect+0x534/0xbb0 [lov]
2012-11-14 14:32:30.030150 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3750] [80000000069c603c] .ll_fill_super+0x799c/0xace0 [lustre]
2012-11-14 14:32:30.030203 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3900] [80000000024820c4] .lustre_fill_super+0x4c4/0x8e0 [obdclass]
2012-11-14 14:32:30.030255 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db39d0] [c0000000000d4aa0] .get_sb_nodev+0x84/0xe8
2012-11-14 14:32:30.030307 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3a80] [800000000245b5a8] .lustre_get_sb+0x28/0x40 [obdclass]
2012-11-14 14:32:30.030360 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3b10] [c0000000000d3244] .vfs_kern_mount+0x80/0x114
2012-11-14 14:32:30.030412 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3bc0] [c0000000000d3340] .do_kern_mount+0x58/0x130
2012-11-14 14:32:30.030464 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3c80] [c0000000000f20fc] .do_mount+0x8c8/0x984
2012-11-14 14:32:30.030516 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3d70] [c0000000000f2270] .SyS_mount+0xb8/0x124
2012-11-14 14:32:30.030568 {DefaultControlEventListener} [mmcs]{549}.0.0: [c000000387db3e30] [c000000000000580] syscall_exit+0x0/0x2c

The dump of the active CPUs shows two threads stuck on the import lock I detailed above. The "important" one (I think) is below:

2012-11-14 14:32:16.108436 {DefaultControlEventListener} [mmcs]{549}.0.1: SysRq : Show backtrace of all active CPUs
2012-11-14 14:32:16.148433 {DefaultControlEventListener} [mmcs]{549}.8.1: CPU33:
2012-11-14 14:32:16.188544 {DefaultControlEventListener} [mmcs]{549}.8.1: Call Trace:
2012-11-14 14:32:16.228443 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3b20] [c000000000008160] .show_stack+0x7c/0x184 (unreliable)
2012-11-14 14:32:16.268527 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3bd0] [c000000000275548] .showacpu+0x64/0x94
2012-11-14 14:32:16.308459 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3c60] [c000000000068114] .generic_smp_call_function_interrupt+0x11c/0x258
2012-11-14 14:32:16.348385 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3d40] [c00000000001c14c] .smp_message_recv+0x34/0x78
2012-11-14 14:32:16.388393 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3dc0] [c000000000024250] .bgq_ipi_dispatch+0x118/0x18c
2012-11-14 14:32:16.428248 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3e50] [c000000000079d28] .handle_IRQ_event+0x88/0x18c
2012-11-14 14:32:16.472316 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3f00] [c00000000007c8a0] .handle_percpu_irq+0x8c/0x100
2012-11-14 14:32:16.518329 {DefaultControlEventListener} [mmcs]{549}.8.1: [c00000000fef3f90] [c00000000001a848] .call_handle_irq+0x1c/0x2c
2012-11-14 14:32:16.558212 {DefaultControlEventListener} [mmcs]{549}.8.1: [c0000003ca60fa80] [c0000000000055a0] .do_IRQ+0x154/0x1e0
2012-11-14 14:32:16.598349 {DefaultControlEventListener} [mmcs]{549}.8.1: [c0000003ca60fb30] [c0000000000134dc] exc_external_input_book3e+0x110/0x114
2012-11-14 14:32:16.638096 {DefaultControlEventListener} [mmcs]{549}.8.1: --- Exception: 501 at ._spin_lock+0x2c/0x44
2012-11-14 14:32:16.678061 {DefaultControlEventListener} [mmcs]{549}.8.1:     LR = .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
2012-11-14 14:32:16.717910 {DefaultControlEventListener} [mmcs]{549}.8.1: [c0000003ca60fe20] [8000000003a9fea0] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable)
2012-11-14 14:32:16.757975 {DefaultControlEventListener} [mmcs]{549}.8.1: [c0000003ca60ff90] [c00000000001a9e0] .kernel_thread+0x54/0x70

So it looks like one thread is stuck in ptlrpc_pinger_process_import, spinning on imp->imp_lock while still holding the pinger_mutex. As a result, mount.lustre gets stuck in ptlrpc_pinger_add_import waiting to obtain the pinger_mutex held by that spinning thread.

The question that remains to be answered is: why is that thread spinning on imp->imp_lock in the first place?
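
For illustration, a minimal sketch of the lock dependency described above, using simplified stand-ins rather than the actual Lustre definitions (the real code is in lustre/ptlrpc/pinger.c and lustre/ptlrpc/client.c):

/* Illustrative only: simplified stand-ins showing the lock ordering
 * seen in the stacks above. */
#include <linux/mutex.h>
#include <linux/spinlock.h>

static DEFINE_MUTEX(pinger_mutex);

struct obd_import {
        spinlock_t imp_lock;
        int        imp_state;
        int        imp_force_verify;
};

/* Pinger thread: takes pinger_mutex, then spins on imp_lock.  If imp_lock
 * is never released, pinger_mutex is held forever. */
static void pinger_main_step(struct obd_import *imp)
{
        mutex_lock(&pinger_mutex);
        spin_lock(&imp->imp_lock);      /* <-- ll_ping spins here */
        imp->imp_force_verify = 0;
        spin_unlock(&imp->imp_lock);
        mutex_unlock(&pinger_mutex);
}

/* mount.lustre path: only needs pinger_mutex, but blocks behind the
 * pinger thread above. */
static void pinger_add_import_step(struct obd_import *imp)
{
        mutex_lock(&pinger_mutex);      /* <-- mount.lustre sleeps here */
        /* ... add the import to the pinger list ... */
        mutex_unlock(&pinger_mutex);
}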

Comment by Christopher Morrone [ 20/Nov/12 ]

I do not see any processes spinning when I run "top", so I don't think we have anyone stuck waiting on a spin lock. I may have just caught it when it happened to be grabbing the lock. Or the stack may not be entirely reliable.

Comment by Prakash Surya (Inactive) [ 20/Nov/12 ]

But.. But.. The console messages.. The kernel was just mistaken about the thread being stuck for so long?

BUG: soft lockup - CPU#33 stuck for 68s! [ll_ping:3244]

top and the console seem to be at odds.. Personally I'd trust the kernel messages, but I can't back up that claim.

Comment by Christopher Morrone [ 20/Nov/12 ]

soft lockups are often not a permanent situation. I don't suspect that ll_ping is permanently stuck. I saw it appear in top occasionally, as you normally would on an otherwise idle system. But I'm pretty sure it is not stuck permanently on a single spin lock.

Comment by Prakash Surya (Inactive) [ 20/Nov/12 ]

Liang, can you please look this one over?

Comment by Liang Zhen (Inactive) [ 21/Nov/12 ]

I think this is different from LU-2263. There are only two threads spinning on imp_lock: one is the pinger (ptlrpc_pinger_main), the other is ptlrpcd()->ptlrpc_check_set(). imp_lock is a heavily contended lock, but I can't explain why both threads are spinning on it while no thread actually holds it.
I will update you if I find something.

Comment by Alex Zhuravlev [ 07/Dec/12 ]

Christopher, Prakash, could you translate ptlrpc_check_set+0x4f4 to the source line, please?

Comment by Prakash Surya (Inactive) [ 07/Dec/12 ]

I'm taking this from the description in LU-2263:

(gdb) l *ptlrpc_check_set+0x4f4
0x46b04 is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/client.c:1852).
1849                         libcfs_nid2str(imp->imp_connection->c_peer.nid),
1850                         lustre_msg_get_opc(req->rq_reqmsg));
1851 
1852                 cfs_spin_lock(&imp->imp_lock);
1853                 /* Request already may be not on sending or delaying list. This
1854                  * may happen in the case of marking it erroneous for the case
1855                  * ptlrpc_import_delay_req(req, status) find it impossible to
Comment by Alex Zhuravlev [ 10/Dec/12 ]

I think this is just another symptom of LU-2263: the pinger got stuck waiting for imp_lock while holding pinger_mutex, and then mount got stuck waiting for pinger_mutex.

though one thing in the log doesn't quite fit the theory:
2012-11-14 14:08:38.748069 {DefaultControlEventListener} [mmcs]{549}.8.1: BUG: soft lockup - CPU#33 stuck for 68s! [ll_ping:3244]
...
2012-11-14 14:32:16.108436 {DefaultControlEventListener} [mmcs]{549}.0.1: SysRq : Show backtrace of all active CPUs

So we had no "soft lockup" messages for ~24 min, while they are supposed to repeat every ~67s if the thread is making no progress.

Also, I'd say there should be little or no contention at this point: we're in the middle of mount, and the only possible activity is connect and/or ping.

it'd be interesting to have a dump and see details of import/lock.

Comment by Prakash Surya (Inactive) [ 10/Dec/12 ]

We don't have kdump enabled on these systems, but I can try to collect a lustre debug log if that is helpful. Will that contain the information we need (i.e. who is holding the lock)?

Also, will the "soft lockup" message repeat for the same thread? I had thought it would only print once for each thread, I could be wrong though.

Comment by Alex Zhuravlev [ 10/Dec/12 ]

Yes, I verified that the message is printed multiple times with a simple while(1); in a module.

BUG: soft lockup - CPU#0 stuck for 67s! [osp-syn-0:3547]
...
BUG: soft lockup - CPU#0 stuck for 67s! [osp-syn-0:3547]
...
BUG: soft lockup - CPU#0 stuck for 67s! [osp-syn-0:3547]

By the way, was the kernel compiled with CONFIG_DEBUG_SPINLOCK? That would let us exclude at least some basic problems (like recursive locking).

Comment by Prakash Surya (Inactive) [ 10/Dec/12 ]

Unfortunately, it does not have CONFIG_DEBUG_SPINLOCK enabled. When trying to enable various kernel config options, I run into boot issues. So far, I've only been able to successfully enable CONFIG_DEBUG_SPINLOCK_SLEEP. I'll look into getting CONFIG_DEBUG_SPINLOCK enabled again, next time we get time on Sequoia.

Comment by Alex Zhuravlev [ 10/Dec/12 ]

please consider CONFIG_LOCK_STAT to collect stats on the locking.

Comment by Alex Zhuravlev [ 12/Dec/12 ]

http://review.whamcloud.com/#change,4808 - the purpose of this debug patch is to make sure this is not ptlrpc_check_set() itself taking too long due to lots of requests.

Comment by Prakash Surya (Inactive) [ 12/Dec/12 ]

Alex, I'll pull that in.

Comment by Christopher Morrone [ 19/Dec/12 ]

Ok, I believe that we're stuck in spin locks now. It turns out that "top" isn't reporting our spinning threads on the ppc64 nodes for some reason. But when I make my terminal tall enough to fit the 68 processors, I see 4 of the 68 processors at 100% usage. Two of the 4 match up with IBM processes that always spin by design (polling their network) on cpus 66 and 67. The other two match up with the cpus that sysrq-l lists as stuck in Lustre code:

2012-12-19 10:46:54.184688 {DefaultControlEventListener} [mmcs]{656}.0.2: --- Exception: 501 at ._spin_lock+0x30/0x44
2012-12-19 10:46:54.185089 {DefaultControlEventListener} [mmcs]{656}.0.2:     LR = .ptlrpc_check_set+0x50c/0x5010 [ptlrpc]
2012-12-19 10:46:54.185599 {DefaultControlEventListener} [mmcs]{656}.0.2: [c0000003e095fb70] [8000000003a19bc4] .ptlrpc_check_set+0x374/0x5010 [ptlrpc] (unreliable)
2012-12-19 10:46:54.186145 {DefaultControlEventListener} [mmcs]{656}.0.2: [c0000003e095fd20] [8000000003a7030c] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
2012-12-19 10:46:54.186599 {DefaultControlEventListener} [mmcs]{656}.0.2: [c0000003e095fe40] [8000000003a707f8] .ptlrpcd+0x2b8/0x510 [ptlrpc]
2012-12-19 10:46:54.187056 {DefaultControlEventListener} [mmcs]{656}.0.2: [c0000003e095ff90] [c00000000001b9e0] .kernel_thread+0x54/0x70

and

2012-12-19 10:46:54.193259 {DefaultControlEventListener} [mmcs]{656}.1.1: --- Exception: 501 at ._spin_lock+0x38/0x44
2012-12-19 10:46:54.193759 {DefaultControlEventListener} [mmcs]{656}.1.1:     LR = .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
2012-12-19 10:46:54.194223 {DefaultControlEventListener} [mmcs]{656}.1.1: [c0000003c18fbe20] [8000000003a52fc0] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable)
2012-12-19 10:46:54.194700 {DefaultControlEventListener} [mmcs]{656}.1.1: [c0000003c18fbf90] [c00000000001b9e0] .kernel_thread+0x54/0x70

The former process is ptlrpcd_rcv and the latter is ll_ping.

These are, not at all coincidentally, the two processes for which there were soft lockup reports on the console:

BUG: soft lockup - CPU#2 stuck for 67s! [ptlrpcd_rcv:3361]
BUG: soft lockup - CPU#5 stuck for 67s! [ll_ping:3378]

Alex, we have the 4808 patch applied, but I do not see the message from that on the console.

Comment by Alex Zhuravlev [ 20/Dec/12 ]

ok, thanks for the update. could you verify with gdb this is exactly the same place again?

Comment by Christopher Morrone [ 20/Dec/12 ]

Yes, ll_ping is stuck in the first spin_lock() at the beginning of ptlrpc_pinger_process_import, and ptlrpcd_rcv is stuck in the spin_lock() in ptlrpc_check_set() right before the "/* Request already may be not" comment.

The same places that Prakash noted in earlier comments.

Comment by Alex Zhuravlev [ 20/Dec/12 ]

Got it, thanks. It would be ideal to try with a debug-enabled kernel; in the meantime I'll try to cook up a local debug patch. Sorry, but I still have no good idea what's happening here.

Comment by Christopher Morrone [ 20/Dec/12 ]

We have about as much kernel debugging enabled as we can. Crash doesn't work on these ppc64 nodes, and the kernel won't boot when many of the kernel debugging options are enabled. We'll add anything else that we can get to work, but I wouldn't plan on getting more.

Comment by Liang Zhen (Inactive) [ 21/Dec/12 ]

This patch fixed a deadlock in ptlrpc: http://review.whamcloud.com/#change,4880 ; hopefully it helps here too.

Comment by Alex Zhuravlev [ 21/Dec/12 ]

This patch should dump request/import information if the lock can't be acquired within 5s: http://review.whamcloud.com/4881
It would be great to give it a spin along with Liang's patch, if possible.
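
A rough sketch of what such a debug check might look like (not the actual 4881 change; the trylock/deadline structure is inferred from the "spinning too long" messages and the spin_trylock() loop mentioned later in this ticket — the original version LBUG()ed here and was later softened to only print):

/* Illustrative sketch: replace the unconditional spin_lock(&imp->imp_lock)
 * in ptlrpc_check_set() with a trylock loop that complains if the lock
 * cannot be taken within 5 seconds. */
unsigned long deadline = jiffies + 5 * HZ;

while (!spin_trylock(&imp->imp_lock)) {
        if (time_after(jiffies, deadline)) {
                DEBUG_REQ(D_ERROR, req, "spinning too long...");
                CERROR("import %p, obd %p, imp_lock = %u\n",
                       imp, imp->imp_obd,
                       imp->imp_lock.raw_lock.slock);
                deadline = jiffies + 5 * HZ;    /* keep complaining */
        }
        cpu_relax();
}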

Comment by Prakash Surya (Inactive) [ 27/Dec/12 ]

Liang, Alex, I pulled in those two patches (with a small tweak to 4881). It will probably be a week or two until we get time on the machine again to test them out.

Comment by Christopher Morrone [ 08/Jan/13 ]

Testing lustre 2.3.58-4chaos, we hit the mount hang.

The console shows:

2013-01-08 13:41:15.901582 {DefaultControlEventListener} [mmcs]{592}.0.2: LustreError: 3261:0:(client.c:1793:ptlrpc_check_set()) @@@ spining..  req@c00000039b36b400 x1423631018098861/t0(0) o8->lsfull-OST0034-osc-c0000003c1767800@172.20.1.52@o2ib500:28/4 lens 400/264 e 0 to 0 dl 1357681320 ref 1 fl Complete:RN/0/0 rc 0/0
2013-01-08 13:41:15.902369 {DefaultControlEventListener} [mmcs]{592}.0.2: LustreError: 3261:0:(client.c:1797:ptlrpc_check_set()) import c0000003ce3c7000, obd c0000003ce29fc80, imp_lock = 2147483654
2013-01-08 13:41:15.902795 {DefaultControlEventListener} [mmcs]{592}.0.2: LustreError: 3261:0:(client.c:1800:ptlrpc_check_set()) LBUG
2013-01-08 13:41:15.903241 {DefaultControlEventListener} [mmcs]{592}.0.2: Call Trace:
2013-01-08 13:41:15.903679 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91af980] [c000000000008d7c] .show_stack+0x7c/0x184 (unreliable)
2013-01-08 13:41:15.904114 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91afa30] [8000000000ae0cb8] .libcfs_debug_dumpstack+0xd8/0x150 [libcfs]
2013-01-08 13:41:15.904602 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91afae0] [8000000000ae1480] .lbug_with_loc+0x50/0xc0 [libcfs]
2013-01-08 13:41:15.905036 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91afb70] [8000000003abaa7c] .ptlrpc_check_set+0x62c/0x50f0 [ptlrpc]
2013-01-08 13:41:15.905469 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91afd20] [8000000003b115cc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
2013-01-08 13:41:15.905893 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91afe40] [8000000003b11ab8] .ptlrpcd+0x2b8/0x510 [ptlrpc]
2013-01-08 13:41:15.906341 {DefaultControlEventListener} [mmcs]{592}.0.2: [c0000003e91aff90] [c00000000001b9e0] .kernel_thread+0x54/0x70
2013-01-08 13:41:15.906756 {DefaultControlEventListener} [mmcs]{592}.2.1: LustreError: dumping log to /tmp/lustre-log.1357681275.3261
2013-01-08 13:41:15.907559 {DefaultControlEventListener} [mmcs]{592}.0.1: ^GMessage from syslogd@(none) at Jan  8 13:41:15 ...
2013-01-08 13:41:15.908017 {DefaultControlEventListener} [mmcs]{592}.0.1:  kernel:LustreError: 3261:0:(client.c:1800:ptlrpc_check_set()) LBUG
2013-01-08 13:42:23.488699 {DefaultControlEventListener} [mmcs]{592}.1.2: BUG: soft lockup - CPU#6 stuck for 68s! [ll_ping:3278]
2013-01-08 13:42:23.489613 {DefaultControlEventListener} [mmcs]{592}.1.2: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
2013-01-08 13:42:23.490135 {DefaultControlEventListener} [mmcs]{592}.1.2: NIP: c00000000043972c LR: 8000000003af430c CTR: c0000000004396f8
2013-01-08 13:42:23.490608 {DefaultControlEventListener} [mmcs]{592}.1.2: REGS: c0000003e91f3ba0 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.15llnl.V1R1M2.bgq62_16.ppc64)
2013-01-08 13:42:23.491118 {DefaultControlEventListener} [mmcs]{592}.1.2: MSR: 0000000080029000 <EE,ME,CE>  CR: 84222484  XER: 20000000
2013-01-08 13:42:23.491631 {DefaultControlEventListener} [mmcs]{592}.1.2: TASK = c0000003e46e6f60[3278] 'll_ping' THREAD: c0000003e91f0000 CPU: 6
2013-01-08 13:42:23.492171 {DefaultControlEventListener} [mmcs]{592}.1.2: GPR00: 0000000080000006 c0000003e91f3e20 c0000000006eece8 c0000003ce3c7278 
2013-01-08 13:42:23.492694 {DefaultControlEventListener} [mmcs]{592}.1.2: GPR04: 2222222222222222 0000000000000000 c0000003e91f3d38 0000000000000000 
2013-01-08 13:42:23.493173 {DefaultControlEventListener} [mmcs]{592}.1.2: GPR08: 0000000000000000 0000000080000006 000000000000000f c0000000004396f8 
2013-01-08 13:42:23.493711 {DefaultControlEventListener} [mmcs]{592}.1.2: GPR12: 8000000003b401a8 c000000000757300 
2013-01-08 13:42:23.494240 {DefaultControlEventListener} [mmcs]{592}.1.2: NIP [c00000000043972c] ._spin_lock+0x34/0x44
2013-01-08 13:42:23.494775 {DefaultControlEventListener} [mmcs]{592}.1.2: LR [8000000003af430c] .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
2013-01-08 13:42:23.495329 {DefaultControlEventListener} [mmcs]{592}.1.2: Call Trace:
2013-01-08 13:42:23.495845 {DefaultControlEventListener} [mmcs]{592}.1.2: [c0000003e91f3e20] [8000000003af4280] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable)
2013-01-08 13:42:23.496386 {DefaultControlEventListener} [mmcs]{592}.1.2: [c0000003e91f3f90] [c00000000001b9e0] .kernel_thread+0x54/0x70
2013-01-08 13:42:23.496898 {DefaultControlEventListener} [mmcs]{592}.1.2: Instruction dump:
2013-01-08 13:42:23.497449 {DefaultControlEventListener} [mmcs]{592}.1.2: 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c 
2013-01-08 13:42:23.497994 {DefaultControlEventListener} [mmcs]{592}.1.2: 2fa00000 4dfe0020 7c210b78 80030000 <2fa00000> 40defff4 7c421378 4bffffc8 
2013-01-08 13:48:58.511240 {DefaultControlEventListener} [mmcs]{592}.14.1: 2013-01-08 13:48:50.540 (INFO ) [0xfffa294f0c0] ibm.cios.sysiod.ClientMonitor: Using large Region=1048576 small =65536 Compile date=Oct 29 2012 Time=18:06:51

See the attached lustre log file "lustre-log.1357681275.3261.txt".

Comment by Alex Zhuravlev [ 14/Jan/13 ]

00000100:00020000:2.0:1357681275.788314:1632:3261:0:(client.c:1797:ptlrpc_check_set()) import c0000003ce3c7000, obd c0000003ce29fc80, imp_lock = 2147483654
(gdb) p/x 2147483654
$1 = 0x80000006
#ifdef CONFIG_PPC64
/* use 0x800000yy when locked, where yy == CPU number */
#define LOCK_TOKEN (*(u32 *)(&get_paca()->lock_token))

so, imp_lock was held by cpu #6, which later was reported with:
2013-01-08 13:42:23.488699 {DefaultControlEventListener} [mmcs]{592}.1.2: BUG: soft lockup - CPU#6 stuck for 68s! [ll_ping:3278]

This means either CPU#6 is trying to acquire imp_lock recursively (unlikely, the code is easy to follow), or CPU#6 has scheduled out a thread holding imp_lock and is now running the pinger.
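
For illustration, a small decoder for those imp_lock values (a hypothetical userspace helper; the 0x800000yy encoding is taken from the kernel comment quoted above):

#include <stdio.h>

/* On this 2.6.32 ppc64 kernel a held spinlock stores LOCK_TOKEN, i.e.
 * 0x800000yy where yy is the number of the CPU holding the lock.
 * Hypothetical helper to decode the values printed by the debug patch. */
static void decode_imp_lock(unsigned int slock)
{
        if (slock == 0)
                printf("imp_lock 0x%08x: not held\n", slock);
        else
                printf("imp_lock 0x%08x: held by CPU#%u\n",
                       slock, slock & 0xff);
}

int main(void)
{
        decode_imp_lock(2147483654U);   /* 0x80000006 -> CPU#6 */
        decode_imp_lock(2147483652U);   /* 0x80000004 -> CPU#4 */
        return 0;
}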

Comment by Prakash Surya (Inactive) [ 14/Jan/13 ]

I have enabled CONFIG_DEBUG_SPINLOCK on these kernels, but was unable to turn on CONFIG_LOCK_STAT.

Chris, the next time we hit this, can you log into the node and check the console? If we schedule while holding a spin lock, it should print this message:

        printk(KERN_ERR                                                         
                "BUG: sleeping function called from invalid context at %s:%d\n",
                        file, line);                                            
        printk(KERN_ERR                                                         
                "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",    
                        in_atomic(), irqs_disabled(),                           
                        current->pid, current->comm);

This should also make it to the logs, but I don't trust the way we filter some messages out.

Comment by Alex Zhuravlev [ 14/Jan/13 ]

oh, great news. thanks, Prakash!

Comment by Andreas Dilger [ 14/Jan/13 ]

I've landed http://review.whamcloud.com/#change,4880 to master.

Comment by Alex Zhuravlev [ 14/Jan/13 ]

Andreas, it seems 4880 didn't help in this case, see comment from Dec 28

Comment by Prakash Surya (Inactive) [ 18/Jan/13 ]

So, during testing yesterday I looped through many mounts/unmounts of the file system on each of the Sequoia IONs running 2.3.58-5chaos and didn't run into a single hang. This is confusing because Chris noted that we hit the hang on 2.3.58-4chaos, and not much has changed between the two tags:

$ git diff 2.3.58-4chaos..2.3.58-5chaos
diff --git a/lustre/lod/lod_qos.c b/lustre/lod/lod_qos.c
index 187f85c..4e9a651 100644
--- a/lustre/lod/lod_qos.c
+++ b/lustre/lod/lod_qos.c
@@ -1264,9 +1264,11 @@ static int lod_qos_parse_config(const struct lu_env *env,
 
        if (magic == __swab32(LOV_USER_MAGIC_V1)) {
                lustre_swab_lov_user_md_v1(v1);
+               magic = v1->lmm_magic;
        } else if (magic == __swab32(LOV_USER_MAGIC_V3)) {
                v3 = buf->lb_buf;
                lustre_swab_lov_user_md_v3(v3);
+               magic = v3->lmm_magic;
        }
 
        if (unlikely(magic != LOV_MAGIC_V1 && magic != LOV_MAGIC_V3)) {

Also, I looked into the kernel sources some more, and contrary to my comment above, I don't think anything will be printed to the console if we schedule while holding a spinlock. The lines I quoted above only print if might_sleep is called, and the message printed from within schedule_debug is only printed if CONFIG_PREEMPT is enabled (which is not on the IONs).

Comment by Christopher Morrone [ 28/Jan/13 ]

Lowered priority on this one. We haven't seen the problem in a couple of weeks. Not entirely sure why/where it could have been fixed, but this is not going to get our attention if we don't see it again.

Comment by Christopher Morrone [ 12/Feb/13 ]

Had some evidence that this problem is still hitting. Of course, the debugging patch itself is fatal (calls LBUG()), so I'm not 100% sure that the problem wasn't just the debug patch itself. I modified the debug patch to print the debug message and drop a backtrace, but not LBUG.

But sadly, I don't think this bug fixed itself.

Comment by Prakash Surya (Inactive) [ 12/Feb/13 ]

I've bumped the priority of this back to "Blocker" status as we've hit this a number of times today, running 2.3.58-12chaos.

We've modified the debug patch a bit, but it basically shows the same information. A few example messages from the logs:

2013-02-12 16:04:00.355651 {DefaultControlEventListener} [mmcs]{169}.0.1: LustreError: 6708:0:(client.c:1793:ptlrpc_check_set()) @@@ spinning too long...  req@c000000386202000 x1426811847180481/t0(0) o8->lsfull-OST0048-osc-c0000003c6ee5c00@172.20.1.72@o2ib500:28/4 lens 400/264 e 0 to 0 dl 1360713885 ref 1 fl Complete:RN/0/0 rc 0/0
2013-02-12 16:04:00.356429 {DefaultControlEventListener} [mmcs]{169}.0.1: LustreError: 6708:0:(client.c:1797:ptlrpc_check_set()) import c000000383708800, obd c000000358703880, imp_lock = 2147483652
2013-02-12 16:05:42.669084 {DefaultControlEventListener} [mmcs]{535}.0.2: LustreError: 5383:0:(client.c:1793:ptlrpc_check_set()) @@@ spinning too long...  req@c00000039ea4d800 x1426811960426622/t0(0) o8->lsfull-OST0005-osc-c00000038533e000@172.20.1.5@o2ib500:28/4 lens 400/264 e 0 to 0 dl 1360713987 ref 1 fl Complete:RN/0/0 rc 0/0
2013-02-12 16:05:42.669920 {DefaultControlEventListener} [mmcs]{535}.0.2: LustreError: 5383:0:(client.c:1797:ptlrpc_check_set()) import c000000382a4a800, obd c0000003c360c5c0, imp_lock = 2147483655

It's interesting to note that the raw_lock.slock value is slightly different in these two messages. In hex: 2147483652 = 0x80000004 != 2147483655 = 0x80000007.

Comment by Alex Zhuravlev [ 14/Feb/13 ]

Christopher, do you think the system gets back to "normal" after you removed the LBUG()? Or do those threads keep spinning on the locks forever?

Prakash, the values are different because the locks belong to different imports, so their current owners differ (CPU#4 and CPU#7).

Comment by Christopher Morrone [ 14/Feb/13 ]

It hit again after removing the LBUG(), so now I can definitively say that the two threads get stuck spinning on a spin lock indefinitely.

Comment by Alex Zhuravlev [ 14/Feb/13 ]

OK, then there should be a list of the backtraces?

Comment by Christopher Morrone [ 14/Feb/13 ]

Please see the attached files seqio162_console.txt (contains sysrq-l and sysrq-t backtraces) and seqio162_lustre_log.txt.

Comment by Alex Zhuravlev [ 15/Feb/13 ]

OK, thanks. Are you still running the kernel with CONFIG_DEBUG_SPINLOCK? If so, let's drop that Lustre patch and try to see the owner.

Comment by Prakash Surya (Inactive) [ 15/Feb/13 ]

Unfortunately, CONFIG_DEBUG_SPINLOCK was dropped when we upgraded the kernel a couple weeks ago. I need to talk to the admins about why they dropped it, and if it can get enabled again. I'm also interested in finding out who the owner is.

Comment by Christopher Morrone [ 15/Feb/13 ]

Yes, when we went to the RHEL6.3 kernel on Sequoia last Thursday, we lost CONFIG_DEBUG_SPINLOCK. Apparently the kernel would no longer operate with that enabled, as we've seen with other debug settings.

Comment by Alex Zhuravlev [ 20/Feb/13 ]

any chance to get it back ?

Comment by Prakash Surya (Inactive) [ 20/Feb/13 ]

I'm not sure yet; I need to get a test machine to play around with the kernel and determine what the underlying issue is with that option set. Pending work to be done. Is that the best option available to us? I'd imagine knowing what the "owner" field is would tell us a lot.

Comment by Alex Zhuravlev [ 20/Feb/13 ]

Well, I've read through the code a few times now, and Liang did as well. Still no idea how we get into this, sorry. I hoped to get a hint from the kernel.

Comment by Prakash Surya (Inactive) [ 21/Feb/13 ]

Alex, we've re-enabled CONFIG_DEBUG_SPINLOCK on the ION kernels. I've added a patch to print out the void *owner, int magic, and int owner_cpu fields of the spinlock_t when the "spinning" message is printed. Is there any other information I should print when this hits? Or any other debugging info that would be useful?
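
A minimal sketch of that kind of print, assuming the CONFIG_DEBUG_SPINLOCK bookkeeping fields of the 2.6.32 spinlock_t (magic, owner, owner_cpu); the actual patch may differ:

/* Illustrative only: emit the CONFIG_DEBUG_SPINLOCK fields alongside the
 * existing "spinning too long" message so the holder of imp_lock can be
 * identified.  Field names assume the 2.6.32 spinlock_t layout. */
#ifdef CONFIG_DEBUG_SPINLOCK
        CERROR("import %p, imp_lock = %u, magic = %08x, owner = %p, owner_cpu = %u\n",
               imp, imp->imp_lock.raw_lock.slock,
               imp->imp_lock.magic,
               imp->imp_lock.owner,
               imp->imp_lock.owner_cpu);
#endif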

Comment by Prakash Surya (Inactive) [ 26/Feb/13 ]

I've tried reproducing the hang with CONFIG_DEBUG_SPINLOCK enabled, but haven't been able to trigger it.

Comment by Jodi Levi (Inactive) [ 12/Mar/13 ]

Have you been able to reproduce this issue? Or should we close this ticket and reopen if it occurs again?

Comment by Christopher Morrone [ 12/Mar/13 ]

We seem to have the problem that enabling kernel spin lock debugging makes the problem go away. I think that also explains the earlier period when the problem seemed to disappear.

So I would say that the ticket should remain open, because there seems to be a lustre race here somewhere. But the priority can be dropped in favor of other things that are more pressing.

Comment by Christopher Morrone [ 06/Jun/13 ]

This is still a problem. Unfortunately, we never seem to hit the problem while CONFIG_DEBUG_SPINLOCK is enabled, but it hits fairly frequently with it disabled.

In the latest occurrence, which I investigated yesterday, I see the following stacks:

CPU1:

ptlrpcd_check_set
ptlrpcd_check
ptlrpcd

It appears to be stuck spinning on the "while (spin_trylock (&imp->imp_lock)" line.

CPU5:

ptlrpc_pinger_process_import (inlined)
ptlrpc_pinger_main

It is spinning on "spin_lock(&imp->imp_lock)" in ptlrpc_pinger_process_import() while holding the pinger mutex.

Not spinning, but stuck sleeping on the pinger mutex is the mount.lustre process:

schedule
__mutex_lock_slowpath
mutex_lock
ptlrpc_pinger_add_import
client_connect_import
lov_connect_obd
lov_connect
ll_fill_super
lustre_fill_super
get_sb_nodev
lustre_get_sb
vfs_kern_mount
do_kern_mount
do_mount
sys_mount
syscall_exit

But mount.lustre waiting on the mutex is understood since the other thread is holding it. We still need to understand who is holding the import spin lock.

Looking at our debugging statement ("spinning too long"), I see that the value of the spin lock is 2147483652. In hex, that is:

0x80000004

Looking at the Linux kernel PowerPC spin lock code, I see that when the lock is taken, the least significant byte is set to the CPU number. So the lock value looks correct, and apparently the lock was taken on a CPU that is not currently in use by any process.

So that just seems to further support the assumption that Lustre is failing to unlock imp_lock somewhere.
