Description
We have another instance of clients getting stuck at mount time. Out of 768 clients, 766 mounted successfully and 2 appear to be stuck; they have been hung for several minutes so far. Both nodes had the same soft lockup messages on the console:
LustreError: 20578:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
BUG: soft lockup - CPU#33 stuck for 68s! [ll_ping:3244]
Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
NIP: c00000000042e190 LR: 8000000003a9ff2c CTR: c00000000042e160
REGS: c0000003ca60fba0 TRAP: 0901  Not tainted (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
MSR: 0000000080029000 <EE,ME,CE>  CR: 84228424  XER: 20000000
TASK = c0000003ca51e600[3244] 'll_ping' THREAD: c0000003ca60c000 CPU: 33
GPR00: 0000000080000021 c0000003ca60fe20 c0000000006de510 c0000003c00c0a78
GPR04: 2222222222222222 0000000000000000 c0000003ca60fd38 0000000000000000
GPR08: 0000000000000000 0000000080000021 0000000000000000 c00000000042e160
GPR12: 8000000003aebda8 c00000000075d200
NIP [c00000000042e190] ._spin_lock+0x30/0x44
LR [8000000003a9ff2c] .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
Call Trace:
[c0000003ca60fe20] [8000000003a9fea0] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable)
[c0000003ca60ff90] [c00000000001a9e0] .kernel_thread+0x54/0x70
Instruction dump:
38000000 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0
4c00012c 2fa00000 4dfe0020 7c210b78 <80030000> 2fa00000 40defff4 7c421378

BUG: soft lockup - CPU#49 stuck for 68s! [ptlrpcd_rcv:3175]
Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
NIP: c00000000042e198 LR: 8000000003a66af4 CTR: c00000000042e160
REGS: c0000003e9bf3900 TRAP: 0901  Not tainted (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
MSR: 0000000080029000 <EE,ME,CE>  CR: 84228444  XER: 20000000
TASK = c0000003ea1f0740[3175] 'ptlrpcd_rcv' THREAD: c0000003e9bf0000 CPU: 49
GPR00: 0000000080000021 c0000003e9bf3b80 c0000000006de510 c0000003c00c0a78
GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
GPR08: c0000003c00c0a60 0000000080000031 c00000036a060900 c00000000042e160
GPR12: 8000000003aebda8 c00000000076a200
NIP [c00000000042e198] ._spin_lock+0x38/0x44
LR [8000000003a66af4] .ptlrpc_check_set+0x4f4/0x4e80 [ptlrpc]
Call Trace:
[c0000003e9bf3b80] [8000000003a66964] .ptlrpc_check_set+0x364/0x4e80 [ptlrpc] (unreliable)
[c0000003e9bf3d20] [8000000003abd1cc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
[c0000003e9bf3e40] [8000000003abd6b8] .ptlrpcd+0x2b8/0x510 [ptlrpc]
[c0000003e9bf3f90] [c00000000001a9e0] .kernel_thread+0x54/0x70
Instruction dump:
812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c 2fa00000
4dfe0020 7c210b78 80030000 2fa00000 <40defff4> 7c421378 4bffffc8 7c0802a6
Both nodes are pretty much idle. I'll see what other information I can gather.
Issue Links
- is related to LU-2572 Test LU-2263 and LU-2327 patches on Sequoia (Resolved)
This is still a problem. Unfortunately, we never seem to hit the problem while CONFIG_DEBUG_SPINLOCK is enabled, but it hits fairly frequently with it disabled.
In the latest occurrence, which I investigated yesterday, I see the following stacks:
CPU1:
It appears to be stuck spinning on the "while (spin_trylock (&imp->imp_lock)" line.
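That line appears to come from local debugging instrumentation rather than stock code (presumably the same instrumentation that prints the "spinning too long" message discussed below). A minimal sketch of such a loop, purely hypothetical and not the actual patch, might look like this:

#include <linux/kernel.h>
#include <linux/spinlock.h>

/* Hypothetical sketch of a "spin with diagnostics" loop like the one quoted
 * above; not the actual debug patch.  It keeps retrying the lock and, if
 * that takes far too long, dumps the raw lock word so we can see which CPU
 * supposedly holds it. */
static void debug_spin_lock(spinlock_t *lock)
{
	unsigned long tries = 0;

	while (!spin_trylock(lock)) {
		if (++tries % 1000000000UL == 0)
			printk(KERN_ERR "spinning too long, lock value %u\n",
			       *(volatile unsigned int *)lock);
		cpu_relax();
	}
	/* on return the caller owns the lock and must spin_unlock() it */
}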
CPU5:
It is spinning on "spin_lock(&imp->imp_lock)" in ptlrpc_pinger_process_import() while holding the pinger mutex.
The mount.lustre process is not spinning, but it is stuck sleeping on the pinger mutex:
mount.lustre waiting on the mutex is expected, since the pinger thread is holding it. What we still need to understand is who is holding the import spin lock.
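To make the dependency chain concrete, here is a minimal sketch (generic kernel C, not the actual Lustre source; the lock names merely stand in for the pinger mutex and imp->imp_lock): a leaked imp_lock stalls the pinger thread, and the pinger mutex it holds in turn stalls mount.lustre.

#include <linux/mutex.h>
#include <linux/spinlock.h>

static DEFINE_MUTEX(pinger_mutex);	/* stands in for the pinger mutex */
static DEFINE_SPINLOCK(imp_lock);	/* stands in for imp->imp_lock */

/* CPU5: takes the pinger mutex, then spins forever if imp_lock was leaked. */
static void pinger_path(void)
{
	mutex_lock(&pinger_mutex);
	spin_lock(&imp_lock);		/* never succeeds if imp_lock is leaked */
	/* ... process the import ... */
	spin_unlock(&imp_lock);
	mutex_unlock(&pinger_mutex);
}

/* mount.lustre: merely sleeps here, queued behind the stuck pinger thread. */
static void mount_path(void)
{
	mutex_lock(&pinger_mutex);
	/* ... register the new import with the pinger ... */
	mutex_unlock(&pinger_mutex);
}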
Looking at our debugging statement ("spinning too long"), I see that the value of the spin lock is 2147483652, which is 0x80000004 in hex.
Looking at the Linux kernel PowerPC spinlock code, I see that when a lock is held, the least significant byte of the lock word is set to the number of the CPU holding it. So the lock value looks well-formed, but apparently the lock was taken on a CPU that is not currently running any process.
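As a quick sanity check of that reading, the observed value can be decoded as follows (a userspace sketch that simply assumes the encoding described above, i.e. a held lock stores 0x80000000 plus the holder's CPU number in the low byte):

#include <stdio.h>

int main(void)
{
	unsigned int lock = 2147483652u; /* value from the "spinning too long" message */

	/* low byte = CPU number of the holder, per the encoding described above */
	printf("lock = 0x%08x => holder CPU %u\n", lock, lock & 0xffu);
	/* prints: lock = 0x80000004 => holder CPU 4 */
	return 0;
}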
That just seems to further support the assumption that Lustre is failing to unlock imp_lock somewhere.
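For illustration, the suspected class of bug would look something like the following (hypothetical code, not an actual Lustre function; the helper and lock are only stand-ins): an error path that returns without dropping the lock, after which every later caller spins on it forever.

#include <linux/errno.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(imp_lock);	/* stands in for imp->imp_lock */

/* Hypothetical example of the suspected bug class: an early-return error
 * path that forgets to release the spinlock it took. */
static int update_import_state(int new_state)
{
	spin_lock(&imp_lock);
	if (new_state < 0)
		return -EINVAL;		/* BUG: imp_lock is never released */
	/* ... update import state under the lock ... */
	spin_unlock(&imp_lock);
	return 0;
}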