Details
Description
We have another instance of clients getting stuck at mount time. Out of 768 client, 766 mounted successfully, and 2 appear to be stuck. So far it has been several minutes. They both had the same soft lockup messages on the console:
LustreError: 20578:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2 BUG: soft lockup - CPU#33 stuck for 68s! [ll_ping:3244] Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm NIP: c00000000042e190 LR: 8000000003a9ff2c CTR: c00000000042e160 REGS: c0000003ca60fba0 TRAP: 0901 Not tainted (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64) MSR: 0000000080029000 <EE,ME,CE> CR: 84228424 XER: 20000000 TASK = c0000003ca51e600[3244] 'll_ping' THREAD: c0000003ca60c000 CPU: 33 GPR00: 0000000080000021 c0000003ca60fe20 c0000000006de510 c0000003c00c0a78 GPR04: 2222222222222222 0000000000000000 c0000003ca60fd38 0000000000000000 GPR08: 0000000000000000 0000000080000021 0000000000000000 c00000000042e160 GPR12: 8000000003aebda8 c00000000075d200 NIP [c00000000042e190] ._spin_lock+0x30/0x44 LR [8000000003a9ff2c] .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc] Call Trace: [c0000003ca60fe20] [8000000003a9fea0] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable) [c0000003ca60ff90] [c00000000001a9e0] .kernel_thread+0x54/0x70 Instruction dump: 38000000 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c 2fa00000 4dfe0020 7c210b78 <80030000> 2fa00000 40defff4 7c421378 BUG: soft lockup - CPU#49 stuck for 68s! [ptlrpcd_rcv:3175] Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm NIP: c00000000042e198 LR: 8000000003a66af4 CTR: c00000000042e160 REGS: c0000003e9bf3900 TRAP: 0901 Not tainted (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64) MSR: 0000000080029000 <EE,ME,CE> CR: 84228444 XER: 20000000 TASK = c0000003ea1f0740[3175] 'ptlrpcd_rcv' THREAD: c0000003e9bf0000 CPU: 49 GPR00: 0000000080000021 c0000003e9bf3b80 c0000000006de510 c0000003c00c0a78 GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000 GPR08: c0000003c00c0a60 0000000080000031 c00000036a060900 c00000000042e160 GPR12: 8000000003aebda8 c00000000076a200 NIP [c00000000042e198] ._spin_lock+0x38/0x44 LR [8000000003a66af4] .ptlrpc_check_set+0x4f4/0x4e80 [ptlrpc] Call Trace: [c0000003e9bf3b80] [8000000003a66964] .ptlrpc_check_set+0x364/0x4e80 [ptlrpc] (unreliable) [c0000003e9bf3d20] [8000000003abd1cc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc] [c0000003e9bf3e40] [8000000003abd6b8] .ptlrpcd+0x2b8/0x510 [ptlrpc] [c0000003e9bf3f90] [c00000000001a9e0] .kernel_thread+0x54/0x70 Instruction dump: 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c 2fa00000 4dfe0020 7c210b78 80030000 2fa00000 <40defff4> 7c421378 4bffffc8 7c0802a6
Both nodes are pretty much idle. I'll see what other information I can gather.
Attachments
Issue Links
- is related to
-
LU-2572 Test LU-2263 and LU-2327 patches on Sequoia
- Resolved
There are no Sub-Tasks for this issue.