Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • None
    • Lustre 2.4.0
    • Environment: Sequoia, 2.3.54-2chaos, github.com/chaos/lustre
    • Severity: 3
    • 5558

    Description

      We have another instance of clients getting stuck at mount time. Out of 768 clients, 766 mounted successfully and 2 appear to be stuck; they have been stuck for several minutes so far. Both had the same soft lockup messages on the console:

      LustreError: 20578:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
      BUG: soft lockup - CPU#33 stuck for 68s! [ll_ping:3244]
      Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
      NIP: c00000000042e190 LR: 8000000003a9ff2c CTR: c00000000042e160
      REGS: c0000003ca60fba0 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
      MSR: 0000000080029000 <EE,ME,CE>  CR: 84228424  XER: 20000000
      TASK = c0000003ca51e600[3244] 'll_ping' THREAD: c0000003ca60c000 CPU: 33
      GPR00: 0000000080000021 c0000003ca60fe20 c0000000006de510 c0000003c00c0a78
      GPR04: 2222222222222222 0000000000000000 c0000003ca60fd38 0000000000000000
      GPR08: 0000000000000000 0000000080000021 0000000000000000 c00000000042e160
      GPR12: 8000000003aebda8 c00000000075d200
      NIP [c00000000042e190] ._spin_lock+0x30/0x44
      LR [8000000003a9ff2c] .ptlrpc_pinger_main+0x19c/0xcc0 [ptlrpc]
      Call Trace:
      [c0000003ca60fe20] [8000000003a9fea0] .ptlrpc_pinger_main+0x110/0xcc0 [ptlrpc] (unreliable)
      [c0000003ca60ff90] [c00000000001a9e0] .kernel_thread+0x54/0x70
      Instruction dump:
      38000000 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0
      4c00012c 2fa00000 4dfe0020 7c210b78 <80030000> 2fa00000 40defff4 7c421378
      BUG: soft lockup - CPU#49 stuck for 68s! [ptlrpcd_rcv:3175]
      Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
      NIP: c00000000042e198 LR: 8000000003a66af4 CTR: c00000000042e160
      REGS: c0000003e9bf3900 TRAP: 0901   Not tainted  (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64)
      MSR: 0000000080029000 <EE,ME,CE>  CR: 84228444  XER: 20000000
      TASK = c0000003ea1f0740[3175] 'ptlrpcd_rcv' THREAD: c0000003e9bf0000 CPU: 49
      GPR00: 0000000080000021 c0000003e9bf3b80 c0000000006de510 c0000003c00c0a78
      GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
      GPR08: c0000003c00c0a60 0000000080000031 c00000036a060900 c00000000042e160
      GPR12: 8000000003aebda8 c00000000076a200
      NIP [c00000000042e198] ._spin_lock+0x38/0x44
      LR [8000000003a66af4] .ptlrpc_check_set+0x4f4/0x4e80 [ptlrpc]
      Call Trace:
      [c0000003e9bf3b80] [8000000003a66964] .ptlrpc_check_set+0x364/0x4e80 [ptlrpc] (unreliable)
      [c0000003e9bf3d20] [8000000003abd1cc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc]
      [c0000003e9bf3e40] [8000000003abd6b8] .ptlrpcd+0x2b8/0x510 [ptlrpc]
      [c0000003e9bf3f90] [c00000000001a9e0] .kernel_thread+0x54/0x70
      Instruction dump:
      812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c 2fa00000
      4dfe0020 7c210b78 80030000 2fa00000 <40defff4> 7c421378 4bffffc8 7c0802a6
      

      Both nodes are pretty much idle. I'll see what other information I can gather.

      Attachments

        1. lustre-log.1357681275.3261.txt
          3.02 MB
        2. seqio162_console.txt
          1.58 MB
        3. seqio162_lustre_log.txt
          3.06 MB
        4. seqio542_console_sysrq.log
          1.66 MB
        5. seqio542_lustre.log
          7.99 MB
        6. seqio652_lustre.log
          8.00 MB
        7. seqio652_lustre2.log
          4.02 MB

        Issue Links

          Activity

            [LU-2327] Clients stuck in mount

            This is still a problem. Unfortunately, we never seem to hit it while CONFIG_DEBUG_SPINLOCK is enabled, but it hits fairly frequently with that option disabled.

            In the latest occurrence, which I investigated yesterday, I see the following stacks:

            CPU1:

            ptlrpcd_check_set
            ptlrpcd_check
            ptlrpcd
            

            It appears to be stuck spinning on the "while (spin_trylock(&imp->imp_lock))" line.

            CPU5:

            ptlrpc_pinger_process_import (inlined)
            ptlrpc_pinger_main
            

            It is spinning on "spin_lock(&imp->imp_lock)" in ptlrpc_pinger_process_import() while holding the pinger mutex.

            Not spinning, but stuck sleeping on the pinger mutex is the mount.lustre process:

            schedule
            __mutex_lock_slowpath
            mutex_lock
            ptlrpc_pinger_add_import
            client_connect_import
            lov_connect_obd
            lov_connect
            ll_fill_super
            lustre_fill_super
            get_sb_nodev
            lustre_get_sb
            vfs_kern_mount
            do_kern_mount
            do_mount
            sys_mount
            syscall_exit
            

            But mount.lustre waiting on the mutex is expected, since the other thread holds it. We still need to understand who is holding the import spin lock.
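
            To summarize the interaction as pseudocode (a sketch based only on the stacks and quotes above; the exact loop shape and symbol names such as pinger_mutex are assumptions, not the actual Lustre source):

            /* CPU5: pinger thread (ptlrpc_pinger_main -> ptlrpc_pinger_process_import) */
            mutex_lock(&pinger_mutex);     /* held across the import scan */
            spin_lock(&imp->imp_lock);     /* spins here: whoever holds the lock never released it */

            /* CPU1: ptlrpcd thread (stack above) */
            while (!spin_trylock(&imp->imp_lock))
                    ;                      /* busy-waits on the same imp_lock; this is where
                                            * the "spinning too long" debug message (below) fires */

            /* mount.lustre (ptlrpc_pinger_add_import) */
            mutex_lock(&pinger_mutex);     /* sleeps: CPU5 already holds the pinger mutex */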

            Looking at our debugging statement ("spinning too long"), I see that the value of the spin lock is 2147483652. In hex, that is:

            0x80000004

            Looking at the Linux kernel PowerPC spin lock code, I see that when the lock is held, the least significant byte is set to the holder's CPU number. So the lock value looks correct, and apparently the lock was taken on a CPU that is not currently in use by any processes.
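
            As a quick check of that arithmetic (a standalone snippet, assuming the "least significant byte holds the CPU number" encoding just described):

            unsigned int slock = 2147483652u;  /* value from the "spinning too long" message */
            /* slock == 0x80000004: high bit set, so the lock is held */
            unsigned int cpu = slock & 0xff;   /* cpu == 4, the apparent holder */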

            So that just seems to further support the assumption that Lustre is failing to unlock the imp_lock somewhere.

            morrone Christopher Morrone (Inactive) added a comment

            We seem to have the problem that enabling kernel spin lock debugging makes the bug go away. I think that explains the earlier period when the problem went away as well.

            So I would say the ticket should remain open, because there seems to be a Lustre race here somewhere. But its priority can be dropped in favor of other, more pressing things.

            morrone Christopher Morrone (Inactive) added a comment

            Have you been able to reproduce this issue? Or should we close this ticket and reopen if it occurs again?

            jlevi Jodi Levi (Inactive) added a comment

            I've tried reproducing the hang with CONFIG_DEBUG_SPINLOCK enabled, but haven't been able to trigger it.

            prakash Prakash Surya (Inactive) added a comment

            Alex, we've re-enabled CONFIG_DEBUG_SPINLOCK on the ION kernels. I've added a patch to print out the void *owner, int magic, and int owner_cpu fields of the spinlock_t when the "spinning" message is printed. Is there any other information I should print when this hits? Or any other debugging info that would be useful?
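
            (For reference, the added print presumably looks something like this; a sketch only, assuming a spinlock_t *lock is in scope at the "spinning" message site and the 2.6.32-era CONFIG_DEBUG_SPINLOCK layout of spinlock_t.)

            #ifdef CONFIG_DEBUG_SPINLOCK
                    /* magic, owner and owner_cpu only exist in spinlock_t
                     * when CONFIG_DEBUG_SPINLOCK is enabled */
                    printk(KERN_ERR "spinning on %p: magic %08x owner %p owner_cpu %u\n",
                           lock, lock->magic, lock->owner, lock->owner_cpu);
            #endif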

            prakash Prakash Surya (Inactive) added a comment

            Well, I've read through the code a few times now, and Liang did as well. Still no idea how we get into this, sorry. I was hoping to get a hint from the kernel.

            bzzz Alex Zhuravlev added a comment

            I'm not sure yet; I need to get a test machine to play around with the kernel and determine what the underlying issue is with that option set. That work is still pending. Is that the best option available to us? I'd imagine knowing what the "owner" field is would tell us a lot.

            prakash Prakash Surya (Inactive) added a comment

            Any chance of getting it back?

            bzzz Alex Zhuravlev added a comment

            Yes, when we went to the RHEL6.3 kernel on Sequoia last Thursday, we lost CONFIG_DEBUG_SPINLOCK. Apparently the kernel would no longer operate with that enabled, as we've seen with other debug settings.

            morrone Christopher Morrone (Inactive) added a comment

            Unfortunately, CONFIG_DEBUG_SPINLOCK was dropped when we upgraded the kernel a couple of weeks ago. I need to talk to the admins about why they dropped it and whether it can be enabled again. I'm also interested in finding out who the owner is.

            prakash Prakash Surya (Inactive) added a comment

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 8
