[LU-2597] Soft lockup messages on Sequoia in ptlrpc_check_set under ll_fsync Created: 09/Jan/13 Updated: 19/Feb/13 Resolved: 19/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Prakash Surya (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB, sequoia, topsequoia |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6051 |
| Description |
|
This looks similar to

When running a large IOR using our 2.3.58-3chaos tag, we saw many CPU Soft Lockup messages on the consoles of many clients. All of the stacks I have been able to uncover look like this:

NIP [c000000000439724] ._spin_lock+0x2c/0x44
LR [8000000003a8e81c] .ptlrpc_check_set+0x43cc/0x5120 [ptlrpc]
Call Trace:
[c0000003ec7bf600] [8000000003a8e75c] .ptlrpc_check_set+0x430c/0x5120 [ptlrpc] (unreliable)
[c0000003ec7bf7b0] [8000000003a8fa5c] .ptlrpc_set_wait+0x4ec/0xcb0 [ptlrpc]
[c0000003ec7bf920] [8000000003a90974] .ptlrpc_queue_wait+0xd4/0x380 [ptlrpc]
[c0000003ec7bf9e0] [80000000059c3564] .mdc_sync+0x104/0x340 [mdc]
[c0000003ec7bfa90] [800000000709bf08] .lmv_sync+0x2c8/0x820 [lmv]
[c0000003ec7bfb80] [80000000068232dc] .ll_fsync+0x23c/0xc50 [lustre]
[c0000003ec7bfc80] [c000000000100420] .vfs_fsync_range+0xb0/0x104
[c0000003ec7bfd30] [c000000000100518] .do_fsync+0x3c/0x6c
[c0000003ec7bfdc0] [c000000000100588] .SyS_fsync+0x18/0x28
[c0000003ec7bfe30] [c000000000000580] syscall_exit+0x0/0x2c

Searching through all of the console logs for that day, I found three unique addresses within ptlrpc_check_set listed:

$ grep -E "LR.*ptlrpc_check_set" R* | grep 2013-01-08 | \
    sed 's/.*\(ptlrpc_check_set+[a-z0-9]\+\).*/\1/' | \
    sort | uniq
ptlrpc_check_set+0x43cc
ptlrpc_check_set+0x50c
ptlrpc_check_set+0xeec

These addresses are a bit confusing to me; they aren't near any spin_lock calls as I expected. Rather, they seem to correlate to debug print statements.

ptlrpc_check_set+0x43cc:

(gdb) l *ptlrpc_check_set+0x43cc
0x4e81c is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.58/lustre/ptlrpc/client.c:1113).
1109    static int ptlrpc_check_status(struct ptlrpc_request *req)
1110    {
1111            int err;
1112            ENTRY;
1113
1114            err = lustre_msg_get_status(req->rq_repmsg);
1115            if (lustre_msg_get_type(req->rq_repmsg) == PTL_RPC_MSG_ERR) {
1116                    struct obd_import *imp = req->rq_import;
1117                    __u32 opc = lustre_msg_get_opc(req->rq_reqmsg);

ptlrpc_check_set+0x50c:

(gdb) l *ptlrpc_check_set+0x50c
0x4a95c is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.58/lustre/ptlrpc/client.c:1774).
1771                    ptlrpc_rqphase_move(req, RQ_PHASE_COMPLETE);
1772
1773                    CDEBUG(req->rq_reqmsg != NULL ? D_RPCTRACE : 0,
1774                           "Completed RPC pname:cluuid:pid:xid:nid:"
1775                           "opc %s:%s:%d:"LPU64":%s:%d\n",
1776                           cfs_curproc_comm(), imp->imp_obd->obd_uuid.uuid,
1777                           lustre_msg_get_status(req->rq_reqmsg), req->rq_xid,

Any ideas? I'm working on getting CONFIG_DEBUG_SPINLOCK and CONFIG_LOCK_STAT enabled on the IONs. I'm hopeful that will work and will allow me to determine which spin lock the threads are contending on. |
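For reference, a rough sketch of how the contended lock could be read out once CONFIG_LOCK_STAT is enabled. This uses the kernel's standard lockstat interface (/proc/lock_stat, toggled via /proc/sys/kernel/lock_stat); the exact workflow on the IONs is an assumption, and no output from these nodes is shown here:

# echo 1 > /proc/sys/kernel/lock_stat      (enable collection, as root)
  ... reproduce the IOR / ll_fsync workload ...
# cat /proc/lock_stat                      (per-lock contention counts and the call sites that hit them)
# echo 0 > /proc/lock_stat                 (clear the counters between runs)
# echo 0 > /proc/sys/kernel/lock_stat      (disable collection when finished)

The lock_stat output names each lock by symbol together with its contention points, which should make it clear which of the ptlrpc spin locks (for example the import's imp_lock or the per-request rq_lock) the threads are piling up on.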
| Comments |
| Comment by Peter Jones [ 10/Jan/13 ] |
|
Alex, any ideas? Peter |
| Comment by Alex Zhuravlev [ 15/Jan/13 ] |
|
Hmm, but the first trace is pretty specific about spin_lock()?
| Comment by Prakash Surya (Inactive) [ 15/Jan/13 ] |
|
Hence the confusion: all of the stacks I looked at ended in a spin_lock call, but I don't immediately see that when looking at the source lines gdb gives me.
| Comment by Alex Zhuravlev [ 15/Jan/13 ] |
|
Well, instructions can be reordered or removed due to compiler optimizations.
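One way to cross-check that (a sketch only; the ptlrpc.ko path is a placeholder and the offsets are the ones from the trace above, not re-verified here) is to disassemble around the reported return address rather than relying on the line table:

(gdb) file /path/to/ptlrpc.ko
(gdb) info line *ptlrpc_check_set+0x43cc
(gdb) disassemble ptlrpc_check_set+0x43a0, ptlrpc_check_set+0x43d0

Since +0x43cc is the LR (return address), the instruction of interest is the call immediately before it. If that is a bl to ._spin_lock, the NIP/LR pair is consistent with spinning in a lock call, and the client.c lines reported by "l" are just whatever the optimizer attributed to that return address.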
| Comment by Andreas Dilger [ 15/Jan/13 ] |
|
This does appear to be related to |
| Comment by nasf (Inactive) [ 23/Jan/13 ] |
|
How many CPUs does the client have? Is it the same configuration as
| Comment by Prakash Surya (Inactive) [ 23/Jan/13 ] |
|
Yes, the nodes have the same hardware and software configuration as
| Comment by nasf (Inactive) [ 16/Feb/13 ] |
|
Are we still seeing these issues after
| Comment by Prakash Surya (Inactive) [ 19/Feb/13 ] |
|
Not that I can recall; I'm OK with closing it. It can always be reopened if it comes up again in the future. Chris, do you agree?
| Comment by Christopher Morrone [ 19/Feb/13 ] |
|
I am fine with closing this. |
| Comment by Jodi Levi (Inactive) [ 19/Feb/13 ] |
|
Closing per comments in ticket. |