Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Affects Version/s: None
- Fix Version/s: Lustre 2.4.0
- Severity: 3
- Rank: 6051
Description
This looks similar to LU-2263, but looking at the stacks and function address listings I'm not certain they are the same just yet.
When running a large IOR job using our 2.3.58-3chaos tag, we saw many CPU soft-lockup messages on the consoles of many clients. All of the stacks I have been able to uncover look like this:
NIP [c000000000439724] ._spin_lock+0x2c/0x44
LR [8000000003a8e81c] .ptlrpc_check_set+0x43cc/0x5120 [ptlrpc]
Call Trace:
[c0000003ec7bf600] [8000000003a8e75c] .ptlrpc_check_set+0x430c/0x5120 [ptlrpc] (unreliable)
[c0000003ec7bf7b0] [8000000003a8fa5c] .ptlrpc_set_wait+0x4ec/0xcb0 [ptlrpc]
[c0000003ec7bf920] [8000000003a90974] .ptlrpc_queue_wait+0xd4/0x380 [ptlrpc]
[c0000003ec7bf9e0] [80000000059c3564] .mdc_sync+0x104/0x340 [mdc]
[c0000003ec7bfa90] [800000000709bf08] .lmv_sync+0x2c8/0x820 [lmv]
[c0000003ec7bfb80] [80000000068232dc] .ll_fsync+0x23c/0xc50 [lustre]
[c0000003ec7bfc80] [c000000000100420] .vfs_fsync_range+0xb0/0x104
[c0000003ec7bfd30] [c000000000100518] .do_fsync+0x3c/0x6c
[c0000003ec7bfdc0] [c000000000100588] .SyS_fsync+0x18/0x28
[c0000003ec7bfe30] [c000000000000580] syscall_exit+0x0/0x2c
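For context, the workload is just a large parallel write with fsync: IOR's -e option fsyncs after writes, which is what drives the .ll_fsync path above. A hedged sketch of that kind of invocation (launcher, scale, and paths here are hypothetical; our actual command line isn't preserved in this report):
$ mpirun -np 8192 ior -a POSIX -w -e -F -t 1m -b 4g -o /p/lscratch/ior.dat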
Searching through all of the console logs for that day, I found three unique addresses within ptlrpc_check_set listed:
$ grep -E "LR.*ptlrpc_check_set" R* | grep 2013-01-08 | \
  sed 's/.*\(ptlrpc_check_set+[a-z0-9]\+\).*/\1/' | \
  sort | uniq
ptlrpc_check_set+0x43cc
ptlrpc_check_set+0x50c
ptlrpc_check_set+0xeec
Listing these addresses is a bit confusing to me: they aren't near any spin_lock calls as I expected. Rather, they seem to correlate with debug print statements. (It may just be that, since LR holds the return address, gdb lists the line the spin_lock call returns to rather than the call itself, and inlining may skew the mapping further.)
(gdb) l *ptlrpc_check_set+0x43cc
0x4e81c is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.58/lustre/ptlrpc/client.c:1113).
1109    static int ptlrpc_check_status(struct ptlrpc_request *req)
1110    {
1111            int err;
1112            ENTRY;
1113
1114            err = lustre_msg_get_status(req->rq_repmsg);
1115            if (lustre_msg_get_type(req->rq_repmsg) == PTL_RPC_MSG_ERR) {
1116                    struct obd_import *imp = req->rq_import;
1117                    __u32 opc = lustre_msg_get_opc(req->rq_reqmsg);
(gdb) l *ptlrpc_check_set+0x50c
0x4a95c is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.58/lustre/ptlrpc/client.c:1774).
1771            ptlrpc_rqphase_move(req, RQ_PHASE_COMPLETE);
1772
1773            CDEBUG(req->rq_reqmsg != NULL ? D_RPCTRACE : 0,
1774                   "Completed RPC pname:cluuid:pid:xid:nid:"
1775                   "opc %s:%s:%d:"LPU64":%s:%d\n",
1776                   cfs_curproc_comm(), imp->imp_obd->obd_uuid.uuid,
1777                   lustre_msg_get_status(req->rq_reqmsg), req->rq_xid,
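For reference, the same lookup for the third address can be scripted against the unstripped module rather than run interactively; a sketch, assuming the matching debuginfo package is installed (the path below is a guess at where the build puts it):
$ gdb -batch -ex 'list *ptlrpc_check_set+0xeec' \
    /usr/lib/debug/lib/modules/$(uname -r)/extra/ptlrpc.ko.debug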
Any ideas?
I'm working on getting CONFIG_DEBUG_SPINLOCK and CONFIG_LOCK_STAT enabled on the IONs. I'm hopeful that will allow me to determine which spin lock the threads are contending on.
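Once those options are in, the plan would be to read the contention counts out of /proc/lock_stat; a sketch of the procedure (the imp_lock grep pattern is my guess at the culprit, per LU-2263):
$ echo 1 > /proc/sys/kernel/lock_stat    # enable collection
  ... reproduce with IOR ...
$ grep -B1 -A5 imp_lock /proc/lock_stat  # inspect the suspected lock class
$ echo 0 > /proc/sys/kernel/lock_stat    # stop collection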
Attachments
Issue Links
- is related to: LU-2263 CPU Soft Lockups due to many threads spinning on import lock on Sequoia IO nodes (Resolved)