Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.4.0
-
3
-
5413
Description
The IO nodes on Sequoia are going unresponsive with many CPU soft lockup messages in the logs. For example, a couple messages from a node in this state are as follows:
2012-11-01 23:44:47.681648 {DefaultControlEventListener} [mmcs]{161}.0.2: BUG: soft lockup - CPU#2 stuck for 67s! [ptlrpcd_56:3326] 2012-11-01 23:44:47.720983 {DefaultControlEventListener} [mmcs]{161}.0.2: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm 2012-11-01 23:44:47.760955 {DefaultControlEventListener} [mmcs]{161}.0.2: NIP: c00000000042e194 LR: 8000000003a674bc CTR: c00000000042e160 2012-11-01 23:44:47.801187 {DefaultControlEventListener} [mmcs]{161}.0.2: REGS: c0000003ca2e7900 TRAP: 0901 Not tainted (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64) 2012-11-01 23:44:47.841287 {DefaultControlEventListener} [mmcs]{161}.0.2: MSR: 0000000080029000 <EE,ME,CE> CR: 84284444 XER: 20000000 2012-11-01 23:44:47.880979 {DefaultControlEventListener} [mmcs]{161}.0.2: TASK = c0000003ca1c7980[3326] 'ptlrpcd_56' THREAD: c0000003ca2e4000 CPU: 2 2012-11-01 23:44:47.921379 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR00: 0000000080000038 c0000003ca2e7b80 c0000000006de510 c0000003ceca6278 2012-11-01 23:44:47.960900 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000 2012-11-01 23:44:48.000948 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR08: 0000000000200200 0000000080000002 c0000003cefddc68 c00000000042e160 2012-11-01 23:44:48.105877 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR12: 8000000003aebdb8 c000000000743f00 2012-11-01 23:44:48.106188 {DefaultControlEventListener} [mmcs]{161}.0.2: NIP [c00000000042e194] ._spin_lock+0x34/0x44 2012-11-01 23:44:48.121006 {DefaultControlEventListener} [mmcs]{161}.0.2: LR [8000000003a674bc] .ptlrpc_check_set+0xeac/0x4e80 [ptlrpc] 2012-11-01 23:44:48.161100 {DefaultControlEventListener} [mmcs]{161}.0.2: Call Trace: 2012-11-01 23:44:48.201258 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca2e7b80] [8000000003a674ac] .ptlrpc_check_set+0xe9c/0x4e80 [ptlrpc] (unreliable) 2012-11-01 23:44:48.240907 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca2e7d20] [8000000003abd1dc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc] 2012-11-01 23:44:48.280969 {DefaultControlEventListener} [mmcs]{161}.0.2:
The line it's stuck on:
(gdb) l *ptlrpc_check_set+0xeac 0x474bc is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/client.c:1666).
And the source level info:
1663 if (!ptlrpc_unregister_reply(req, 1)) 1664 continue; 1665 1666 cfs_spin_lock(&imp->imp_lock); 1667 if (ptlrpc_import_delay_req(imp, req, &status)){ 1668 /* put on delay list - only if we wait 1669 * recovery finished - before send */
Details of the next message on the console:
BUG: soft lockup - CPU#16 stuck for 67s! [ptlrpcd_rcv:3269] 2012-11-01 23:44:48.320953 {DefaultControlEventListener} [mmcs]{161}.0.2: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm 2012-11-01 23:44:48.361045 {DefaultControlEventListener} [mmcs]{161}.0.2: NIP: c00000000042e18c LR: 8000000003a66b04 CTR: c00000000042e160 2012-11-01 23:44:48.400942 {DefaultControlEventListener} [mmcs]{161}.0.2: REGS: c0000003ca203900 TRAP: 0901 Not tainted (2.6.32-220.23.3.bgq.13llnl.V1R1M2.bgq62_16.ppc64) 2012-11-01 23:44:48.440963 {DefaultControlEventListener} [mmcs]{161}.0.2: MSR: 0000000080029000 <EE,ME,CE> CR: 84228484 XER: 20000000 2012-11-01 23:44:48.480905 {DefaultControlEventListener} [mmcs]{161}.0.2: TASK = c0000003c5d432a0[3269] 'ptlrpcd_rcv' THREAD: c0000003ca200000 CPU: 16 2012-11-01 23:44:48.520898 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR00: 0000000080000038 c0000003ca203b80 c0000000006de510 c0000003ceca6278 2012-11-01 23:44:48.560925 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000 2012-11-01 23:44:48.600920 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR08: c0000003ceca6260 0000000080000010 c0000003ce673420 c00000000042e160 2012-11-01 23:44:48.640881 {DefaultControlEventListener} [mmcs]{161}.0.2: GPR12: 8000000003aebdb8 c00000000074f500 2012-11-01 23:44:48.680904 {DefaultControlEventListener} [mmcs]{161}.0.2: NIP [c00000000042e18c] ._spin_lock+0x2c/0x44 2012-11-01 23:44:48.720885 {DefaultControlEventListener} [mmcs]{161}.0.2: LR [8000000003a66b04] .ptlrpc_check_set+0x4f4/0x4e80 [ptlrpc] 2012-11-01 23:44:48.761372 {DefaultControlEventListener} [mmcs]{161}.0.2: Call Trace: 2012-11-01 23:44:48.800923 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca203b80] [8000000003a66974] .ptlrpc_check_set+0x364/0x4e80 [ptlrpc] (unreliable) 2012-11-01 23:44:48.840892 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca203d20] [8000000003abd1dc] .ptlrpcd_check+0x66c/0x8a0 [ptlrpc] 2012-11-01 23:44:48.880936 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca203e40] [8000000003abd76c] .ptlrpcd+0x35c/0x510 [ptlrpc] 2012-11-01 23:44:48.920887 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca203f90] [c00000000001a9e0] .kernel_thread+0x54/0x70 2012-11-01 23:44:48.961113 {DefaultControlEventListener} [mmcs]{161}.0.2: Instruction dump: 2012-11-01 23:44:49.000886 {DefaultControlEventListener} [mmcs]{161}.0.2: 4bffffc8 38000000 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 2012-11-01 23:44:49.040902 {DefaultControlEventListener} [mmcs]{161}.0.2: 40c2fff0 4c00012c 2fa00000 4dfe0020 <7c210b78> 80030000 2fa00000 40defff4 2012-11-01 23:44:49.121348 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca2e7e40] [8000000003abd76c] .ptlrpcd+0x35c/0x510 [ptlrpc] 2012-11-01 23:44:49.160915 {DefaultControlEventListener} [mmcs]{161}.0.2: [c0000003ca2e7f90] [c00000000001a9e0] .kernel_thread+0x54/0x70 2012-11-01 23:44:49.201038 {DefaultControlEventListener} [mmcs]{161}.0.2: Instruction dump: 2012-11-01 23:44:49.241229 {DefaultControlEventListener} [mmcs]{161}.0.2: 980d0c94 812d0000 7c001829 2c000000 40c20010 7d20192d 40c2fff0 4c00012c 2012-11-01 23:44:49.281578 {DefaultControlEventListener} [mmcs]{161}.0.2: 2fa00000 4dfe0020 7c210b78 80030000 <2fa00000> 40defff4 7c421378 4bffffc8
The line it's stuck on:
(gdb) l *ptlrpc_check_set+0x4f4 0x46b04 is in ptlrpc_check_set (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/client.c:1852).
And the source level info:
1849 >------->------->-------libcfs_nid2str(imp->imp_connection->c_peer.nid), 1850 >------->------->-------lustre_msg_get_opc(req->rq_reqmsg)); 1851 1852 cfs_spin_lock(&imp->imp_lock); 1853 /* Request already may be not on sending or delaying list. This 1854 * may happen in the case of marking it erroneous for the case 1855 * ptlrpc_import_delay_req(req, status) find it impossible to
The symptoms look very similar to what we saw on the old Orion code base, which boiled down to "too many" ptlrpc threads contending with each other while interrupts were disabled. To get past that previous issue we pulled in a patch to limit the number of threads created, thus reducing the contention on the lock in question.