Details
- Bug
- Resolution: Fixed
- Critical
- Lustre 2.4.1
- 3
- 13959
Description
On our BG/Q client nodes (ppc64), we frequently see soft lockup messages like so:
2014-05-12 17:24:44.936384 {RMP08Ma185908506} [mmcs]{0}.0.1: kernel:BUG: soft lockup - CPU#32 stuck for 67s! [ptlrpcd_0:3202]
2014-05-12 17:24:45.140774 {RMP08Ma185908506} [mmcs]{0}.16.1: BUG: soft lockup - CPU#65 stuck for 67s! [ptlrpcd_1:3203]
2014-05-12 17:24:45.141140 {RMP08Ma185908506} [mmcs]{0}.16.1: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
2014-05-12 17:24:45.141506 {RMP08Ma185908506} [mmcs]{0}.16.1: NIP: 80000000003715bc LR: 80000000003717cc CTR: 000000000000007d
2014-05-12 17:24:45.141979 {RMP08Ma185908506} [mmcs]{0}.16.1: REGS: c0000003eb43ab40 TRAP: 0901 Not tainted (2.6.32-358.11.1.bgq.4blueos.V1R2M1.bl2.1_0.ppc64)
2014-05-12 17:24:45.142274 {RMP08Ma185908506} [mmcs]{0}.16.1: MSR: 0000000080029000 <EE,ME,CE> CR: 44224424 XER: 00000000
2014-05-12 17:24:45.142575 {RMP08Ma185908506} [mmcs]{0}.16.1: TASK = c0000003ecc14060[3203] 'ptlrpcd_1' THREAD: c0000003eb438000 CPU: 65
2014-05-12 17:24:45.142979 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR00: 000000000003dfb8 c0000003eb43adc0 80000000003b5550 00000000000015b0
2014-05-12 17:24:45.143352 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR04: c00000031d82c330 0000000000000000 0000000000000000 0000000000000053
2014-05-12 17:24:45.143728 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR08: 0000000000000048 c00000031d82d110 0000000000000071 000000001eaffaf3
2014-05-12 17:24:45.144048 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR12: 0000000000002720 c0000000007f7200
2014-05-12 17:24:45.144424 {RMP08Ma185908506} [mmcs]{0}.16.1: NIP [80000000003715bc] .__adler32+0x9c/0x220 [libcfs]
2014-05-12 17:24:45.144861 {RMP08Ma185908506} [mmcs]{0}.16.1: LR [80000000003717cc] .adler32_update+0x1c/0x40 [libcfs]
2014-05-12 17:24:45.145111 {RMP08Ma185908506} [mmcs]{0}.16.1: Call Trace:
2014-05-12 17:24:45.145442 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43adc0] [c0000003eb43ae50] 0xc0000003eb43ae50 (unreliable)
2014-05-12 17:24:45.145800 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43ae40] [c0000000002041c4] .crypto_shash_update+0x4c/0x60
2014-05-12 17:24:45.146124 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43aeb0] [c000000000204218] .shash_compat_update+0x40/0x80
2014-05-12 17:24:45.146441 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43af70] [800000000037072c] .cfs_crypto_hash_update_page+0x8c/0xb0 [libcfs]
2014-05-12 17:24:45.146758 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b030] [8000000001284f94] .osc_checksum_bulk+0x1c4/0x890 [osc]
2014-05-12 17:24:45.147160 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b180] [80000000012863c0] .osc_brw_prep_request+0xd60/0x1d80 [osc]
2014-05-12 17:24:45.147512 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b340] [800000000129931c] .osc_build_rpc+0xc8c/0x2320 [osc]
2014-05-12 17:24:45.147814 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b4f0] [80000000012bf304] .osc_send_write_rpc+0x594/0xb40 [osc]
2014-05-12 17:24:45.148150 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b6d0] [80000000012bff84] .osc_check_rpcs+0x6d4/0x1770 [osc]
2014-05-12 17:24:45.148447 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b900] [80000000012c132c] .osc_io_unplug0+0x30c/0x6b0 [osc]
2014-05-12 17:24:45.148812 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43ba20] [800000000129b174] .brw_interpret+0x7c4/0x1940 [osc]
2014-05-12 17:24:45.149257 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bb60] [8000000000edb6b4] .ptlrpc_check_set+0x3a4/0x4bb0 [ptlrpc]
2014-05-12 17:24:45.149659 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bd20] [8000000000f2a80c] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
2014-05-12 17:24:45.149960 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43be40] [8000000000f2acc8] .ptlrpcd+0x2b8/0x4c0 [ptlrpc]
2014-05-12 17:24:45.150341 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bf90] [c00000000001b9a8] .kernel_thread+0x54/0x70
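For reference, the hot loop in the backtrace is the byte-at-a-time Adler-32 update reached via cfs_crypto_hash_update_page() from osc_checksum_bulk(). The following is a minimal standalone userspace sketch of that algorithm (my own illustration of the standard zlib-style Adler-32, not the libcfs/kernel code), just to show how much pure CPU work the checksum does per bulk RPC; the 256 x 4 KiB buffer is a made-up stand-in for roughly 1 MB of bulk data.

/*
 * Sketch only: standard Adler-32, NOT the libcfs implementation.
 * Buffer sizes below are assumptions chosen to mimic one bulk RPC.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MOD_ADLER 65521u        /* largest prime below 2^16 */

static uint32_t adler32_update(uint32_t adler, const unsigned char *buf,
                               size_t len)
{
        uint32_t a = adler & 0xffff;
        uint32_t b = (adler >> 16) & 0xffff;

        /* Byte-at-a-time hot loop; real implementations batch the
         * modulo, but the per-byte work is the same order. */
        for (size_t i = 0; i < len; i++) {
                a = (a + buf[i]) % MOD_ADLER;
                b = (b + a) % MOD_ADLER;
        }
        return (b << 16) | a;
}

int main(void)
{
        size_t npages = 256, pagesize = 4096;   /* ~1 MB of "bulk" data */
        unsigned char *page = malloc(pagesize);
        uint32_t sum = 1;                       /* Adler-32 starts at 1 */

        if (page == NULL)
                return 1;
        memset(page, 0xab, pagesize);

        /* Checksum page by page, analogous to how the bulk checksum
         * walks the pages of a bulk write request. */
        for (size_t p = 0; p < npages; p++)
                sum = adler32_update(sum, page, pagesize);

        printf("adler32 over %zu bytes = 0x%08x\n", npages * pagesize, sum);
        free(page);
        return 0;
}

Since the bulk checksum runs in ptlrpcd context for every page of every queued bulk RPC, this kind of per-byte loop on these ppc64 cores would presumably be enough to keep ptlrpcd on the CPU past the soft lockup threshold when a burst of large writes is being sent.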
I have also noticed, watching CPU usage in top, that about two CPUs are pegged at exactly 100%, all in the sys column. The remaining CPUs are split more evenly across the usage columns: a little in user, most in sys and wa.
The CPUs with the 100% usage seem to correspond to the ptlrpcd processes that trigger the soft lockup warning, which I suppose is not terribly surprising.
We have seen this happening for well over a year, but are only now getting around to looking at it more closely due to other recent work on the BG/Q clients.
Attachments
Issue Links
- is related to: LU-5279 kernel BUG at arch/x86/mm/physaddr.c:20! (Resolved)