Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.10.7
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
The symptom is that clients cannot access Lustre filesystem data. We are seeing timeouts in the client logs, e.g.:
Jun 27 06:58:10 vanlustre3 kernel: Lustre: 86713:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561643289/real 1561643289] req@ffff9bffcbd3b900 x1637338661347344/t0(0) o36->echo-MDT0000-mdc-ffff9c2b2b775000@10.23.22.104@tcp:12/10 lens 880/856 e 24 to 1 dl 1561643890 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
Jun 27 06:58:10 vanlustre3 kernel: Lustre: 86713:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Jun 27 06:58:10 vanlustre3 kernel: Lustre: echo-MDT0000-mdc-ffff9c2b2b775000: Connection to echo-MDT0000 (at 10.23.22.104@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jun 27 06:58:10 vanlustre3 kernel: Lustre: echo-MDT0000-mdc-ffff9c2b2b775000: Connection restored to 10.23.22.104@tcp (at 10.23.22.104@tcp)
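To put a number on how long the client waited before giving up, the epoch values in the first message can be subtracted. The following is a minimal sketch, not a Lustre tool: `request_wait_seconds` is a hypothetical helper, and it assumes that `sent` is the send time and `dl` is the request deadline, both in seconds since the epoch.

```python
import re

# Assumption: "[sent N/real N]" carries the send time and "dl N" the request
# deadline, both as epoch seconds.
SENT_DL = re.compile(r"\[sent (\d+)/real (\d+)\].*?\bdl (\d+)\b")


def request_wait_seconds(line):
    """Return deadline minus send time for one log line, or None if absent."""
    match = SENT_DL.search(line)
    if match is None:
        return None
    sent, _real, deadline = (int(group) for group in match.groups())
    return deadline - sent


# Body of the first client message quoted above.
sample = (
    "Request sent has timed out for slow reply: "
    "[sent 1561643289/real 1561643289] req@ffff9bffcbd3b900 "
    "x1637338661347344/t0(0) "
    "o36->echo-MDT0000-mdc-ffff9c2b2b775000@10.23.22.104@tcp:12/10 "
    "lens 880/856 e 24 to 1 dl 1561643890 ref 2 fl Rpc:X/2/ffffffff rc -11/-1"
)

print(request_wait_seconds(sample))
```

Under that reading, the request in the sample line had a window of about ten minutes between being sent and reaching its deadline.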
On the MDS we see:
Jun 27 06:55:08 emds1 kernel: LustreError: 27539:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1561643408, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-echo-MDT0000_UUID lock: ffff88522ed35800/0x7d046634332b9f1e lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 8 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 27539 timeout: 0 lvb_type: 0
Jun 27 07:00:04 emds1 kernel: Lustre: 27723:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (4/4), not sending early reply#012 req@ffff88231ac4f500 x1637338661348352/t0(0) o36->e46f0dd3-8775-ce8c-a09f-d393cecffa21@10.23.22.113@tcp:498/0 lens 928/3128 e 1 to 0 dl 1561644008 ref 2 fl Interpret:/0/0 rc 0/0
Jun 27 07:00:37 emds1 kernel: Lustre: 49916:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (118:3300s); client may timeout. req@ffff8823586ce900 x1637332991328208/t910888684941(0) o36->e46f0dd3-8775-ce8c-a09f-d393cecffa21@10.23.22.113@tcp:247/0 lens 680/424 e 3 to 0 dl 1561640737 ref 1 fl Complete:/0/0 rc 0/0
Jun 27 07:00:37 emds1 kernel: LNet: Service thread pid 49916 completed after 3417.78s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
On the MDS we also see an ldlm_bl_01 (or ldlm_bl_02) thread pinned at 100% of a CPU core for extended periods (over an hour). It recovers for a few minutes, then maxes out a CPU core again.
What might be causing this?