Details
- Type: Bug
- Resolution: Duplicate
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.4.1
- Labels: None
- Environment: EL6
- Severity: 3
- Rank: 16763
Description
We have seen this issue twice in the last week as the load on one of our Lustre filesystems has increased. In short: we have four Lustre clients exporting the filesystem over NFS (round-robin). The MDS appears to get stuck on a lock, and the clients then drop off one by one, which sends I/O errors to the NFS clients.
Dec 10 16:00:02 cmds1 kernel: LNet: Service thread pid 22181 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Dec 10 16:00:02 cmds1 kernel: Pid: 22181, comm: mdt00_026
Dec 10 16:00:02 cmds1 kernel:
Dec 10 16:00:02 cmds1 kernel: Call Trace:
Dec 10 16:00:02 cmds1 kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
Dec 10 16:00:02 cmds1 kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa03fe6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Dec 10 16:00:02 cmds1 kernel: Lustre: lock timed out (enqueued at 1418227002, 200s ago)
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06a601d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06a1790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Dec 10 16:00:02 cmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06a5758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Dec 10 16:00:02 cmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1418227202.22181
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06a5b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0d93a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0d99c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0d93a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06a5b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0d9a4f4>] mdt_object_lock+0x14/0x20 [mdt]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0dc24f8>] mdt_object_open_lock+0x1c8/0x510 [mdt]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0da6bfa>] ? mdt_attr_get_complex+0x38a/0x770 [mdt]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0dc76b3>] mdt_open_by_fid_lock+0x443/0x7d0 [mdt]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0dc829b>] mdt_reint_open+0x56b/0x20c0 [mdt]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa041a82e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06cedcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0565f50>] ? lu_ucred+0x20/0x30 [obdclass]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0d93015>] ? mdt_ucred+0x15/0x20 [mdt]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0daf15c>] ? mdt_root_squash+0x2c/0x410 [mdt]
Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06f6646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
..
..
Dec 10 16:07:58 cmds1 kernel:
Dec 10 16:07:58 cmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1418227678.7488
Dec 10 16:11:13 cmds1 kernel: Lustre: 22178:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-271), not sending early reply
Dec 10 16:11:13 cmds1 kernel: req@ffff882fd6dfd000 x1486501717804232/t0(0) o101->4540220b-aff6-ec4a-3862-59a2488b6775@10.21.22.27@tcp:0/0 lens 576/3448 e 5 to 0 dl 1418227878 ref 2 fl Interpret:/0/0 rc 0/0
Dec 10 16:11:44 cmds1 kernel: Lustre: 9532:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-271), not sending early reply
Dec 10 16:11:44 cmds1 kernel: req@ffff88427dd19400 x1486495779658676/t0(0) o101->dce9c151-cc13-2ac2-7911-8ded672c09d6@10.21.22.29@tcp:0/0 lens 608/3448 e 5 to 0 dl 1418227909 ref 2 fl Interpret:/0/0 rc 0/0
Dec 10 16:12:02 cmds1 kernel: Lustre: 7487:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-260), not sending early reply
Dec 10 16:12:02 cmds1 kernel: req@ffff88026471d800 x1486501720068016/t0(0) o101->4540220b-aff6-ec4a-3862-59a2488b6775@10.21.22.27@tcp:0/0 lens 576/3448 e 3 to 0 dl 1418227927 ref 2 fl Interpret:/0/0 rc 0/0
Dec 10 16:12:34 cmds1 kernel: Lustre: 6826:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-134), not sending early reply
Dec 10 16:12:34 cmds1 kernel: req@ffff8822e2c0d000 x1486501725457272/t0(0) o101->4540220b-aff6-ec4a-3862-59a2488b6775@10.21.22.27@tcp:0/0 lens 608/3448 e 1 to 0 dl 1418227959 ref 2 fl Interpret:/0/0 rc 0/0
Dec 10 16:13:17 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) reconnecting
Dec 10 16:13:17 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) refused reconnection, still busy with 6 active RPCs
Dec 10 16:13:42 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) reconnecting
Dec 10 16:13:42 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) refused reconnection, still busy with 3 active RPCs
Dec 10 16:13:48 cmds1 kernel: Lustre: charlie-MDT0000: Client dce9c151-cc13-2ac2-7911-8ded672c09d6 (at 10.21.22.29@tcp) reconnecting
Dec 10 16:13:48 cmds1 kernel: Lustre: charlie-MDT0000: Client dce9c151-cc13-2ac2-7911-8ded672c09d6 (at 10.21.22.29@tcp) refused reconnection, still busy with 2 active RPCs
Dec 10 16:13:53 cmds1 kernel: Lustre: 22166:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-431), not sending early reply
Dec 10 16:13:53 cmds1 kernel: req@ffff882fe3e6a000 x1486492815968152/t0(0) o101->b9fd8649-3e4d-fb11-ba42-c2526b8f651f@10.21.22.28@tcp:0/0 lens 576/1152 e 4 to 0 dl 1418228038 ref 2 fl Interpret:/0/0 rc 0/0
Dec 10 16:14:07 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) reconnecting
Dec 10 16:14:07 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) refused reconnection, still busy with 3 active RPCs
Dec 10 16:14:13 cmds1 kernel: Lustre: charlie-MDT0000: Client dce9c151-cc13-2ac2-7911-8ded672c09d6 (at 10.21.22.29@tcp) reconnecting
Dec 10 16:14:13 cmds1 kernel: Lustre: charlie-MDT0000: Client dce9c151-cc13-2ac2-7911-8ded672c09d6 (at 10.21.22.29@tcp) refused reconnection, still busy with 2 active RPCs
Dec 10 16:14:32 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) reconnecting
Dec 10 16:14:32 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) refused reconnection, still busy with 3 active RPCs
Dec 10 16:14:57 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) reconnecting
Dec 10 16:14:57 cmds1 kernel: Lustre: Skipped 1 previous similar message
Dec 10 16:14:57 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) refused reconnection, still busy with 3 active RPCs
Dec 10 16:14:57 cmds1 kernel: Lustre: Skipped 1 previous similar message
Both times, the only way to return to a stable state was to reboot the MDS and abort recovery. Logs attached.
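For reference, aborting recovery on the MDS is done by remounting the MDT with the abort_recov mount option after the reboot. This is only a sketch of the procedure; the device path and mount point below are placeholders, not taken from this system's configuration.

```shell
# After rebooting the MDS, remount the MDT with recovery aborted
# so the server does not wait for the stuck clients to replay.
# /dev/mdt_device and /mnt/mdt are hypothetical placeholders for
# the actual MDT block device and mount point.
mount -t lustre -o abort_recov /dev/mdt_device /mnt/mdt
```

Note that aborting recovery causes clients to be evicted and any uncommitted client transactions to be lost, so it is a last resort, as described above.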
Attachments
Issue Links
- is related to: LU-4152 layout locks can cause deadlock (Resolved)