Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.10.0
-
Soak test cluster, version=lustre: 2.9.51_45_g3b3eeeb - tip of master
-
3
-
9223372036854775807
Description
Failover of lola-11 to lola-10
2017-01-24 16:44:56,220:fsmgmt.fsmgmt:INFO Mounting soaked-MDT0003 on lola-10 ...
Immediately prior to the failover mount, lola-10 reports hung processes.
Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304863/real 1485304863] req@ffff8804131dd680 x1557446666855008/t0(0) o38->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304880 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message Jan 24 16:41:35 lola-10 kernel: LNet: Service thread pid 6159 was inactive for 226.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jan 24 16:41:35 lola-10 kernel: Pid: 6159, comm: mdt00_005 Jan 24 16:41:35 lola-10 kernel: Jan 24 16:41:35 lola-10 kernel: Call Trace: Jan 24 16:41:35 lola-10 kernel: [<ffffffff8105e4e3>] ? __wake_up+0x53/0x70 Jan 24 16:41:35 lola-10 kernel: [<ffffffffa0a33bb2>] top_trans_stop+0x782/0xbd0 [ptlrpc] Jan 24 16:41:35 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20 Jan 24 16:41:35 lola-10 kernel: [<ffffffffa11b4aac>] lod_trans_stop+0x2bc/0x330 [lod] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1259f31>] mdd_trans_stop+0x21/0xc6 [mdd] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1246e63>] mdd_create+0x1373/0x17e0 [mdd] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1105c44>] ? mdt_version_save+0x84/0x1a0 [mdt] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa110fcfc>] mdt_reint_create+0xbdc/0xfe0 [mdt] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa11013bc>] ? mdt_root_squash+0x2c/0x3f0 [mdt] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09bb2bb>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc] Jan 24 16:41:35 lola-10 kernel: [<ffffffff81299b7a>] ? strlcpy+0x4a/0x60 Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1102a2a>] ? old_init_ucred_common+0xda/0x2b0 [mdt] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1104f5d>] mdt_reint_rec+0x5d/0x200 [mdt] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa10efd9b>] mdt_reint_internal+0x62b/0xa50 [mdt] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa10f066b>] mdt_reint+0x6b/0x120 [mdt] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa0a1f17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09ca66b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa067c84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09c56e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc] Jan 24 16:41:35 lola-10 kernel: [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90 Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09cba11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc] Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09caf50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc] Jan 24 16:41:35 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0 Jan 24 16:41:35 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Jan 24 16:41:35 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Jan 24 16:41:35 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Jan 24 16:41:35 lola-10 kernel: Jan 24 16:41:35 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1485304895.6159
Mount occurs
Jan 24 16:46:22 lola-10 kernel: Lustre: soaked-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
...
Jan 24 16:46:27 lola-10 kernel: Lustre: soaked-MDT0003: Will be in recovery for at least 2:30, or until 22 clients reconnect
...
Jan 24 16:47:44 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 24 16:47:44 lola-10 kernel: req@ffff88041171b980 x1557448608035648/t21476913294(0) o36->57c15fb3-4c15-952f-3686-35f4c9caa941@192.168.1.126@o2ib100:-1/-1 lens 632/568 e 19 to 0 dl 1485305269 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 24 16:47:47 lola-10 kernel: req@ffff880737e13cc0 x1557448490734944/t21476914413(0) o36->1e7aaa7d-f417-ebd3-eff8-7787d5dac449@192.168.1.123@o2ib100:-1/-1 lens 616/3128 e 19 to 0 dl 1485305272 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) Skipped 22 previous similar messages
Jan 24 16:47:50 lola-10 kernel: Lustre: soaked-MDT0002: Client ffcd9b78-a534-1d1b-1494-f25c5b00edf1 (at 192.168.1.131@o2ib100) reconnecting
Recovery appears to stall
Jan 24 16:50:18 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1 Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1 Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) Skipped 2 previous similar messages Jan 24 16:50:21 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
System is still spinning at the present time.