Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.11.0
- Soak test cluster, lustre-master build 3606 version=2.9.59_32_g62bc3af
- 3
Description
Sequence:
- MDS failover occurs.
- Failover completes on the failover nodes.
- Recovery then blocks on all MDSs.
Jul 7 15:34:17 soak-9 kernel: LDISKFS-fs warning (device dm-6): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
Jul 7 15:35:00 soak-9 kernel: LDISKFS-fs (dm-6): recovery complete
Jul 7 15:35:00 soak-9 kernel: LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
Jul 7 15:35:06 soak-9 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.128@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Jul 7 15:35:06 soak-9 kernel: Lustre: soaked-MDT0001: Not available for connect from 192.168.1.132@o2ib (not set up)
Jul 7 15:35:06 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114
Jul 7 15:35:07 soak-9 kernel: Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Jul 7 15:35:09 soak-9 kernel: Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 37 clients reconnect
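When triaging logs like the ones above, it can help to pull out just the failed inter-MDT operations and their return codes. The following is a minimal sketch, not part of the original report: the function name `find_failed_operations` and the regular expression are illustrative assumptions based only on the message format seen in this log.

```python
import re

# Matches messages of the form seen in this report, e.g.:
#   LustreError: 11-0: <device>: operation <op> to node <nid> failed: rc = <rc>
# The pattern is an assumption derived from these log lines only.
FAILED_OPERATION = re.compile(
    r"LustreError: 11-0: (?P<device>\S+): operation (?P<op>\w+) "
    r"to node (?P<nid>\S+) failed: rc = (?P<rc>-?\d+)"
)


def find_failed_operations(log_text):
    """Return (device, operation, nid, rc) for each failed-operation message."""
    return [
        (m["device"], m["op"], m["nid"], int(m["rc"]))
        for m in FAILED_OPERATION.finditer(log_text)
    ]


sample = (
    "Jul 7 15:35:06 soak-9 kernel: LustreError: 11-0: "
    "soaked-MDT0000-osp-MDT0001: operation mds_connect to node "
    "192.168.1.108@o2ib failed: rc = -114"
)
print(find_failed_operations(sample))
```

On Linux, errno 114 is EALREADY ("Operation already in progress"), so rc = -114 here indicates that the mds_connect attempt was rejected because a connection was already in progress.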
The failover node stays in a WAITING state:
soak-10
----------------
mdt.soaked-MDT0002.recovery_status=
status: WAITING
non-ready MDTs: 0003
recovery_start: 1499451258
time_waited: 2147
Jul 7 18:29:12 soak-10 kernel: LustreError: 11-0: soaked-MDT0003-osp-MDT0002: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
Jul 7 18:29:12 soak-10 kernel: LustreError: Skipped 11 previous similar messages
Jul 7 18:29:13 soak-10 kernel: Lustre: 3682:0:(ldlm_lib.c:1784:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
Jul 7 18:29:13 soak-10 kernel: Lustre: 3682:0:(ldlm_lib.c:1784:extend_recovery_timer()) Skipped 9 previous similar messages
Jul 7 18:29:29 soak-10 kernel: Lustre: soaked-MDT0002: Recovery already passed deadline 0:08, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
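The WAITING state above can be detected automatically from the recovery_status output. The sketch below is illustrative only: `parse_recovery_status` and `is_stuck` are hypothetical helper names, the 900-second threshold is taken from the "hard limit: 900" message in this log, and the rule "WAITING for longer than the hard limit means stuck" is an assumed heuristic, not Lustre's own definition.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines from recovery_status output into a dict.

    Lines without a 'key: value' pair (such as the parameter-name line
    ending in '=') are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def is_stuck(fields, hard_limit=900):
    """Assumed heuristic: still WAITING after more than hard_limit seconds."""
    return (
        fields.get("status") == "WAITING"
        and int(fields.get("time_waited", 0)) > hard_limit
    )


sample = """\
mdt.soaked-MDT0002.recovery_status=
status: WAITING
non-ready MDTs: 0003
recovery_start: 1499451258
time_waited: 2147
"""
fields = parse_recovery_status(sample)
print(fields["non-ready MDTs"], is_stuck(fields))
```

With the values from this report, time_waited (2147 seconds) is well past the 900-second hard limit while MDT0003 is still listed as non-ready, which matches the "DNE recovery is failed or stuck" warning in the log.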
Lustre logs were dumped on the MDS multiple times during this, and stacks were dumped as well; both are attached.
Issue Links
- is related to: LU-9274 LBUG: (recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed: (Resolved)