Details
Bug
Resolution: Cannot Reproduce
Major
None
Lustre 2.9.0, Lustre 2.10.0
Soak cluster lustre: 2.8.60_5_gcc5601d
3
Description
LFSCK fails to complete on lola-8, MDT0000:
lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace
fi
fi
2016-11-21 13:21:36,440:fsmgmt.fsmgmt:INFO lfsck started on lola-8
2016-11-21 13:21:52,069:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 15s
2016-11-21 13:22:22,672:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 45s
2016-11-21 13:23:23,898:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 105s
2016-11-21 13:25:26,280:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 225s
2016-11-21 13:29:31,072:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 465s
2016-11-21 13:37:40,601:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 945s
2016-11-21 13:53:59,778:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 1905s
2016-11-21 14:26:38,226:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 3825s
2016-11-21 15:31:55,117:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 7665s
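For reading the progress messages above: the elapsed times suggest the monitoring script waits twice as long before each successive check. That doubling is an inference from the logged values, not something confirmed from the fsmgmt source. A minimal Python sketch (the helper names are hypothetical) that extracts the elapsed times and shows the gaps between checks:

```python
import re

# Progress messages copied from the fsmgmt log above (timestamps trimmed).
LOG = """\
lfsck still in progress for soaked-MDT0000 after 15s
lfsck still in progress for soaked-MDT0000 after 45s
lfsck still in progress for soaked-MDT0000 after 105s
lfsck still in progress for soaked-MDT0000 after 225s
lfsck still in progress for soaked-MDT0000 after 465s
lfsck still in progress for soaked-MDT0000 after 945s
lfsck still in progress for soaked-MDT0000 after 1905s
lfsck still in progress for soaked-MDT0000 after 3825s
lfsck still in progress for soaked-MDT0000 after 7665s
"""

PATTERN = re.compile(r"lfsck still in progress for (\S+) after (\d+)s")


def elapsed_seconds(log_text):
    """Return the elapsed-seconds value from each progress message."""
    return [int(m.group(2)) for m in PATTERN.finditer(log_text)]


def poll_gaps(elapsed):
    """Return the wait before each check, measuring the first from t = 0."""
    previous = [0] + elapsed[:-1]
    return [now - before for now, before in zip(elapsed, previous)]


elapsed = elapsed_seconds(LOG)
gaps = poll_gaps(elapsed)
print(elapsed)
print(gaps)
```

If the doubling assumption holds, the next message would not have appeared until roughly 15345s, so the absence of further log lines after 7665s does not by itself indicate when the scan was stopped.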
I aborted LFSCK with lfsck_stop.
The LFSCK stopped, but clients and other servers were not able to re-connect.
Example client:
Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib10_0) after server handle changed from 0xd9fafa0ca7b8e5dc to 0x732870fe43aa2fe7
Lustre: MGC192.168.1.108@o2ib10: Connection restored to MGC192.168.1.108@o2ib10_0 (at 192.168.1.108@o2ib10)
Lustre: Skipped 1 previous similar message
LustreError: 183198:0:(lmv_obd.c:1402:lmv_statfs()) can't stat MDS #0 (soaked-MDT0000-mdc-ffff880426c9c000), error -4
LustreError: 183198:0:(llite_lib.c:1736:ll_statfs_internal()) md_statfs fails: rc = -4
The system appears to be wedged in this state; rebooting and remounting the lola-8 MDT does not fix the issue.
I dumped the Lustre log on lola-8 while LFSCK was running; it is attached.
Also, the lfsck_layout