Details
-
Bug
-
Resolution: Duplicate
-
Critical
-
None
-
Lustre 2.7.0
-
None
-
OpenSFS cluster running lustre-master tag 2.6.91 build # 2771 with two MDSs with one MDT each, three OSSs with two OSTs each and three clients.
-
3
-
16869
Description
While running the LFSCK Phase 3 test plan, replay-dual test 9 failed with the error in the routine fail():
c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 181 sec
c12: stat: cannot read file system information for `/lustre/scratch': Input/output error
replay-dual test 10 failed with the same error message:
c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
c11: stat: cannot read file system information for `/lustre/scratch': Input/output error
pdsh@c13: c11: ssh exited with exit code 1
c13: stat: cannot read file system information for `/lustre/scratch': Input/output error
The test results are at https://testing.hpdd.intel.com/test_sets/78dc0abe-861b-11e4-ac52-5254006e85c2
It is not clear from the logs which messages are related to this error. For test 9, the client that could not stat the file system, c12, has the following in dmesg right before the test fails:
00800000:00020000:5.0:1418766109.528501:0:25671:0:(lmv_obd.c:1477:lmv_statfs()) can't stat MDS #0 (scratch-MDT0000-mdc-ffff8808028cbc00), error -5
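The `error -5` in the lmv_statfs() message is a negated errno value. As a quick cross-check, and assuming standard Linux errno numbering, the sketch below maps it to its symbolic name, which matches the "Input/output error" that stat reports on the client. The helper name `describe_kernel_rc` is only illustrative and is not part of Lustre.

```python
import errno
import os


def describe_kernel_rc(rc: int) -> str:
    """Map a negative kernel return code (e.g. -5) to its errno name and message."""
    code = -rc
    # errno.errorcode maps numeric codes to symbolic names such as "EIO".
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# The return code taken from the lmv_statfs() message above.
print(describe_kernel_rc(-5))
```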
On the primary MDS, MDS0, recovery appears to be having problems:
Lustre: *** cfs_fail_loc=119, val=2147483648***
LustreError: 12646:0:(ldlm_lib.c:2384:target_send_reply_msg()) @@@ dropping reply req@ffff880d0ee74c80 x1487677070285728/t128849018882(128849018882) o36->558cba8f-7f43-4143-5d8a-c7adfced85eb@192.168.2.112@o2ib:308/0 lens 488/448 e 0 to 0 dl 1418766108 ref 1 fl Complete:/4/0 rc 0/0
Lustre: scratch-MDT0000: recovery is timed out, evict stale exports
Lustre: scratch-MDT0000: disconnecting 1 stale clients
Lustre: 12646:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
Lustre: 12646:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: 12646:0:(ldlm_lib.c:1773:target_recovery_overseer()) Skipped 2 previous similar messages
Lustre: 12646:0:(ldlm_lib.c:1415:abort_req_replay_queue()) @@@ aborted: req@ffff880275011380 x1487683234832604/t0(128849018884) o36->d08d2f7b-4c89-7208-ad20-237f0ed0a102@192.168.2.113@o2ib:294/0 lens 488/0 e 6 to 0 dl 1418766094 ref 1 fl Complete:/4/ffffffff rc 0/-1
Lustre: 12646:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
Lustre: 12646:0:(ldlm_lib.c:2060:target_recovery_thread()) too long recovery - read logs
Lustre: scratch-MDT0000: Recovery over after 3:01, of 7 clients 1 recovered and 6 were evicted.
LustreError: dumping log to /tmp/lustre-log.1418766079.12646
Lustre: Skipped 3 previous similar messages
Lustre: DEBUG MARKER: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 181 sec
Lustre: DEBUG MARKER: replay-dual test_9: @@@@@@ FAIL: post-failover df: 1
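When triaging several runs, it can help to pull the recovery summary out of the MDS console log automatically. The following is a minimal sketch, not an existing Lustre tool: the function name `parse_recovery_summary` and the regular expression are assumptions based only on the "Recovery over after ..." message shown above, and the duration is assumed to be in minutes:seconds form.

```python
import re
from typing import Optional

# Pattern modeled on the single "Recovery over" message quoted in this ticket.
RECOVERY_RE = re.compile(
    r"(?P<target>\S+): Recovery over after (?P<min>\d+):(?P<sec>\d+), "
    r"of (?P<total>\d+) clients (?P<recovered>\d+) recovered "
    r"and (?P<evicted>\d+) were evicted"
)


def parse_recovery_summary(log_text: str) -> Optional[dict]:
    """Return recovery statistics from the first summary message, or None."""
    m = RECOVERY_RE.search(log_text)
    if m is None:
        return None
    return {
        "target": m.group("target"),
        # Assumes the duration is printed as minutes:seconds.
        "duration_sec": int(m.group("min")) * 60 + int(m.group("sec")),
        "total_clients": int(m.group("total")),
        "recovered": int(m.group("recovered")),
        "evicted": int(m.group("evicted")),
    }


sample = (
    "Lustre: scratch-MDT0000: Recovery over after 3:01, "
    "of 7 clients 1 recovered and 6 were evicted."
)
print(parse_recovery_summary(sample))
```

Under that assumption, the parsed duration of 3:01 is 181 seconds, which agrees with the "FULL state after 181 sec" debug marker in the same log.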