
[LU-6057] replay-dual test_9 failed - post-failover df: 1

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: OpenSFS cluster running lustre-master tag 2.6.91 build # 2771 with two MDSs with one MDT each, three OSSs with two OSTs each and three clients.
    • Severity: 3
    • Rank: 16869

    Description

      While running the LFSCK Phase 3 test plan, replay-dual test 9 failed with the following error from the fail() routine:

      c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 181 sec
      c12: stat: cannot read file system information for `/lustre/scratch': Input/output error
      

      replay-dual test 10 failed with the same error message:

      c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      c11: stat: cannot read file system information for `/lustre/scratch': Input/output error
      pdsh@c13: c11: ssh exited with exit code 1
      c13: stat: cannot read file system information for `/lustre/scratch': Input/output error
      

      The test results are at https://testing.hpdd.intel.com/test_sets/78dc0abe-861b-11e4-ac52-5254006e85c2
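
      For context, the "post-failover df: 1" message comes from the fail() helper in the test framework: after restarting the MDS it waits for each client's MDC import to return to FULL, then checks that every client can still statfs the file system. A rough sketch of that logic, paraphrased from lustre/tests/test-framework.sh (exact details vary between versions), is:

      # Paraphrased sketch of the fail() helper used by replay-dual; the real
      # implementation is in lustre/tests/test-framework.sh.
      fail() {
          local facets=$1
          local clients=${CLIENTS:-$HOSTNAME}

          # Restart/failover the given facet(s), e.g. "mds1".
          facet_failover $* || error "failover: $?"
          # Wait for the clients' MDC imports to reach FULL again; this prints
          # the "mds_server_uuid in FULL state after N sec" lines seen above.
          wait_clients_import_state "$clients" "$facets" FULL
          # Finally make sure every client can statfs the mount point; the
          # failing stat/df on the client is what produces "post-failover df: 1".
          clients_up || error "post-failover df: $?"
      }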

      It’s not clear from the logs what caused this error. For test 9, the client that could not stat the file system, c12, has the following in dmesg right before the test fails:

      00800000:00020000:5.0:1418766109.528501:0:25671:0:(lmv_obd.c:1477:lmv_statfs()) can't stat MDS #0 (scratch-MDT0000-mdc-ffff8808028cbc00), error -5
      
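      For reference, error -5 is -EIO, which is what statfs() returns on the client once its MDC import has been evicted. As a hypothetical follow-up check on c12, the import state can be inspected directly (the grep pattern is only an illustration):

      # On the affected client (c12): an evicted or disconnected import shows a
      # state other than FULL in the MDC import file.
      lctl get_param mdc.scratch-MDT0000-mdc-*.import | grep -E 'state|current_connection'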

      On the primary MDS, MDS0, the recovery looks like it is having issues:

      Lustre: *** cfs_fail_loc=119, val=2147483648***
      LustreError: 12646:0:(ldlm_lib.c:2384:target_send_reply_msg()) @@@ dropping reply  req@ffff880d0ee74c80 x1487677070285728/t128849018882(128849018882) o36->558cba8f-7f43-4143-5d8a-c7adfced85eb@192.168.2.112@o2ib:308/0 lens 488/448 e 0 to 0 dl 1418766108 ref 1 fl Complete:/4/0 rc 0/0
      Lustre: scratch-MDT0000: recovery is timed out, evict stale exports
      Lustre: scratch-MDT0000: disconnecting 1 stale clients
      Lustre: 12646:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
      Lustre: 12646:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      Lustre: 12646:0:(ldlm_lib.c:1773:target_recovery_overseer()) Skipped 2 previous similar messages
      Lustre: 12646:0:(ldlm_lib.c:1415:abort_req_replay_queue()) @@@ aborted:  req@ffff880275011380 x1487683234832604/t0(128849018884) o36->d08d2f7b-4c89-7208-ad20-237f0ed0a102@192.168.2.113@o2ib:294/0 lens 488/0 e 6 to 0 dl 1418766094 ref 1 fl Complete:/4/ffffffff rc 0/-1
      Lustre: 12646:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
      Lustre: 12646:0:(ldlm_lib.c:2060:target_recovery_thread()) too long recovery - read logs
      Lustre: scratch-MDT0000: Recovery over after 3:01, of 7 clients 1 recovered and 6 were evicted.
      LustreError: dumping log to /tmp/lustre-log.1418766079.12646
      Lustre: Skipped 3 previous similar messages
      Lustre: DEBUG MARKER: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 181 sec
      Lustre: DEBUG MARKER: replay-dual test_9: @@@@@@ FAIL: post-failover df: 1
      
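      The console messages above should match what the per-target recovery_status file reports after the fact; as an illustration, the recovery outcome (aborted by hard timeout, 1 of 7 clients recovered, 6 evicted) could be read back on the MDS with:

      # Run on the restarted MDS; shows the recovery state and the recovered
      # vs. evicted client counts for the MDT.
      lctl get_param mdt.scratch-MDT0000.recovery_status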

          Activity


            yong.fan nasf (Inactive) added a comment - It is another failure instance of LU-6084.

            tappro Mikhail Pershin added a comment - There is an unexpected recovery abort, so the test cannot be considered valid. It is related to LU-6084; I mean the recovery abort issue only.

            jamesanunez James Nunez (Inactive) added a comment - No other programs were running while replay-dual was running.

            yong.fan nasf (Inactive) added a comment - The client on node c12 was evicted. According to its log, the c12 client tried to replay the create RPC after the MDS0 restart:

            00000080:00000001:6.0:1418765877.138456:0:25481:0:(namei.c:927:ll_mknod()) Process entered
            00000080:00200000:6.0:1418765877.138457:0:25481:0:(namei.c:931:ll_mknod()) VFS Op:name=fsa-c12, dir=[0x200000007:0x1:0x0](ffff88080bc08bb8) mode 100644 dev 0
            ...
            00000100:00000001:6.0:1418765877.338798:0:25510:0:(client.c:2140:ptlrpc_set_wait()) Process entered
            00000100:00000001:6.0:1418765877.338798:0:25510:0:(client.c:1423:ptlrpc_send_new_req()) Process entered
            00000100:00000040:6.0:1418765877.338801:0:25510:0:(lustre_net.h:3328:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff88080a994680 x1487677070285744/t0(0) o36->scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/568 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
            ...
            00000100:00100000:4.0:1418765898.178881:0:6899:0:(client.c:2544:ptlrpc_free_committed()) @@@ stopping search  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36->scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/416 e 0 to 0 dl 1418766088 ref 1 fl Complete:R/4/0 rc 0/0
            00000100:00000001:4.0:1418765898.178886:0:6899:0:(client.c:2576:ptlrpc_free_committed()) Process leaving
            00000100:00080000:4.0:1418765898.178887:0:6899:0:(recover.c:93:ptlrpc_replay_next()) import ffff88081075c000 from scratch-MDT0000_UUID committed 124554051594 last 0
            00000100:00000001:4.0:1418765898.178890:0:6899:0:(client.c:2842:ptlrpc_replay_req()) Process entered
            00000100:00080000:4.0:1418765898.178892:0:6899:0:(client.c:2866:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36->scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/416 e 0 to 0 dl 1418766088 ref 1 fl New:R/4/0 rc 0/0
            00000100:00000001:4.0:1418765898.178897:0:6899:0:(client.c:2643:ptlrpc_request_addref()) Process entered
            00000100:00000001:4.0:1418765898.178898:0:6899:0:(client.c:2645:ptlrpc_request_addref()) Process leaving (rc=18446612166851774080 : -131906857777536 : ffff88080a994680)
            00000100:00000040:4.0:1418765898.178901:0:6899:0:(ptlrpcd.c:246:ptlrpcd_add_req()) @@@ add req [ffff88080a994680] to pc [ptlrpcd_rcv:-1]  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36->scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/416 e 0 to 0 dl 1418766088 ref 2 fl New:R/4/0 rc 0/0
            ...
            00000100:00000001:2.0:1418766079.080356:0:6906:0:(ptlrpcd.c:363:ptlrpcd_check()) Process leaving (rc=0 : 0 : 0)
            00000100:00000001:3.0:1418766079.088831:0:6899:0:(client.c:1176:ptlrpc_check_status()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
            00000100:00000001:3.0:1418766079.088834:0:6899:0:(client.c:1340:after_reply()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
            00000100:00000001:7.0:1418766079.088837:0:6902:0:(client.c:2001:ptlrpc_expired_set()) Process entered
            00000100:00000001:7.0:1418766079.088838:0:6902:0:(client.c:2037:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1)
            00000100:00000040:3.0:1418766079.088838:0:6899:0:(lustre_net.h:3328:ptlrpc_rqphase_move()) @@@ move req "Rpc" -> "Interpret"  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36->scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/192 e 0 to 0 dl 1418766109 ref 2 fl Rpc:R/4/0 rc -107/-107
            ...
            

            When c12 replayed the create RPC, it finally got failure -107 (-ENOTCONN). The -107 was returned because MDS0 had already marked the recovery as failed.

            On the MDS0 side, some expected RPCs were never sent to MDS0 for replay (we do not know which RPCs were missed), so the recovery did not complete. Unfortunately, the MDS0 debug logs only contain information from after the recovery.
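
            The binary dump that the MDS produced (/tmp/lustre-log.1418766079.12646 in the console log above) can be converted to text with lctl and searched for the xid of the replayed create, although as noted it only covers the post-recovery window. The output path below is just an example:

            # Convert the binary debug dump to text (output path is an example)
            # and look for the replayed create request by its xid.
            lctl debug_file /tmp/lustre-log.1418766079.12646 /tmp/mds0-postrecovery.txt
            grep x1487677070285744 /tmp/mds0-postrecovery.txt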


            yong.fan nasf (Inactive) added a comment - Because some of the clients failed to replay the expected requests during the MDS0 restart, they were evicted by MDS0. So it is normal that a subsequent statfs() on those evicted clients failed. Unfortunately, the debug log on MDS0 only contains events from after the recovery, so we cannot tell what happened during the recovery or what caused the evictions.

            According to our current LFSCK implementation, if a former LFSCK run was paused or crashed before the MDS0 restart, it would be automatically resumed after the MDS0 recovery. I found the related logs on MDS0:

            00000004:00000001:0.0:1418766079.340082:0:12646:0:(mdt_handler.c:5813:mdt_postrecov()) Process entered
            00000004:00000001:0.0:1418766079.340083:0:12646:0:(mdd_device.c:1458:mdd_iocontrol()) Process entered
            00100000:00000001:0.0:1418766079.340085:0:12646:0:(lfsck_lib.c:2618:lfsck_start()) Process entered
            00100000:00000001:0.0:1418766079.340086:0:12646:0:(lfsck_lib.c:2632:lfsck_start()) Process leaving via put (rc=0 : 0 : 0x0)
            00000004:00000001:0.0:1418766079.340088:0:12646:0:(mdd_device.c:1478:mdd_iocontrol()) Process leaving (rc=0 : 0 : 0)
            

            That means MDS0 tried to resume LFSCK after the recovery. Because there was no inconsistency and no paused/crashed LFSCK before the MDS0 restart, lfsck_start() exited directly, as expected. So the replay-dual test_9 failure is not related to LFSCK; it looks more like a general master branch issue. Unfortunately, with no recovery logs on MDS0, it is not easy to investigate further.
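
            For completeness, one way to double-check that no LFSCK run was actually resumed is to dump the per-MDT LFSCK status on MDS0; it should not report a scanning phase:

            # Run on MDS0; shows the namespace LFSCK status for the MDT
            # (assuming the default namespace LFSCK component here).
            lctl get_param mdd.scratch-MDT0000.lfsck_namespace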


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 5
