[LU-801] Test failure on test suite liblustre, subtest test_1 Created: 28/Oct/11  Updated: 19/Jan/12  Resolved: 19/Jan/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Chris Gearing (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-703 liblustre timeout Resolved
Severity: 3
Rank (Obsolete): 6534

 Description   

This issue was created by maloo for Chris Gearing <chris@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b2aec26e-00e9-11e1-bd0b-52540025f9af.

See errors in dmesg for client vm1

The sub-test test_1 failed with the following error:

test failed to respond and timed out

Info required for matching: liblustre 1



 Comments   
Comment by Johann Lombardi (Inactive) [ 24/Nov/11 ]

I looked at the logs, and it seems that Lustre was not running on the MDS, which is why liblustre could not connect to the MGS.

Lustre: DEBUG MARKER: == lfsck lfsck.sh test complete, duration 131 sec ==================================================== 19:20:53 (1319682053)
Lustre: Modifying parameter lustre-MDT0000.mdd.quota_type in log lustre-MDT0000
Lustre: Skipped 7 previous similar messages
LustreError: 3562:0:(ldlm_request.c:1173:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 3562:0:(ldlm_request.c:1173:ldlm_cli_cancel_req()) Skipped 1 previous similar message
LustreError: 3562:0:(ldlm_request.c:1800:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108 
LustreError: 3562:0:(ldlm_request.c:1800:ldlm_cli_cancel_list()) Skipped 1 previous similar message
LustreError: 2249:0:(mgs_handler.c:804:mgs_handle()) lustre_mgs: operation 251 on unconnected MGS
LustreError: 2249:0:(ldlm_lib.c:2163:target_send_reply_msg()) @@@ processing error (-107)  req@ffff81003ac8e050 x1383767884708251/t0(0) o-1-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1319682066 ref 1 fl Interpret:/ffffffff/ffffffff rc -107/-1
LustreError: 11-0: an error occurred while communicating with 0@lo. The mgs_disconnect operation failed with -107 
LustreError: Skipped 2 previous similar messages
Lustre: MGS has stopped.
Lustre: server umount lustre-MDT0000 complete
Lustre: DEBUG MARKER: -----============= acceptance-small: liblustre ============----- Wed Oct 26 19:21:50 PDT 2011 
Lustre: DEBUG MARKER: == liblustre test 1: liblustre sanity ================================================================ 19:21:55 (1319682115)
SysRq : Show State

So the MGS/MDS was not remounted before the liblustre tests ran. I also checked the sysrq-t output, and there are indeed no MDS threads running.

liblustre sanity definitely requires the filesystem to be up and running before being invoked. Is it an autotest issue?
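
The check described above (was a server target unmounted before the liblustre test marker appeared?) can be sketched as a small log scan. This is only an illustration: the function name and the sample list are hypothetical, the patterns are taken from the message formats in the excerpt above, and the sketch does not detect a later remount because the excerpt does not show what a mount message looks like.

```python
import re

# Patterns copied from the message formats seen in the dmesg excerpt above.
UMOUNT_RE = re.compile(r"Lustre: server umount (\S+) complete")
MARKER_RE = re.compile(r"Lustre: DEBUG MARKER: (.*)")


def targets_unmounted_before(lines, test_marker):
    """Return the targets whose 'server umount ... complete' message
    appears before the first DEBUG MARKER containing test_marker.

    Returns None if no such marker is found in the log.
    Hypothetical helper: it does not look for remount messages.
    """
    unmounted = []
    for line in lines:
        marker = MARKER_RE.search(line)
        if marker and test_marker in marker.group(1):
            return unmounted
        umount = UMOUNT_RE.search(line)
        if umount:
            unmounted.append(umount.group(1))
    return None


# Abbreviated copy of the log lines quoted in this comment.
sample = [
    "Lustre: DEBUG MARKER: == lfsck lfsck.sh test complete, duration 131 sec ==",
    "Lustre: MGS has stopped.",
    "Lustre: server umount lustre-MDT0000 complete",
    "Lustre: DEBUG MARKER: -----============= acceptance-small: liblustre ============-----",
    "Lustre: DEBUG MARKER: == liblustre test 1: liblustre sanity ==",
]

print(targets_unmounted_before(sample, "liblustre test 1"))
```

A non-empty result for a test that needs a running filesystem would be a hint, not proof, that the servers were down when the test started.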

Comment by Johann Lombardi (Inactive) [ 24/Nov/11 ]

Chris, Yujian, any thoughts?

Comment by Chris Gearing (Inactive) [ 19/Dec/11 ]

I don't have any useful input, I'm afraid. I did look at the logs before raising the bug.

Comment by Peter Jones [ 04/Jan/12 ]

Minh

Are you able to comment on this one?

Thanks

Peter

Comment by Minh Diep [ 09/Jan/12 ]

I have looked at this log and at past lfsck logs. We don't see a Lustre unmount at the end of lfsck, and I don't know how this log came to show one. I even ran lfsck manually and saw no umount at all. Reassigning this to Chris, since it might relate to how autotest runs lfsck.

Comment by Andreas Dilger [ 19/Jan/12 ]

Closing as a duplicate of LU-703
