[LU-1246] SANITY_QUOTA test_32 failed in cleanup_and_setup_lustre with LOAD_MODULES_REMOTE=true Created: 21/Mar/12 Updated: 30/May/12 Resolved: 30/May/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jay Lan (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
One combined MGS/MDS, two OSSs, two clients, lustre-1.8.6.81. |
||
| Severity: | 3 |
| Rank (Obsolete): | 6108 |
| Description |
|
SANITY_QUOTA test_32 always failed. The test was started from service331 (actually a Lustre client): ... This appears to be the only test that sets LOAD_MODULES_REMOTE=true before calling cleanup_and_setup_lustre and fails. Sometimes only OST1 hit the error 108 problem; sometimes both OST1 and OST2 were hit. I put "sleep 3" in setupall(). The 'dmesg' from the MDS (service360) showed: ... The 'dmesg' from OST1 (service361) showed: ... |
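For context, a hedged sketch of the pattern the description refers to; the actual body of test_32() in sanity-quota.sh is not quoted in this report, so the lines below are an illustration only, with the quota-specific steps elided:

    test_32() {
        # ask test-framework.sh to (re)load modules on the remote server nodes as well
        LOAD_MODULES_REMOTE=true
        # tear the filesystem down and set it up again (where the error 108 was hit)
        cleanup_and_setup_lustre
        # ... quota-specific test steps elided ...
    }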
| Comments |
| Comment by Jay Lan (Inactive) [ 21/Mar/12 ] |
|
Each time after a failure, an 'lctl ping' between the MDS and the OSTs (in both directions) worked. Manually executing the mount command from the OST also worked. It only failed during the test. |
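A minimal sketch of the manual checks described above; the NID is taken from the dmesg excerpt later in this ticket, while the device and mount paths are placeholders:

    # from the OSS, verify LNET connectivity to the MGS/MDS NID (and the reverse from the MDS)
    lctl ping 10.151.26.38@o2ib
    # manually mount the OST target to confirm the mount itself works outside the test
    mount -t lustre /dev/<ost-device> /mnt/ost1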
| Comment by Peter Jones [ 21/Mar/12 ] |
|
Niu, could you please comment? Thanks. Peter |
| Comment by Niu Yawei (Inactive) [ 22/Mar/12 ] |
|
Hi Jay, could you try commenting out "LOAD_MODULES_REMOTE=true" in sanity-quota test_32() to see if the problem goes away? In load_modules() of test-framework.sh, there is a comment: # bug 19124
# load modules on remote nodes optionally
# lustre-tests have to be installed on these nodes
Could you make sure that lustre-tests is installed correctly on the remote nodes (MDS & OSS)? Thanks. |
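A hedged sketch of one way to check that prerequisite; the install path is an assumption (the usual lustre-tests location on these distributions), and service360/service361 are the MDS and OST1 node names from this report:

    # verify the test scripts are present on each remote server node
    for node in service360 service361; do
        ssh $node 'ls /usr/lib64/lustre/tests/test-framework.sh' \
            || echo "lustre-tests appears to be missing on $node"
    done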
| Comment by Jay Lan (Inactive) [ 22/Mar/12 ] |
|
Hi Niu, 1. I turned off LOAD_MODULES_REMOTE=yes a few days ago, and the problem went away. |
| Comment by Jay Lan (Inactive) [ 22/Mar/12 ] |
|
BTW, by "all other tests" I meant the test suites that Maloo runs when a new patch |
| Comment by Niu Yawei (Inactive) [ 23/Mar/12 ] |
|
Thanks, Jay. I don't know why the OST can't communicate with the MGS in your case. Is it possible to get a full debug log on the MDS & OSS? (You can set PTLDEBUG to -1 on the MDS & OSS nodes; I think the test will dump the debug log automatically when it fails, or you can dump the debug log to a file with 'lctl dk'.) |
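A minimal sketch of the debug collection being suggested; the output path is a placeholder:

    # in the test environment, request full debugging on all nodes
    PTLDEBUG=-1
    # or set it directly on an MDS/OSS node
    lctl set_param debug=-1
    # after a failure, dump the kernel debug buffer to a file
    lctl dk > /tmp/lustre-debug.$(hostname).log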
| Comment by Jay Lan (Inactive) [ 26/Mar/12 ] |
|
Hi Niu, I will have to do that later. The MDS and OSS nodes have been re-imaged to RHEL 6.2 with the lustre-2.1.1 server code. When I am done with 2.1.1 I will re-image them back to 1.8.6 and provide the information you need. BTW, does the following message (cited from the dmesg of OSS1 in the "Description" of this bug report) imply that the timeout first occurred on the MDS node? Lustre: 5972:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1395992728675719 sent from MGC10.151.26.38@o2ib to NID 10.151.26.38@o2ib 6s ago has timed out (6s prior to deadline). |
| Comment by Niu Yawei (Inactive) [ 26/Mar/12 ] |
|
Thanks, Jay.
I think it indicates an OST-to-MGS request timeout: the message is from OSS1's dmesg and shows its MGC's request to the MGS NID (10.151.26.38@o2ib) timing out, rather than a timeout originating on the MDS. |
| Comment by Jay Lan (Inactive) [ 30/May/12 ] |
|
Hi Niu, We upgraded our servers to 2.1.1 last week, and I have not seen this problem when testing with the 2.1 servers. This problem is therefore no longer important to us. You may close it. |
| Comment by Peter Jones [ 30/May/12 ] |
|
ok thanks Jay |