[LU-1342] Test failure on sanity-quota test_29 Created: 24/Apr/12 Updated: 17/Apr/13 Resolved: 22/Dec/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | Lustre 2.1.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jay Lan (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Server: rhel6.2 with lustre-2.1.1 MDS/MGS: service360 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4547 |
| Description |
|
This looks like a duplicate of an existing issue. The git source of our code is at https://github.com/jlan/lustre-nas/tree/nas-2.1.1. The command I issued was:

The failure is reproducible. The relevant portion of test_29() is:

    #define OBD_FAIL_MDS_QUOTACTL_NET 0x12e
    $LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR &
    pid=$!
    echo "sleeping for 10 * 1.25 + 5 + 10 seconds"

Is the "setquota failed: Transport endpoint is not connected" error expected? If I tried a "sleep 40" (instead of "sleep 28") after that, the lfs |
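For context, a minimal sketch of what this part of test_29 exercises, reconstructed from the excerpt above (the lustre_fail/error helpers and the final check are assumptions about the script, not verbatim code):

    # Sketch of the timing-sensitive part of sanity-quota test_29 (assumed layout).
    # OBD_FAIL_MDS_QUOTACTL_NET (0x12e) makes the MDS drop the MDS_QUOTACTL request,
    # so the client-side lfs setquota can only return once its RPC times out.
    lustre_fail mds 0x12e                 # set fail_loc on the MDS (framework helper)

    $LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR &
    pid=$!

    echo "sleeping for 10 * 1.25 + 5 + 10 seconds"
    sleep 28                              # must exceed the client's RPC timeout

    if kill -0 $pid 2>/dev/null; then     # lfs still stuck after the sleep
        kill -9 $pid
        error "lfs setquota did not time out within 28 seconds"
    fi

    lustre_fail mds 0                     # clear fail_loc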
| Comments |
| Comment by Peter Jones [ 25/Apr/12 ] |
|
Bobi, could you please look into this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 25/Apr/12 ] |
|
Guess the test needs to take the net latency into account in the wait time. Patch tracking at http://review.whamcloud.com/2601 |
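The idea would be to derive the wait from the node's actual tunables instead of hard-coding 28 seconds. A sketch of that calculation (timeout/at_min are standard obd tunables, but whether lctl get_param exposes them on 2.1 and the exact formula are assumptions, not the content of the patch):

    # Compute an upper bound on how long the lfs RPC can take to time out,
    # using the live tunables rather than hard-coded constants.
    TIMEOUT=$(lctl get_param -n timeout)
    AT_MIN=$(lctl get_param -n at_min)

    # With adaptive timeouts the estimated service time is floored at at_min,
    # so take the larger of the configured timeout and at_min as the base.
    BASE=$((TIMEOUT > AT_MIN ? TIMEOUT : AT_MIN))

    # Same shape as the script comment: base * 1.25 + 5 (margin) + 10 (net latency).
    WAIT=$(( (BASE * 125 + 99) / 100 + 5 + 10 ))
    echo "waiting up to $WAIT seconds for lfs setquota to time out"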
| Comment by Bob Glossman (Inactive) [ 26/Apr/12 ] |
|
"setquota failed: Transport endpoint is not connected" is the expected error. 00000100:00100000:7.0:1335289642.393283:0:14946:0:(service.c:1536:ptlrpc_server_handle_req_in()) got req x1398863346727737 In the client log I see the following related to the failing rpc (xid = 1398863346727737, opcode = 48 = MDS_QUOTACTL) 00000100:00100000:5.0:1335289903.436104:0:31674:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lfs:d1093c79-65be-f3fa-3770-e19614ebeee7:31674:1398863346727737:10.151.26.38@o2ib:48 The initial sleep of 23 sec shown seems excessively high for the timeout of 10 set by the test script. The formula mentioned in the script comment of makes it seem like the number should be more like 17 or 18 ( 10 * 1.25 + 5 ). I'm wondering if there are some other settings in your environment forcing the rpc timeouts to be higher than normal, for example ldlm_timeout or timeouts related to your interconnect (IB). In attempting to reproduce this failure locally with tcp interconnect I find my lfs process timing out and returning in way under 10 secs every time. It never comes close to reaching the 28 sec sleep in the test script. |
| Comment by Jay Lan (Inactive) [ 26/Apr/12 ] |
|
I found the problem. We set at_min=15 on our systems in addition to at_max. The test should save both at_max and at_min before the test, and restore them after the test. |
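That would also explain the 23 second initial wait noted above: if at_min=15 replaces the 10 second base in the script's formula, the client waits roughly 15 * 1.25 + 5 ≈ 23.75 seconds before the first timeout, leaving almost no margin against the 28 second sleep (this reading of the formula is an inference from the comments above, not a statement from the test). A save/restore pattern along these lines could address it (a sketch; the helper names and the at_min=0/at_max=20 values are placeholders, and a real fix would apply the settings on every node, e.g. via the test framework's do_nodes/do_facet helpers):

    # Save the adaptive-timeout tunables, pin them for the test, restore afterwards.
    save_at_params() {
        SAVED_AT_MIN=$(lctl get_param -n at_min 2>/dev/null)
        SAVED_AT_MAX=$(lctl get_param -n at_max 2>/dev/null)
    }

    set_at_params() {
        # Placeholder values: pin the tunables to what the sleep formula assumes.
        lctl set_param at_min=0 at_max=20
    }

    restore_at_params() {
        [ -n "$SAVED_AT_MIN" ] && lctl set_param at_min=$SAVED_AT_MIN
        [ -n "$SAVED_AT_MAX" ] && lctl set_param at_max=$SAVED_AT_MAX
    }

    save_at_params
    set_at_params
    trap restore_at_params EXIT      # restore the site settings even if the test fails
    # ... timing-sensitive part of test_29 runs here ...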
| Comment by Zhenyu Xu [ 26/Apr/12 ] |
|
The sleep time should allow for the worst case, and I think I can improve the test script by checking the lfs process before the deadline, which would be better. |
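In other words, replace the single fixed sleep with a poll on the background lfs process, bounded by a worst-case deadline. A sketch of that idea (not the actual change at http://review.whamcloud.com/2601; MAX_WAIT and the error helper are placeholders):

    # Poll for the background lfs to exit instead of sleeping a fixed 28 seconds.
    MAX_WAIT=55                       # placeholder worst-case deadline, in seconds
    waited=0
    while kill -0 $pid 2>/dev/null && [ $waited -lt $MAX_WAIT ]; do
        sleep 1
        waited=$((waited + 1))
    done

    if kill -0 $pid 2>/dev/null; then
        kill -9 $pid
        error "lfs setquota still hung after ${MAX_WAIT}s"
    else
        echo "lfs setquota returned after ${waited}s"
    fi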
| Comment by Jay Lan (Inactive) [ 27/Apr/12 ] |
|
How do I add a comment to http://review.whamcloud.com/2601 ?

Patch Set 2 did work in my environment. However, remember that my problem

BTW, I understand the first 10 of the formula "2 * (10 * 1.25 + 5 + 10)" is the |
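For reference, that formula evaluates to 2 * (10 * 1.25 + 5 + 10) = 2 * 27.5 = 55 seconds. Read against the script comment quoted in the description, 10 * 1.25 + 5 resembles the adaptive RPC timeout estimate for a base timeout of 10, the trailing 10 looks like a network-latency allowance, and the factor of 2 presumably covers a resend; those interpretations are inferences from this ticket, not documentation of the patch.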
| Comment by Peter Jones [ 27/Apr/12 ] |
|
> How do I add a comment to http://review.whamcloud.com/2601 ?

I would guess that the missing step would be to log in to Gerrit... |
| Comment by Jay Lan (Inactive) [ 04/Sep/12 ] |
|
The patch set #7 was landed to master on July 12. |
| Comment by Zhenyu Xu [ 04/Sep/12 ] |
|
b2_1 patch port tracking at http://review.whamcloud.com/3870 |
| Comment by Peter Jones [ 22/Dec/12 ] |
|
Landed for 2.1.4 and 2.4 |