[LU-1342] Test failure on sanity-quota test_29 Created: 24/Apr/12  Updated: 17/Apr/13  Resolved: 22/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: Lustre 2.1.4

Type: Bug Priority: Minor
Reporter: Jay Lan (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Server: rhel6.2 with lustre-2.1.1
Client: rhel6.2 with lustre-client-2.1.1

MDS/MGS: service360
OSS1: service361
OSS2: service362
Client1: service333
Client2: service334


Attachments: nas.v3.sh, ncli_nas.v3.sh, sanity-quota-1335289931.tar.bz2
Severity: 3
Rank (Obsolete): 4547

 Description   

This looks like a duplicate of LU-492, but my software already contains the fix for LU-492; that patch did not help in my testing.

The git source of our code is at https://github.com/jlan/lustre-nas/tree/nas-2.1.1

The command I issued was:

  ONLY=29 cfg/nas.v3.sh SANITY_QUOTA

The script files nas.v3.sh and ncli_nas.v3.sh are attached.
The test log tarball sanity-quota-1335289931.tar.bz2 is also attached.

The failure is reproducible.

test_29()
{
...
    # actually send a RPC to make service at_current confined within at_max
    $LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR || error "should succeed"
    <=== succeeded

    #define OBD_FAIL_MDS_QUOTACTL_NET 0x12e
    lustre_fail mds 0x12e
    <==== fine

    $LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR & pid=$!
    <==== "setquota failed: Transport endpoint is not connected"

    echo "sleeping for 10 * 1.25 + 5 + 10 seconds"
    sleep 28
    ps -p $pid && error "lfs hadn't finished by timeout"
    <==== the process is still alive; it dies later due to the timeout.
...

Is the "setquota failed: Transport endpoint is not connected" error expected?
I saw it in the test log.
Was it the result of "lustre_fail mds 0x12e", or does it mean the MDS did not see the lustre_fail request? Remote commands were sent via pdsh.

If I used a "sleep 40" (instead of "sleep 28") after that, the lfs
command timed out before the check and the test passed. It seems
the sleep formula "10 * 1.25 + 5 + 10 seconds" is not long enough?
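
For reference, the "sleep 28" follows from the formula in the echo above (a worked check of the arithmetic, assuming the first 10 is the at_max value the test sets):

  # 10 (at_max) scaled by 1.25, plus 5s and 10s of slack:
  # 10 * 1.25 + 5 + 10 = 27.5, rounded up to "sleep 28".
  echo $(( 10 * 125 / 100 + 5 + 10 ))   # prints 27 with integer math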



 Comments   
Comment by Peter Jones [ 25/Apr/12 ]

Bobi

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 25/Apr/12 ]

I guess the test needs to take the network latency into account in the wait-time value.

Patch tracking at http://review.whamcloud.com/2601

Comment by Bob Glossman (Inactive) [ 26/Apr/12 ]

"setquota failed: Transport endpoint is not connected" is the expected error.
This is the error reported when the client RPC fails within the sleep time.
The write to fail_loc on your server is happening correctly. It causes the MDS_QUOTACTL RPC handled on the server to abort and never return a reply. This is shown by the following from the server log:

00000100:00100000:7.0:1335289642.393283:0:14946:0:(service.c:1536:ptlrpc_server_handle_req_in()) got req x1398863346727737
00000100:00100000:7.0:1335289642.393297:0:14946:0:(service.c:1713:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_01:d1093c79-65be-f3fa-3770-e19614ebeee7+6:31674:x1398863346727737:12345-10.151.25.184@o2ib:48
00000004:00020000:7.0:1335289642.393301:0:14946:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=12e ***
00000100:00100000:7.0:1335289642.415971:0:14946:0:(service.c:1760:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_01:d1093c79-65be-f3fa-3770-e19614ebeee7+5:31674:x1398863346727737:12345-10.151.25.184@o2ib:48 Request procesed in 22680us (22700us total) trans 0 rc 0/-999

In the client log I see the following related to the failing RPC (xid = 1398863346727737, opcode = 48 = MDS_QUOTACTL):

00000100:00100000:5.0:1335289903.436104:0:31674:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lfs:d1093c79-65be-f3fa-3770-e19614ebeee7:31674:1398863346727737:10.151.26.38@o2ib:48
00000100:00100000:5.0:1335289903.436135:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 23 seconds
00000100:00100000:5.0:1335289926.435314:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds
00000100:00100000:5.0:1335289927.435311:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds
00000100:00100000:5.0:1335289928.435312:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds
00000100:00100000:5.0:1335289929.435312:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds
00000100:00100000:5.0:1335289930.435311:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds
00000100:00100000:5.0:1335289931.435300:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds

The initial sleep of 23 sec shown seems excessively high for the timeout of 10 set by the test script. The formula mentioned in the script comment makes it seem like the number should be more like 17 or 18 (10 * 1.25 + 5). I'm wondering if there are some other settings in your environment forcing the RPC timeouts to be higher than normal, for example ldlm_timeout or timeouts related to your interconnect (IB).

In attempting to reproduce this failure locally with a tcp interconnect, I find my lfs process timing out and returning in well under 10 seconds every time. It never comes close to reaching the 28-second sleep in the test script.
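
One quick way to check for such settings (a diagnostic sketch; these lctl parameters exist on 2.1-era systems, run it on both the client and the servers):

  # Adaptive-timeout bounds; if at_min exceeds the at_max the test sets (10),
  # the effective timeout is at_min rather than at_max.
  lctl get_param at_min at_max
  # Static timeouts that can also inflate RPC wait times:
  lctl get_param timeout ldlm_timeout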

Comment by Jay Lan (Inactive) [ 26/Apr/12 ]

I found the problem.

We set at_min=15 on our systems in addition to at_max.
test_29 reset at_max to 10, but left at_min at the higher value, thus
forcing the timeout value to be 15 instead of 10.

The test should save both at_max and at_min before the test, and restore
them afterwards.
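
A minimal sketch of that save/restore, assuming direct lctl access to the tunables (the actual fix would more naturally go through the test framework's at_max_set/at_max_get style helpers; variable names here are illustrative):

  # Save both adaptive-timeout bounds before the test changes them...
  saved_at_max=$(lctl get_param -n at_max)
  saved_at_min=$(lctl get_param -n at_min)
  lctl set_param at_max=10 at_min=0
  # ... run test_29 ...
  # ...and restore them afterwards, even when the test fails.
  lctl set_param at_max=$saved_at_max at_min=$saved_at_min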

Comment by Zhenyu Xu [ 26/Apr/12 ]

The sleep time should cover the worst case. I think I can improve the test script by checking the lfs process before the deadline, which would be better.
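
Something along these lines (a sketch of the polling idea, not the actual patch; the 56-second worst-case deadline matches the deadline=56 discussed below):

  # Poll for the lfs process to exit instead of one long fixed sleep.
  deadline=$((SECONDS + 56))
  while ps -p $pid > /dev/null 2>&1; do
      [ $SECONDS -ge $deadline ] && error "lfs hadn't finished by timeout"
      sleep 1
  done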

Comment by Jay Lan (Inactive) [ 27/Apr/12 ]

How do I add comment to http://review.whamcloud.com/2601 ?

Patch Set 2 did work in my environment. However, remember that my problem
arose because test_29 lowered at_max on the client to a value smaller than at_min. Our site
sets at_min to 15, so deadline=56 works, but theoretically it could fail at a
site with a higher at_min value.

BTW, I understand the first 10 in the formula "2 * (10 * 1.25 + 5 + 10)" is the
timeout value test_29 wants to set. I guess the second 10 is not the same timeout
value; otherwise, you would simply write 10 * 2.25. So, what is the second 10?

Comment by Peter Jones [ 27/Apr/12 ]

> How do I add comment to http://review.whamcloud.com/2601 ?

I would guess that the missing step would be to log in to gerrit...

Comment by Jay Lan (Inactive) [ 04/Sep/12 ]

Patch set #7 was landed to master on July 12.
Can we land this patch to the b2_1 branch and close the ticket?

Comment by Zhenyu Xu [ 04/Sep/12 ]

b2_1 patch port tracking at http://review.whamcloud.com/3870

Comment by Peter Jones [ 22/Dec/12 ]

Landed for 2.1.4 and 2.4
