[LU-492] Test failure on sanity-quota test_29 Created: 07/Jul/11  Updated: 23/Apr/12  Resolved: 08/Jul/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.1.0

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4968

 Description   

This issue was created by maloo for bobijam <bobijam@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/ba4bfe42-a856-11e0-bd2a-52540025f9af.



 Comments   
Comment by Zhenyu Xu [ 07/Jul/11 ]

The test sets the client at_max to 10 seconds and sleeps 20 seconds, expecting the quotactl RPC to time out and the lfs setquota command to exit.

But the client log shows:

21:41:13:Lustre: 15526:0:(client.c:1775:ptlrpc_expire_one_request()) @@@ Request x1373648715776149 sent from lustre-MDT0000-mdc-ffff81006251e400 to NID 10.10.4.72@tcp has timed out for slow reply: [sent 1310013628] [real_sent 1310013628] [current 1310013672] [deadline 44s] [delay 0s] req@ffff810049e06c00 x1373648715776149/t0(0) o-1->lustre-MDT0000_UUID@10.10.4.72@tcp:12/10 lens 304/304 e 0 to 1 dl 1310013672 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1

which shows that the RPC only timed out after 44 seconds, well past the 20-second sleep. This causes the failure.
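The 44-second figure can be read off the log line itself, as the difference between the [current] and [sent] timestamps. A small sketch of that arithmetic follows; the variable names are illustrative only, and the 20-second value is the sleep described in the comment above.

```shell
#!/bin/sh
# Timestamps copied from the ptlrpc_expire_one_request() log line.
sent=1310013628
current=1310013672
test_sleep=20                      # seconds the test slept before checking lfs

elapsed=$((current - sent))        # time the request waited before expiring
overshoot=$((elapsed - test_sleep))

echo "request expired after ${elapsed}s, ${overshoot}s later than the test waited"
```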

Comment by Zhenyu Xu [ 07/Jul/11 ]

The client sets the request deadline in ptl_send_rpc():

        /* We give the server rq_timeout secs to process the req, and
           add the network latency for our local timeout. */
        request->rq_deadline = request->rq_sent + request->rq_timeout +
                ptlrpc_at_get_net_latency(request);

where rq_timeout is derived from the estimated server processing time, in ptlrpc_at_set_req_timeout():

                at = &req->rq_import->imp_at;
                idx = import_at_get_index(req->rq_import,
                                          req->rq_request_portal);
                serv_est = at_get(&at->iat_service_estimate[idx]);
                req->rq_timeout = at_est2timeout(serv_est);

and at_est2timeout() pads that estimate:

        /* add an arbitrary minimum: 125% +5 sec */
        return (val + (val >> 2) + 5);

So with at_max at 10 seconds, the test should sleep (10 * 1.25 + 5) + 10 ~= 28 seconds.
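The same arithmetic can be reproduced with the integer form used by at_est2timeout(). This is only a sketch: the shell function below mirrors the C expression quoted above, and it assumes, as the estimate in this comment does, that both the service estimate and the network latency sit at the at_max cap of 10 seconds.

```shell
#!/bin/sh
# Integer form of at_est2timeout(): val + val/4 + 5, as in the C snippet.
at_est2timeout() {
    echo $(( $1 + ($1 >> 2) + 5 ))
}

at_max=10                                 # seconds, as set by the test
rq_timeout=$(at_est2timeout "$at_max")    # 10 + 2 + 5 = 17
net_latency=$at_max                       # assumption: latency estimate at its cap
deadline=$(( rq_timeout + net_latency ))  # 17 + 10 = 27

echo "rq_timeout=${rq_timeout}s worst-case deadline=${deadline}s"
```

With integer shifts the result is 27 seconds (27.5 in the floating-point form), so sleeping 28 seconds leaves about one second of margin under these assumptions.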

Comment by Niu Yawei (Inactive) [ 07/Jul/11 ]

It seems the test does not behave as we expected. The script sets fail_loc to OBD_FAIL_MDS_QUOTACTL_NET, but I could not find where OBD_FAIL_MDS_QUOTACTL_NET is checked in the Lustre code. Might the check have been removed by mistake?

Comment by Zhenyu Xu [ 07/Jul/11 ]

The patch is tracked at http://review.whamcloud.com/1069

Comment by Zhenyu Xu [ 07/Jul/11 ]

Niu,

Please look at mdt_mds_ops[], where we define

DEF_MDT_HNDL_F(0,                         QUOTACTL,     mdt_quotactl_handle)

which expands to

DEF_HNDL(MDS, GETATTR, _NET, 0, QUOTACTL, mdt_quotactl_handle, &RQF_MDS_QUOTACTL)

and expands further to

[MDS_QUOTACTL - MDS_GETATTR] = {
        .mh_name    = "QUOTACTL",
        .mh_fail_id = OBD_FAIL_MDS_QUOTACTL_NET,
        .mh_opc     = MDS_QUOTACTL,
        .mh_flags   = 0,
        .mh_act     = mdt_quotactl_handle,
        .mh_fmt     = &RQF_MDS_QUOTACTL
}

and in mdt_req_handle(), the fail_loc is checked as follows

        if (OBD_FAIL_CHECK_ORSET(h->mh_fail_id, OBD_FAIL_ONCE))
                RETURN(0);
Comment by Niu Yawei (Inactive) [ 07/Jul/11 ]

ah, I see, thank you.

Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Peter Jones [ 08/Jul/11 ]

Patch landed for 2.1

Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,client,el5,ofa #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,server,el5,ofa #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #197
LU-492 fix sanity-quota test_29

Oleg Drokin : ef803602bf2ee9ed6aabb09aafe23fe036e7b8b2
Files :

  • lustre/tests/sanity-quota.sh
Comment by Jay Lan (Inactive) [ 23/Apr/12 ]

I am running 2.1.1
(see https://github.com/jlan/lustre-nas/tree/nas-2.1.1)
on both the server and the client. It contains the fix for this ticket.

However, the test consistently fails in my 2.1.1 server + 2.1.1 client test environment.

test_29()
{
...

    # actually send a RPC to make service at_current confined within at_max
    $LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR || error "should succeed"
    <=== succeeded

#define OBD_FAIL_MDS_QUOTACTL_NET 0x12e
lustre_fail mds 0x12e
<==== fine

$LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR & pid=$!
<==== "setquota failed: Transport endpoint is not connected"

echo "sleeping for 10 * 1.25 + 5 + 10 seconds"
sleep 28
ps -p $pid && error "lfs hadn't finished by timeout"
<==== the process is still alive. It dies later due to the timeout.
...

Is "setquota failed: Transport endpoint is not connected" error expected?
Was that the result of "lustre_fail mds 0x12e"?

I then tried "sleep 40" (instead of "sleep 28"); the lfs
command timed out before the check and the test passed. It seems
the sleep formula "10 * 1.25 + 5 + 10 seconds" is not long enough?
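One way to make such a check less sensitive to the exact sleep value is to poll for the process with an upper bound instead of sleeping a fixed time. The sketch below is not the fix that landed and is not taken from sanity-quota.sh; wait_for_exit is a hypothetical helper, and a plain "sleep 2" stands in for the backgrounded lfs setquota command.

```shell
#!/bin/sh
# Hypothetical helper (not part of sanity-quota.sh): wait up to $2 seconds
# for process $1 to exit. Returns 0 if it exited, 1 if it is still running.
wait_for_exit() {
    wfe_pid=$1
    wfe_max=$2
    wfe_waited=0
    while kill -0 "$wfe_pid" 2>/dev/null; do
        [ "$wfe_waited" -ge "$wfe_max" ] && return 1
        sleep 1
        wfe_waited=$((wfe_waited + 1))
    done
    return 0
}

# Stand-in for the backgrounded "lfs setquota"; a real run would use its pid.
sleep 2 &
if wait_for_exit $! 10; then
    echo "background command finished in time"
else
    echo "background command still running after 10s"
fi
```

The upper bound still has to come from the timeout estimate, so this only removes the dependence on hitting the deadline to the second.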

Generated at Sat Feb 10 01:07:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.