Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.1.1
-
None
-
Server: rhel6.2 with lustre-2.1.1
Client: rhel6.2 with lustre-client-2.1.1
MDS/MGS: service360
OSS1: service361
OSS2: service362
Client1: service333
Client2: service334
-
3
-
4547
Description
This looks like a duplicate of LU-492, but my build already contains the fix for LU-492. The LU-492 patch did not help in my testing.
The git source of our code is at https://github.com/jlan/lustre-nas/tree/nas-2.1.1
The command I issued was:
# ONLY=29 cfg/nas.v3.sh SANITY_QUOTA
The script files nas.v3.sh and ncli_nas.v3.sh are attached.
The test log tarball sanity-quota-1335289931.tar.bz2 is also attached.
The failure is reproducible.
test_29()
{
...
# actually send a RPC to make service at_current confined within at_max
$LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR || error "should succeed"
<=== succeeded
#define OBD_FAIL_MDS_QUOTACTL_NET 0x12e
lustre_fail mds 0x12e
<==== fine
$LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR & pid=$!
<==== "setquota failed: Transport endpoint is not connected"
echo "sleeping for 10 * 1.25 + 5 + 10 seconds"
sleep 28
ps -p $pid && error "lfs hadn't finished by timeout"
<==== the process is still alive here. It dies later due to a timeout.
...
Is the "setquota failed: Transport endpoint is not connected" error expected?
I saw it in the test log.
Was it the result of "lustre_fail mds 0x12e", or does it mean the MDS never saw the lustre_fail request? Remote commands were sent via pdsh.
If I use "sleep 40" (instead of "sleep 28") after that, the lfs
command times out before the check and the test passes. It seems
the sleep formula "10 * 1.25 + 5 + 10 seconds" is not long enough?
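
For reference, the formula in the test's echo message evaluates to 27.5 seconds, which matches the "sleep 28" once rounded up. The sketch below only reproduces that arithmetic in plain shell. The variable names are illustrative and do not come from sanity-quota.sh, and the 40-second figure is simply the value reported above as sufficient, not a derived bound.

```shell
# Terms copied from the echo message in test_29, held in tenths of a second
# so that integer shell arithmetic can represent the 1.25 factor.
formula_tenths=$(( 10 * 125 / 10 + 5 * 10 + 10 * 10 ))   # 275 tenths = 27.5 s
rounded_secs=$(( (formula_tenths + 9) / 10 ))            # round up to whole seconds
observed_ok_secs=40                                      # value reported to let the test pass
margin_secs=$(( observed_ok_secs - rounded_secs ))       # gap between the two sleep values

echo "formula: $(( formula_tenths / 10 )).$(( formula_tenths % 10 ))s"
echo "rounded up: ${rounded_secs}s"
echo "gap between sleep ${rounded_secs} and sleep ${observed_ok_secs}: ${margin_secs}s"
```

The real lfs timeout falls somewhere within that gap, but this report does not pin down where.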