
Test failure on sanity-quota test_29

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.1.4
    • Affects Version: Lustre 2.1.1
    • None
    • Environment:
      Server: rhel6.2 with lustre-2.1.1
      Client: rhel6.2 with lustre-client-2.1.1

      MDS/MGS: service360
      OSS1: service361
      OSS2: service362
      Client1: service333
      Client2: service334
    • 3
    • 4547

    Description

      This looks like a duplicate of LU-492, but my software already contains the fix for LU-492; that patch did not help in my testing.

      The git source of our code is at https://github.com/jlan/lustre-nas/tree/nas-2.1.1

      The command I issued was:

      ONLY=29 cfg/nas.v3.sh SANITY_QUOTA

      The script files nas.v3.sh and ncli_nas.v3.sh are attached.
      The test log tarball sanity-quota-1335289931.tar.bz2 is also attached.

      The failure is reproducible.

      test_29()
      {
      ...

      # actually send a RPC to make service at_current confined within at_max
      $LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR || error "should succeed"
      <=== succeeded

      #define OBD_FAIL_MDS_QUOTACTL_NET 0x12e
      lustre_fail mds 0x12e
      <==== fine

      $LFS setquota -u $TSTUSR -b 0 -B $BLK_LIMIT -i 0 -I 0 $DIR & pid=$!
      <==== "setquota failed: Transport endpoint is not connected"

      echo "sleeping for 10 * 1.25 + 5 + 10 seconds"
      sleep 28
      ps -p $pid && error "lfs hadn't finished by timeout"
      <==== the process is still alive; it dies later due to timeout.
      ...

      Is "setquota failed: Transport endpoint is not connected" error expected?
      I saw that in the test log.
      Was that the result of "lustre_fail mds 0x12e", or did that mean the mds did not see the lustre_fail request? Remote commands were sent via pdsh.

      If I tried a "sleep 40" (instead of "sleep 28" after that, the lfs
      command timed out before the check and the test passed. It seems
      the sleep formula "10 * 1.25 + 5 + 10 seconds" is not long enough?
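
      For reference, a minimal shell sketch of how the sleep formula above appears to break down. This is illustrative only: the variable names are made up, and the idea that a larger at_min can override at_max comes from the discussion in the comments below, not from the test code itself.

      # Illustrative only: how "10 * 1.25 + 5 + 10" seems to be built up, and how a
      # site-configured at_min larger than at_max can stretch the worst-case wait.
      at_max=10      # value test_29 sets
      at_min=15      # hypothetical site setting; many sites use a smaller value
      timeout=$(( at_min > at_max ? at_min : at_max ))   # the larger bound appears to win
      # "timeout * 1.25 + 5" plus the trailing "+ timeout" term from the script's echo
      echo "expected worst-case wait: $(echo "$timeout * 1.25 + 5 + $timeout" | bc) seconds"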

      Attachments

        1. nas.v3.sh
          3 kB
        2. ncli_nas.v3.sh
          2 kB
        3. sanity-quota-1335289931.tar.bz2
          4.61 MB

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.1.4 and 2.4

          bobijam Zhenyu Xu added a comment -

          b2_1 patch port tracking at http://review.whamcloud.com/3870


          jaylan Jay Lan (Inactive) added a comment -

          The patch set #7 was landed to master on July 12.
          Can we land this patch on the b2_1 branch and close the ticket?
          pjones Peter Jones added a comment -

          > How do I add comment to http://review.whamcloud.com/2601 ?

          I would guess that the missing step would be to log in to Gerrit...


          jaylan Jay Lan (Inactive) added a comment -

          How do I add a comment to http://review.whamcloud.com/2601 ?

          Patch Set 2 did work in my environment. However, remember that my problem
          was that test_29 lowered at_max on the client to a value smaller than at_min. Our site
          sets at_min to 15, so deadline=56 would work, but theoretically it can fail at a
          site with a higher at_min value.

          BTW, I understand the first 10 in the formula "2 * (10 * 1.25 + 5 + 10)" is the
          timeout value test_29 wants to set. I guess the second 10 is not the same timeout
          value; otherwise, you would simply use 10 * 2.25. So, what is the second 10?
          bobijam Zhenyu Xu added a comment -

          The sleep time should cover the worst case, and I think I can improve the test script by checking the lfs process before the deadline; that would be better.

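
          Something along these lines is presumably what is meant. The following is only a sketch, not the landed patch; the poll interval and the deadline value are made up for illustration, while $pid and error come from the existing test context.

          # Sketch only: poll for the background lfs to exit instead of relying on
          # a single fixed sleep, and only fail once a worst-case deadline passes.
          deadline=40                    # hypothetical worst-case bound, in seconds
          elapsed=0
          while ps -p $pid > /dev/null 2>&1; do
                  [ $elapsed -ge $deadline ] && error "lfs hadn't finished by timeout"
                  sleep 1
                  elapsed=$((elapsed + 1))
          done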

          jaylan Jay Lan (Inactive) added a comment -

          I found the problem.

          We set at_min=15 on our systems in addition to at_max.
          test_29 reset at_max to 10 but left at_min at the higher value, which
          forced the effective timeout to 15 instead of 10.

          The test should save both at_max and at_min before the test and restore
          them afterwards.
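
          A save/restore sketch along those lines, for illustration only; the exact parameter paths and the helpers used by the real test framework vary between releases, so treat the lctl calls below as an assumption.

          # Sketch only: save both adaptive-timeout bounds, lower them for the test,
          # and restore them afterwards.
          saved_at_max=$(lctl get_param -n at_max)
          saved_at_min=$(lctl get_param -n at_min)
          lctl set_param at_max=10 at_min=0
          # ... run the timeout-sensitive part of test_29 ...
          lctl set_param at_max=$saved_at_max at_min=$saved_at_min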

          "setquota failed: Transport endpoint is not connected" is the expected error.
          This is the error reported when the client rpc fails within the sleep time.
          The write to fail_loc on your server is happening correctly. It causes the MDS_QUOTACTL rpc handled on the server to abort & not ever return a reply. This is shown by the following from the server log:

          00000100:00100000:7.0:1335289642.393283:0:14946:0:(service.c:1536:ptlrpc_server_handle_req_in()) got req x1398863346727737
          00000100:00100000:7.0:1335289642.393297:0:14946:0:(service.c:1713:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_01:d1093c79-65be-f3fa-3770-e19614ebeee7+6:31674:x1398863346727737:12345-10.151.25.184@o2ib:48
          00000004:00020000:7.0:1335289642.393301:0:14946:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=12e ***
          00000100:00100000:7.0:1335289642.415971:0:14946:0:(service.c:1760:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_01:d1093c79-65be-f3fa-3770-e19614ebeee7+5:31674:x1398863346727737:12345-10.151.25.184@o2ib:48 Request procesed in 22680us (22700us total) trans 0 rc 0/-999

          In the client log I see the following related to the failing rpc (xid = 1398863346727737, opcode = 48 = MDS_QUOTACTL)

          00000100:00100000:5.0:1335289903.436104:0:31674:0:(client.c:1395:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lfs:d1093c79-65be-f3fa-3770-e19614ebeee7:31674:1398863346727737:10.151.26.38@o2ib:48
          00000100:00100000:5.0:1335289903.436135:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 23 seconds
          00000100:00100000:5.0:1335289926.435314:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds
          00000100:00100000:5.0:1335289927.435311:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds
          00000100:00100000:5.0:1335289928.435312:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds
          00000100:00100000:5.0:1335289929.435312:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds
          00000100:00100000:5.0:1335289930.435311:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds
          00000100:00100000:5.0:1335289931.435300:0:31674:0:(client.c:1993:ptlrpc_set_wait()) set ffff8801940b8640 going to sleep for 1 seconds

          The initial sleep of 23 sec shown seems excessively high for the timeout of 10 set by the test script. The formula mentioned in the script comment makes it seem like the number should be more like 17 or 18 (10 * 1.25 + 5). I'm wondering if there are some other settings in your environment forcing the rpc timeouts to be higher than normal, for example ldlm_timeout or timeouts related to your interconnect (IB).

          In attempting to reproduce this failure locally with tcp interconnect I find my lfs process timing out and returning in way under 10 secs every time. It never comes close to reaching the 28 sec sleep in the test script.

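
          One back-of-the-envelope way to reconcile the 23 s initial sleep with the at_min=15 setting reported above; this is arithmetic only, not taken from the code, and it assumes the same "timeout * 1.25 + 5" shape.

          #   at_max=10 governing:  10 * 1.25 + 5 = 17.5   (what the script expects)
          #   at_min=15 governing:  15 * 1.25 + 5 = 23.75  (close to the 23 s in the log)
          echo "10 * 1.25 + 5; 15 * 1.25 + 5" | bc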
          bobijam Zhenyu Xu added a comment -

          I guess the test needs to take the network latency into account in the wait time.

          Patch tracking at http://review.whamcloud.com/2601

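
          A rough sketch of what folding the network latency into the wait could look like. This is not the patch at http://review.whamcloud.com/2601; NET_LATENCY is just a placeholder for however the latency would actually be obtained.

          # Sketch only: add a network-latency allowance to the worst-case wait
          # before checking whether the background lfs has exited.
          at_max=10             # value test_29 sets
          NET_LATENCY=5         # placeholder value; obtain from the environment in a real test
          wait_time=$(echo "$at_max * 1.25 + 5 + $at_max + 2 * $NET_LATENCY" | bc)
          echo "sleeping for $wait_time seconds"
          sleep ${wait_time%.*}   # integer seconds are enough here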
          pjones Peter Jones added a comment -

          Bobi

          Could you please look into this one?

          Thanks

          Peter


          People

            Assignee: bobijam Zhenyu Xu
            Reporter: jaylan Jay Lan (Inactive)