  Lustre / LU-8136

sanity-hsm test_9 fails with 'request on 0x200000405:0x4:0x0 is not SUCCEED on mds1'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.9.0
    • Component/s: None
    • Labels: autotest, review-dne
    • Severity: 3

    Description

      sanity-hsm test 9 fails with

      'request on 0x200000405:0x4:0x0 is not SUCCEED on mds1'  
      

      The last thing seen in the test log before the failure is

      CMD: trevis-5vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: trevis-5vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: trevis-5vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: trevis-5vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
       sanity-hsm test_9: @@@@@@ FAIL: request on 0x200000405:0x4:0x0 is not SUCCEED on mds1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4769:error()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:766:wait_request_state()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1010:test_9()
        = /usr/lib64/lustre/tests/test-framework.sh:5033:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5072:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4919:run_test()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1016:main()
      Dumping lctl log to /logdir/test_logs/2016-05-11/lustre-reviews-el7-x86_64--review-dne-part-2--1_7_1__38816__-70227460739120-004004/sanity-hsm.test_9.*.1462941864.log
       

      Is this the same or similar issue as in LU-8111?

      So far, this test is only failing in review-dne-* test groups. Test 9 started failing with this error in the past two days, with seven failures so far. Here are the failures:
      2016-05-10 - https://testing.hpdd.intel.com/test_sets/dd4e43d0-168a-11e6-855a-5254006e85c2
      2016-05-10 - https://testing.hpdd.intel.com/test_sets/ab71c010-16d2-11e6-855a-5254006e85c2
      2016-05-10 - https://testing.hpdd.intel.com/test_sets/6f746a34-1710-11e6-855a-5254006e85c2
      2016-05-10 - https://testing.hpdd.intel.com/test_sets/ae6d43d4-1713-11e6-9b34-5254006e85c2
      2016-05-11 - https://testing.hpdd.intel.com/test_sets/541a3da0-1732-11e6-b5f1-5254006e85c2
      2016-05-11 - https://testing.hpdd.intel.com/test_sets/0923a158-173f-11e6-855a-5254006e85c2
      2016-05-11 - https://testing.hpdd.intel.com/test_sets/6f95b094-1757-11e6-9b34-5254006e85c2
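
      For reference, wait_request_state() in sanity-hsm.sh builds the awk/cut pipeline shown above and hands it to wait_result(), which polls the MDS until the wanted state appears or the timeout expires. Below is a minimal local sketch of that polling loop, for illustration only: it assumes the mdt.lustre-MDT0000.hsm.actions parameter and the 200s timeout seen in the log, and is not the actual test-framework code.

      # Sketch only: approximates wait_request_state()/wait_result() based on
      # the command and timeout visible in the log above; not the real helpers.
      sketch_wait_request_state() {
              local fid=$1        # e.g. 0x200000405:0x4:0x0
              local request=$2    # e.g. ARCHIVE
              local state=$3      # e.g. SUCCEED
              local timeout=${4:-200}
              local param="mdt.lustre-MDT0000.hsm.actions"
              local elapsed=0 got=""

              while [ $elapsed -lt $timeout ]; do
                      # Same pipeline as in the log: field 13 carries the
                      # "status=..." column of the hsm.actions entry.
                      got=$(/usr/sbin/lctl get_param -n $param |
                              awk '/'$fid'.*action='$request'/ {print $13}' |
                              cut -f2 -d=)
                      [ "$got" = "$state" ] && return 0
                      sleep 1
                      elapsed=$((elapsed + 1))
              done
              echo "Update not seen after ${timeout}s: wanted '$state' got '$got'" >&2
              return 1
      }

      # Usage (hypothetical): sketch_wait_request_state 0x200000405:0x4:0x0 ARCHIVE SUCCEED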

      Attachments

        Activity

          [LU-8136] sanity-hsm test_9 fails with 'request on 0x200000405:0x4:0x0 is not SUCCEED on mds1'

          dmiter Dmitry Eremin (Inactive) added a comment -

          Thanks Bruno,

          I don't understand how my patch can affect this functionality, but I will look into it. I parallelize only regular I/O; other I/O should use the old pipeline. Even with parallel I/O we should not have multiple requests.

          bfaccini Bruno Faccini (Inactive) added a comment - edited

          Dmitry,
          I have checked the recent auto-test results in Maloo, and it seems that all recent failures of the sanity-hsm test_12*, test_33-36, test_57-58, test_110* and test_222* subtests have occurred only during test sessions for your patches for LU-8964, plus a few with the patch for LU-8709, and always during implicit restore operations where the result interpretation yields multiple statuses/lines (due to multiple requests??...).
          If I understand correctly, the intent of both tickets/patches is to optimize/parallelize the client's regular/read-ahead I/O, so could it be possible that your patch causes multiple hsm_restore requests to be generated for the same file/FID, of which only one will succeed?
          If so, we may only need to strengthen the sanity-hsm wait_request_state() function so that it filters for the wanted status in its check command:

          [root@eagle-31 lustre-release]# diff -urpN /usr/lib64/lustre/tests/sanity-hsm.sh.bfi /usr/lib64/lustre/tests/sanity-hsm.sh.bfi+
          --- /usr/lib64/lustre/tests/sanity-hsm.sh.bfi   2016-12-28 16:41:09.000000000 +0000
          +++ /usr/lib64/lustre/tests/sanity-hsm.sh.bfi+  2016-12-29 10:40:08.000000000 +0000
          @@ -724,7 +724,8 @@ wait_request_state() {
                  local mds=mds$(($mdtidx + 1))
           
                  local cmd="$LCTL get_param -n ${MDT_PREFIX}${mdtidx}.hsm.actions"
          -       cmd+=" | awk '/'$fid'.*action='$request'/ {print \\\$13}' | cut -f2 -d="
          +       cmd+=" | awk '/'$fid'.*action='$request'/ {print \\\$13}' |\
          +            cut -f2 -d= | uniq | grep $state"
           
                  wait_result $mds "$cmd" $state 200 ||
                          error "request on $fid is not $state on $mds"
          [root@eagle-31 lustre-release]# 
          

          What do you think ??
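
          For illustration (with invented values, not taken from the ticket's logs), this is the failure mode the "| uniq | grep $state" addition above is meant to handle: when the pipeline matches two request lines for the same fid, its output is multi-line and can never be string-equal to the single state that wait_result() compares against.

          # Two statuses extracted for the same fid/action, e.g. from duplicate requests.
          got=$(printf 'SUCCEED\nFAILED\n')   # what the unmodified pipeline would return
          state=SUCCEED

          [ "$got" = "$state" ] && echo "match" || echo "no match"
          # -> no match (the multi-line value never equals the single wanted state)

          # With the proposed "| uniq | grep $state" suffix, only the wanted state survives:
          printf 'SUCCEED\nFAILED\n' | uniq | grep "$state"
          # -> SUCCEED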

          bfaccini Bruno Faccini (Inactive) added a comment - edited

          Well, having a look at the recent failed auto-test logs, it seems that their problem is not the one tracked in this ticket (the CT failing to register with all MDTs in too short a time, requiring a delay to be added).
          My feeling comes from the same set of lines:

          CMD: trevis-35vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Changed after 16s: from 'SUCCEED
          FAILED' to ''
          

          that can be found in each of these tests' main logs, showing that the SUCCEED request state may not have been detected due to some (new?) issue in the wait_request_state() function when interpreting the "hsm/actions" proc file output.


          bfaccini Bruno Faccini (Inactive) added a comment -

          Well, maybe the delay and the CT registration verification should be generalized (in copytool_setup()?), as John H had already suggested when commenting on my first patch.
          I will check the auto-test links you have provided and see whether they show the same behavior as the original ones.

          dmiter Dmitry Eremin (Inactive) added a comment - edited

          The same failure now happens with test_12*, test_33-36, test_57-58, test_110*, test_222*:
          https://testing.hpdd.intel.com/test_sets/7c0d8752-cc8d-11e6-9816-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/85ff554c-ccd8-11e6-9296-5254006e85c2

          pjones Peter Jones added a comment -

          Landed for 2.9


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20258/
          Subject: LU-8136 tests: allow CT to registers with all MDTs
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: f56b701235674c8661691d73f55570c161fa13cb

          cengku9660 Gu Zheng (Inactive) added a comment -

          Hit a similar problem on our local autotest:
          Tue May 31 11:28:27 JST 2016: waited 200 secs for update
          Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
          sanity-hsm test_12g: @@@@@@ FAIL: request on 0x280000401:0xe:0x0 is not SUCCEED on mds1
          Trace dump:
          = /usr/lib64/lustre/tests/test-framework.sh:4730:error_noexit()
          = /usr/lib64/lustre/tests/test-framework.sh:4761:error()
          = /usr/lib64/lustre/tests/sanity-hsm.sh:704:wait_request_state()
          = /usr/lib64/lustre/tests/sanity-hsm.sh:1278:test_12g()
          = /usr/lib64/lustre/tests/test-framework.sh:5008:run_one()
          = /usr/lib64/lustre/tests/test-framework.sh:5045:run_one_logged()
          = /usr/lib64/lustre/tests/test-framework.sh:4910:run_test()
          = /usr/lib64/lustre/tests/sanity-hsm.sh:1291:main()
          So I am wondering whether we should also add the wait step to the other sub-tests; if it happened to test_9, it may happen to other cases too, e.g. test_12g above.
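
          As an illustration of that idea, a generic helper along these lines could be called from copytool_setup() so that every sub-test waits for the copytool to be registered with all MDTs before issuing HSM requests. This is a sketch only, not the landed patch; it assumes the mdt.*.hsm.agents parameter lists the registered agents and reuses the usual test-framework helpers (do_facet, error, MDSCOUNT, FSNAME, LCTL).

          # Sketch only: wait until at least one copytool agent is registered
          # with every MDT before letting the test issue HSM requests.
          wait_copytool_registered() {
                  local timeout=${1:-60}
                  local i

                  for ((i = 0; i < MDSCOUNT; i++)); do
                          local mdt=$(printf "MDT%04x" $i)
                          local facet=mds$((i + 1))
                          local elapsed=0
                          local agents=""

                          while [ -z "$agents" ]; do
                                  # hsm.agents lists the agents (CTs) currently
                                  # registered with this MDT (assumption).
                                  agents=$(do_facet $facet $LCTL get_param -n \
                                           mdt.$FSNAME-$mdt.hsm.agents 2>/dev/null)
                                  [ -n "$agents" ] && break
                                  (( elapsed++ >= timeout )) &&
                                          error "no copytool registered with $mdt after ${timeout}s"
                                  sleep 1
                          done
                  done
          }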


          gerrit Gerrit Updater added a comment -

          Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/20258
          Subject: LU-8136 tests: allow CT to registers with all MDTs
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: b9f23bd58cedf93898bb8a0854f12d181bb0f39b

          bfaccini Bruno Faccini (Inactive) added a comment -

          James,
          Well, I had a look at the last/most recent failure you reported (2016-05-11 - https://testing.hpdd.intel.com/test_sets/6f95b094-1757-11e6-9b34-5254006e85c2), and it looks like, according to the 1st MDS (handling MDT1 and MDT3), the CT may have started slowly (likely due to the need to register with each MDT when DNE is configured), which may allow the archive request to be trashed/unseen on the CT side until its startup has completed:

          CT registers with 1st MDT:
          00000100:00100000:0.0:1462941602.244264:0:5563:0:(service.c:2070:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:50b3d422-923b-abd9-9810-4f7c10608c4b+8:17919:x1533991955961472:12345-10.9.4.43@tcp:59
          00000100:00100000:0.0:1462941602.244278:0:5563:0:(service.c:2120:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:50b3d422-923b-abd9-9810-4f7c10608c4b+8:17919:x1533991955961472:12345-10.9.4.43@tcp:59 Request procesed in 13us (26us total) trans 0 rc 0/0
          00000100:00100000:0.0:1462941602.244280:0:5563:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.9.4.43@tcp, seq: 308
          00000100:00100000:0.0:1462941602.245002:0:6662:0:(events.c:351:request_in_callback()) peer: 12345-10.9.4.42@tcp
          00000100:00100000:0.0:1462941602.245007:0:5563:0:(service.c:1922:ptlrpc_server_handle_req_in()) got req x1534001997146880
          00000100:00100000:0.0:1462941602.245012:0:5563:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.9.4.42@tcp, seq: 309
          Client sends hsm_archive request to MDT/CDT:
          00000100:00100000:0.0:1462941602.245014:0:5563:0:(service.c:2070:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:bed34d71-9fc0-26f6-02c9-3e3df67d2b69+38:22988:x1534001997146880:12345-10.9.4.42@tcp:58
          00000040:00080000:0.0:1462941602.245032:0:5563:0:(llog_cat.c:735:llog_cat_process_cb()) processing log 0x15:1:0 at index 1 of catalog 0x8:10
          00000040:00080000:0.0:1462941602.245138:0:5563:0:(llog_osd.c:696:llog_osd_write_rec()) added record [0x1:0x15:0x0]: idx: 4, 136 off8736
          00000100:00100000:0.0:1462941602.245149:0:5563:0:(service.c:2120:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:bed34d71-9fc0-26f6-02c9-3e3df67d2b69+38:22988:x1534001997146880:12345-10.9.4.42@tcp:58 Request procesed in 135us (148us total) trans 0 rc 0/0
          00000100:00100000:0.0:1462941602.245152:0:5563:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.9.4.42@tcp, seq: 309
          00000040:00080000:0.0:1462941602.245161:0:12867:0:(llog_cat.c:735:llog_cat_process_cb()) processing log 0x15:1:0 at index 1 of catalog 0x8:10
          00000040:00100000:0.0:1462941602.245165:0:12867:0:(llog.c:211:llog_cancel_rec()) Canceling 2 in log 0x15:1
          00000040:00100000:0.0:1462941602.245172:0:12867:0:(llog.c:211:llog_cancel_rec()) Canceling 3 in log 0x15:1
          MDT/CDT sends archive request to CT:
          00000100:00100000:0.0:1462941602.245214:0:12867:0:(client.c:1589:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc hsm_cdtr:lustre-MDT0000_UUID:12867:1533991886170496:10.9.4.43@tcp:107
          00000100:00100000:0.0:1462941602.245224:0:12867:0:(client.c:2287:ptlrpc_set_wait()) set ffff8800482b46c0 going to sleep for 11 seconds
          00000100:00100000:0.0:1462941602.245250:0:6662:0:(events.c:351:request_in_callback()) peer: 12345-10.9.4.43@tcp
          00000100:00100000:0.0:1462941602.245353:0:5563:0:(service.c:1922:ptlrpc_server_handle_req_in()) got req x1533991955961504
          00000100:00100000:0.0:1462941602.245361:0:5563:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.9.4.43@tcp, seq: 310
          CT registers with 2nd MDT:
          00000100:00100000:0.0:1462941602.245363:0:5563:0:(service.c:2070:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:50b3d422-923b-abd9-9810-4f7c10608c4b+5:17919:x1533991955961504:12345-10.9.4.43@tcp:59
          00000100:00100000:0.0:1462941602.245377:0:5563:0:(service.c:2120:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:50b3d422-923b-abd9-9810-4f7c10608c4b+5:17919:x1533991955961504:12345-10.9.4.43@tcp:59 Request procesed in 14us (127us total) trans 0 rc 0/0
          00000100:00100000:0.0:1462941602.245379:0:5563:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.9.4.43@tcp, seq: 310
          Agent answers the MDT/CDT archive request (after trashing it because of some race during CT start?):
          00000100:00100000:0.0:1462941602.245679:0:12867:0:(client.c:1997:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc hsm_cdtr:lustre-MDT0000_UUID:12867:1533991886170496:10.9.4.43@tcp:107
          00000040:00080000:0.0:1462941602.245689:0:12867:0:(llog_cat.c:735:llog_cat_process_cb()) processing log 0x15:1:0 at index 1 of catalog 0x8:10
          00000001:02000400:0.0:1462941602.356340:0:13401:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x4:0x0'.*action='ARCHIVE'/ {print $13}' | cut -f2 -d=
          

          By the way, even though the possible race will need to be investigated further, we may already be able to avoid it by moving the CT start to the beginning of sanity-hsm/test_9(), versus the current code:

          test_9() {
                  mkdir -p $DIR/$tdir
                  local f=$DIR/$tdir/$tfile
                  local fid=$(copy_file /etc/passwd $f)
                  # we do not use the default one to be sure
                  local new_an=$((HSM_ARCHIVE_NUMBER + 1))
                  copytool_cleanup
                  copytool_setup $SINGLEAGT $MOUNT $new_an
                  $LFS hsm_archive --archive $new_an $f
                  wait_request_state $fid ARCHIVE SUCCEED
          
                  check_hsm_flags $f "0x00000009"
          
                  copytool_cleanup
          }
          run_test 9 "Use of explicit archive number, with dedicated copytool"
          

          I will push a first patch in this direction and also continue to investigate, to better understand the possible implications of the DNE configuration on CT startup.
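
          For illustration only, the reordering described above might look roughly like the following (a hypothetical sketch, not necessarily the patch that was eventually landed): the copytool is started first, so its registration with all MDTs can proceed while the test file is being created.

          test_9() {
                  # Start the dedicated copytool first, so CT registration with
                  # all MDTs has time to complete before the archive request.
                  # we do not use the default archive number, to be sure
                  local new_an=$((HSM_ARCHIVE_NUMBER + 1))
                  copytool_cleanup
                  copytool_setup $SINGLEAGT $MOUNT $new_an

                  mkdir -p $DIR/$tdir
                  local f=$DIR/$tdir/$tfile
                  local fid=$(copy_file /etc/passwd $f)

                  $LFS hsm_archive --archive $new_an $f
                  wait_request_state $fid ARCHIVE SUCCEED

                  check_hsm_flags $f "0x00000009"

                  copytool_cleanup
          }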


          People

            Assignee: wc-triage WC Triage
            Reporter: jamesanunez James Nunez (Inactive)
            Votes: 0
            Watchers: 8
