
[LU-7273] sanity-hsm test_107 failed with 'Copytool failed to stop in 20s ...'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Components: None
    • Labels: autotest
    • Severity: 3

    Description

      sanity-hsm test_107 failed with the error:

      'Copytool failed to stop in 20s ...' 
      

      Logs are at https://testing.hpdd.intel.com/test_sets/070dca4c-6d12-11e5-ab7f-5254006e85c2
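
      For reference, the test can be rerun standalone on an already-configured setup; a minimal example, assuming the standard Lustre test-framework environment (ONLY= is the usual mechanism for selecting individual tests):

      # Run just test 107 of sanity-hsm against an existing setup.
      ONLY=107 bash lustre/tests/sanity-hsm.sh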

      There are no errors in the logs. The last entries seen in the test_log are:

      Copytool still running on shadow-22vm6
      CMD: shadow-22vm6 pgrep -x lhsmtool_posix
      shadow-22vm6: 24751
      Copytool still running on shadow-22vm6
      CMD: shadow-22vm6 pgrep -x lhsmtool_posix
      shadow-22vm6: 24751
       sanity-hsm test_107: @@@@@@ FAIL: Copytool failed to stop in 20s ... 
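
      For context, these messages come from the wait loop in copytool_cleanup() in the test framework. Below is a minimal sketch of that logic, assuming the usual test-framework helpers (do_node, error); the 20s budget and the pkill/pgrep pair match the log above, but the exact loop structure is an assumption:

      # Sketch of the stop-and-wait logic behind the messages above.
      copytool_stop_sketch() {
          local agent=$1
          local timeout=20

          # Ask the copytool to shut down cleanly (SIGINT, as in the log).
          do_node $agent "pkill -INT -x lhsmtool_posix"

          # Poll once per second until the process exits or the budget runs out.
          local i
          for ((i = 0; i < timeout; i++)); do
              do_node $agent "pgrep -x lhsmtool_posix" || return 0
              echo "Copytool still running on $agent"
              sleep 1
          done
          error "Copytool failed to stop in ${timeout}s ..."
      }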
      

      The same error has been hit with this test in the past. One example is https://testing.hpdd.intel.com/test_sets/3bd00e02-6a47-11e5-9a21-5254006e85c2

    Activity


            Joseph Gmitter (Inactive) added a comment -

            Landed for 2.8

            Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16782/
            Subject: LU-7273 tests: dump stacks upon CT stop failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 32dcb2a411163419fd7ab88d1442ae17d3e501cc


            Bruno Faccini (Inactive) added a comment -

            The Copytool log shows that both archive actions in test_107 have successfully completed.

            The Lustre debug log taken on the Agent node seems to indicate that the main copytool thread was still running at the end of the 20s grace period allowed in copytool_cleanup(). Stranger still, there are two log lines (ll_file_release()) for the single close() of mnt_fd in llapi_hsm_copytool_unregister():

            00000001:02000400:0.0:1444222093.159584:0:25285:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: pkill -INT -x lhsmtool_posix
            00000080:00200000:1.0:1444222093.348446:0:24751:0:(dir.c:1099:ll_dir_ioctl()) VFS Op:inode=[0x200000007:0x1:0x0](ffff88007c2c37c0), cmd=0x401866d5
            00000100:00100000:1.0:1444222093.348503:0:24751:0:(client.c:1627:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lhsmtool_posix:706467d4-9a29-db95-1bb0-fb11e4273b13:24751:1514365981493076:10.1.5.16@tcp:60
            00000100:00100000:1.0:1444222093.348551:0:24751:0:(client.c:2308:ptlrpc_set_wait()) set ffff88003733d2c0 going to sleep for 6 seconds
            00010000:00080000:1.0:1444222093.349748:0:24751:0:(ldlm_request.c:1307:ldlm_cli_update_pool()) @@@ Zero SLV or Limit found (SLV: 3378600000, Limit: 0)  req@ffff880053624080 x1514365981493076/t0(0) o60->lustre-MDT0000-mdc-ffff88007ad1d000@10.1.5.16@tcp:12/10 lens 440/192 e 0 to 0 dl 1444222100 ref 2 fl Rpc:R/0/0 rc 0/0
            00000100:00100000:1.0:1444222093.349766:0:24751:0:(client.c:2022:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc lhsmtool_posix:706467d4-9a29-db95-1bb0-fb11e4273b13:24751:1514365981493076:10.1.5.16@tcp:60
            00000100:00100000:1.0:1444222093.349815:0:24751:0:(client.c:1627:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lhsmtool_posix:706467d4-9a29-db95-1bb0-fb11e4273b13:24751:1514365981493080:10.1.5.12@tcp:60
            00000100:00100000:1.0:1444222093.349833:0:24751:0:(client.c:2308:ptlrpc_set_wait()) set ffff88003733d2c0 going to sleep for 6 seconds
            00010000:00080000:0.0:1444222093.351595:0:24751:0:(ldlm_request.c:1307:ldlm_cli_update_pool()) @@@ Zero SLV or Limit found (SLV: 0, Limit: 31283)  req@ffff880053624080 x1514365981493080/t0(0) o60->lustre-MDT0001-mdc-ffff88007ad1d000@10.1.5.12@tcp:12/10 lens 440/192 e 0 to 0 dl 1444222100 ref 2 fl Rpc:R/0/0 rc 0/0
            00000100:00100000:0.0:1444222093.351611:0:24751:0:(client.c:2022:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc lhsmtool_posix:706467d4-9a29-db95-1bb0-fb11e4273b13:24751:1514365981493080:10.1.5.12@tcp:60
            00000100:00100000:0.0:1444222093.351627:0:24751:0:(client.c:1627:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lhsmtool_posix:706467d4-9a29-db95-1bb0-fb11e4273b13:24751:1514365981493084:10.1.5.12@tcp:60
            00000100:00100000:0.0:1444222093.351641:0:24751:0:(client.c:2308:ptlrpc_set_wait()) set ffff88003733d2c0 going to sleep for 6 seconds
            00010000:00080000:0.0:1444222093.352420:0:24751:0:(ldlm_request.c:1307:ldlm_cli_update_pool()) @@@ Zero SLV or Limit found (SLV: 0, Limit: 31283)  req@ffff880053624080 x1514365981493084/t0(0) o60->lustre-MDT0002-mdc-ffff88007ad1d000@10.1.5.12@tcp:12/10 lens 440/192 e 0 to 0 dl 1444222100 ref 2 fl Rpc:R/0/0 rc 0/0
            00000100:00100000:0.0:1444222093.352425:0:24751:0:(client.c:2022:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc lhsmtool_posix:706467d4-9a29-db95-1bb0-fb11e4273b13:24751:1514365981493084:10.1.5.12@tcp:60
            00000100:00100000:0.0:1444222093.352437:0:24751:0:(client.c:1627:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lhsmtool_posix:706467d4-9a29-db95-1bb0-fb11e4273b13:24751:1514365981493088:10.1.5.12@tcp:60
            00000100:00100000:0.0:1444222093.352463:0:24751:0:(client.c:2308:ptlrpc_set_wait()) set ffff88003733d2c0 going to sleep for 6 seconds
            00010000:00080000:0.0:1444222093.353445:0:24751:0:(ldlm_request.c:1307:ldlm_cli_update_pool()) @@@ Zero SLV or Limit found (SLV: 0, Limit: 31283)  req@ffff880053624080 x1514365981493088/t0(0) o60->lustre-MDT0003-mdc-ffff88007ad1d000@10.1.5.12@tcp:12/10 lens 440/192 e 0 to 0 dl 1444222100 ref 2 fl Rpc:R/0/0 rc 0/0
            00000100:00100000:0.0:1444222093.353451:0:24751:0:(client.c:2022:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc lhsmtool_posix:706467d4-9a29-db95-1bb0-fb11e4273b13:24751:1514365981493088:10.1.5.12@tcp:60
            00000080:00200000:0.0:1444222093.353472:0:24751:0:(file.c:332:ll_file_release()) VFS Op:inode=[0x200000002:0x2:0x0](ffff88007c2c25c0)
            00000002:00080000:0.0:1444222093.353504:0:24751:0:(mdc_request.c:776:mdc_close()) @@@ matched open  req@ffff880055a93980 x1514365981492320/t4294975516(4294975516) o101->lustre-MDT0000-mdc-ffff88007ad1d000@10.1.5.16@tcp:12/10 lens 872/544 e 0 to 0 dl 1444222096 ref 1 fl Complete:RP/4/0 rc 0/301
            00000100:00100000:0.0:1444222093.353514:0:24751:0:(client.c:1627:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc lhsmtool_posix:706467d4-9a29-db95-1bb0-fb11e4273b13:24751:1514365981493092:10.1.5.16@tcp:35
            00000100:00100000:0.0:1444222093.353543:0:24751:0:(client.c:2308:ptlrpc_set_wait()) set ffff88003733d2c0 going to sleep for 6 seconds
            00010000:00080000:0.0:1444222093.354529:0:24751:0:(ldlm_request.c:1307:ldlm_cli_update_pool()) @@@ Zero SLV or Limit found (SLV: 3378600000, Limit: 0)  req@ffff880053624080 x1514365981493092/t0(0) o35->lustre-MDT0000-mdc-ffff88007ad1d000@10.1.5.16@tcp:23/10 lens 512/424 e 0 to 0 dl 1444222100 ref 2 fl Rpc:R/0/0 rc 0/0
            00000100:00100000:0.0:1444222093.354536:0:24751:0:(client.c:2022:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc lhsmtool_posix:706467d4-9a29-db95-1bb0-fb11e4273b13:24751:1514365981493092:10.1.5.16@tcp:35
            00000002:00100000:0.0:1444222093.354547:0:24751:0:(mdc_request.c:693:mdc_free_open()) @@@ free open request
              req@ffff880055a93980 x1514365981492320/t4294975516(4294975516) o101->lustre-MDT0000-mdc-ffff88007ad1d000@10.1.5.16@tcp:12/10 lens 872/544 e 0 to 0 dl 1444222096 ref 1 fl Complete:R/4/0 rc 0/301
            00000080:00200000:0.0:1444222093.354560:0:24751:0:(file.c:332:ll_file_release()) VFS Op:inode=[0x200000007:0x1:0x0](ffff88007c2c37c0)
            
            .....................
            
            00000001:02000400:0.0:1444222114.603558:0:25835:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: pgrep -x lhsmtool_posix
            00000001:02000400:0.0:1444222115.032717:0:25890:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark  sanity-hsm test_107: @@@@@@ FAIL: Copytool failed to stop in 20s ... 
            00000080:00200000:0.0:1444222115.140512:0:24751:0:(file.c:332:ll_file_release()) VFS Op:inode=[0x200000007:0x1:0x0](ffff88007c2c37c0)
            00000001:02000400:0.0:1444222115.184598:0:25918:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: sanity-hsm test_107: @@@@@@ FAIL: Copytool failed to stop in 20s ...
            

            with no other activity traced in-between.
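
            As an aside, the per-thread trace above can be extracted on the Agent node by dumping the kernel debug buffer and filtering on the copytool PID (24751 in this run); the output path below is arbitrary:

            # Dump the Lustre debug buffer, then keep the copytool thread's
            # ll_file_release() lines (the PID is a colon-separated field).
            lctl dk /tmp/lustre-debug.log
            grep ':24751:' /tmp/lustre-debug.log | grep ll_file_release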

            I have just pushed patch #16782 to allow a full thread-stack dump on Agent nodes when the CT fails to stop.
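
            A plausible shape for such a hook, as a sketch only (the landed patch may implement the dump differently): on stop timeout, trigger a sysrq task dump on the agent so the hung thread's kernel stack appears in the console log.

            # Sketch: ask the kernel on the agent node to dump all task
            # stacks (sysrq 't') so the stuck copytool thread shows up in
            # dmesg/console output. do_node is the test-framework helper.
            dump_agent_stacks() {
                local agent=$1
                do_node $agent "echo t > /proc/sysrq-trigger" ||
                    echo "failed to trigger stack dump on $agent"
            }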


            Gerrit Updater added a comment -

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/16782
            Subject: LU-7273 tests: dump stacks upon CT stop failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4e6a979e6336fd0b4ee6484af7b8cf807b36a944


            People

              Assignee: Bruno Faccini (Inactive)
              Reporter: James Nunez (Inactive)
              Votes: 0
              Watchers: 4
