Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9991

sanity-hsm test 31c fails with 'copytools failed to stop'

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      sanity-hsm test_31c fails to archive the newly created file:

      Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
       sanity-hsm test_31c: @@@@@@ FAIL: request on 0x200000402:0x38:0x0 is not SUCCEED on mds1 
      

      and then we can shutdown the copytool:

      copytools still running on trevis-51vm6
      CMD: trevis-51vm6 pgrep -x lhsmtool_posix
      trevis-51vm6: 6036
      copytools still running on trevis-51vm6
      CMD: trevis-51vm6 echo 1 >/proc/sys/kernel/sysrq ;  echo t >/proc/sysrq-trigger
      copytools failed to stop in 200s
       sanity-hsm test_31c: @@@@@@ FAIL: copytools failed to stop 
      

      From the copytool logs, the archive was progressing and it doesn’t look like it exceeded 200 seconds unless the copytool was hung:

      1505294662.538461 lhsmtool_posix[6038]: '[0x200000402:0x38:0x0]' action ARCHIVE reclen 72, cookie=0x59b8f942
      1505294662.539849 lhsmtool_posix[6038]: processing file 'd31c.sanity-hsm/f31c.sanity-hsm'
      1505294662.555950 lhsmtool_posix[6038]: archiving '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' to '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
      1505294662.556901 lhsmtool_posix[6038]: saving stripe info of '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' in /tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp.lov
      1505294662.558302 lhsmtool_posix[6038]: start copy of 34603008 bytes from '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' to '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
      1505294692.836074 lhsmtool_posix[6038]: %90 
      1505294692.838435 lhsmtool_posix[6038]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
      1505294695.842294 lhsmtool_posix[6038]: copied 34603008 bytes in 33.285317 seconds
      1505294695.896192 lhsmtool_posix[6038]: data archiving for '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' to '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' done
      1505294695.896379 lhsmtool_posix[6038]: attr file for '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' saved to archive '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
      1505294695.897327 lhsmtool_posix[6038]: fsetxattr of 'trusted.hsm' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
      1505294695.897360 lhsmtool_posix[6038]: fsetxattr of 'trusted.version' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
      1505294695.897402 lhsmtool_posix[6038]: fsetxattr of 'trusted.link' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
      1505294695.897432 lhsmtool_posix[6038]: fsetxattr of 'trusted.lov' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
      1505294695.897476 lhsmtool_posix[6038]: fsetxattr of 'trusted.lma' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
      1505294695.897515 lhsmtool_posix[6038]: fsetxattr of 'lustre.lov' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=-1 (Operation not supported)
      1505294695.897531 lhsmtool_posix[6038]: xattr file for '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' saved to archive '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
      1505294695.898946 lhsmtool_posix[6038]: symlink '/tmp/arc1/shsm/shadow/d31c.sanity-hsm/f31c.sanity-hsm' to '../../0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0' done
      exiting: Terminated
      

      So far, we’ve only see this ‘copytool failed to stop’ error for one patch test session. Logs for this failure are at
      https://testing.hpdd.intel.com/test_sets/3e2f0620-9872-11e7-b775-5254006e85c2

      We’ve seen the fail to archive a file several times for this test and the failure has at least once been attributed to LU-7988.

      Attachments

        Issue Links

          Activity

            [LU-9991] sanity-hsm test 31c fails with 'copytools failed to stop'
            jhammond John Hammond added a comment -

            Closing as a duplicate of LU-9845.

            jhammond John Hammond added a comment - Closing as a duplicate of LU-9845 .

            James,
            In the auto-test failure logs you have reported, we can see that the copy tool's threads are stuck trying to close the Lustre file :

            09:31:59:[12891.943987] lhsmtool_posix  S 000000000000f908     0  6036      1 0x00000080
            09:31:59:[12891.945640]  ffff88006c5b3b30 0000000000000086 ffff88005e4d3f40 ffff88006c5b3fd8
            09:31:59:[12891.947482]  ffff88006c5b3fd8 ffff88006c5b3fd8 ffff88005e4d3f40 ffffffff81e87180
            09:31:59:[12891.949277]  ffff88006c5b3b60 0000000100c07565 ffffffff81e87180 000000000000f908
            09:31:59:[12891.950973] Call Trace:
            09:31:59:[12891.952275]  [<ffffffff816a94c9>] schedule+0x29/0x70
            09:31:59:[12891.953752]  [<ffffffff816a6f14>] schedule_timeout+0x174/0x2c0
            09:31:59:[12891.955315]  [<ffffffff81098b20>] ? internal_add_timer+0x70/0x70
            09:31:59:[12891.956858]  [<ffffffffc09adec0>] ptlrpc_set_wait+0x4c0/0x910 [ptlrpc]
            09:31:59:[12891.958477]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
            09:31:59:[12891.959988]  [<ffffffffc09ae38d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            09:31:59:[12891.961644]  [<ffffffffc0ac75cc>] mdc_close+0x1bc/0x8a0 [mdc]
            09:31:59:[12891.963168]  [<ffffffffc0b6f426>] lmv_close+0x216/0x540 [lmv]
            09:31:59:[12891.964760]  [<ffffffffc0bb4b48>] ll_close_inode_openhandle+0x2f8/0xe20 [lustre]
            09:31:59:[12891.966402]  [<ffffffffc0bb8270>] ll_md_real_close+0xf0/0x1e0 [lustre]
            09:31:59:[12891.968043]  [<ffffffffc0bb899d>] ll_file_release+0x63d/0xa70 [lustre]
            09:31:59:[12891.969592]  [<ffffffffc0ba424f>] ll_dir_release+0x2f/0xd0 [lustre]
            09:31:59:[12891.971179]  [<ffffffff81202fb9>] __fput+0xe9/0x260
            09:31:59:[12891.972581]  [<ffffffff8120321e>] ____fput+0xe/0x10
            09:31:59:[12891.974049]  [<ffffffff810ad247>] task_work_run+0xa7/0xf0
            09:31:59:[12891.975494]  [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
            09:31:59:[12891.977059]  [<ffffffff816b527d>] int_signal+0x12/0x17
            09:31:59:[12891.978503] lhsmtool_posix  S 000000000000f908     0  6038      1 0x00000080
            09:31:59:[12891.980173]  ffff88005d3b3b50 0000000000000086 ffff880063d1bf40 ffff88005d3b3fd8
            09:31:59:[12891.981897]  ffff88005d3b3fd8 ffff88005d3b3fd8 ffff880063d1bf40 ffff88007c100000
            09:31:59:[12891.983599]  ffff88005d3b3b80 0000000100c04b03 ffff88007c100000 000000000000f908
            09:31:59:[12891.985316] Call Trace:
            09:31:59:[12891.986518]  [<ffffffff816a94c9>] schedule+0x29/0x70
            09:31:59:[12891.987982]  [<ffffffff816a6f14>] schedule_timeout+0x174/0x2c0
            09:31:59:[12891.989450]  [<ffffffff81098b20>] ? internal_add_timer+0x70/0x70
            09:31:59:[12891.991030]  [<ffffffffc09adec0>] ptlrpc_set_wait+0x4c0/0x910 [ptlrpc]
            09:31:59:[12891.992610]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
            09:31:59:[12891.994084]  [<ffffffffc09ae38d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            09:31:59:[12891.995683]  [<ffffffffc0ac75cc>] mdc_close+0x1bc/0x8a0 [mdc]
            09:31:59:[12891.997162]  [<ffffffffc0b6f426>] lmv_close+0x216/0x540 [lmv]
            09:31:59:[12891.998691]  [<ffffffffc0bb4b48>] ll_close_inode_openhandle+0x2f8/0xe20 [lustre]
            09:31:59:[12892.000295]  [<ffffffffc0bb8270>] ll_md_real_close+0xf0/0x1e0 [lustre]
            09:31:59:[12892.001882]  [<ffffffffc0bb899d>] ll_file_release+0x63d/0xa70 [lustre]
            09:31:59:[12892.003395]  [<ffffffff81202fb9>] __fput+0xe9/0x260
            09:31:59:[12892.004846]  [<ffffffff8120321e>] ____fput+0xe/0x10
            09:31:59:[12892.006206]  [<ffffffff810ad247>] task_work_run+0xa7/0xf0
            09:31:59:[12892.007663]  [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
            09:31:59:[12892.009093]  [<ffffffff816b527d>] int_signal+0x12/0x17
            

            this explains why the Archive operation is still in Started state and not Succeed due to end of archiving could not be sent by copytool and then why they can't be killed/stopped during sub-test exit.

            And on the MDS/MDT side we can find the following error message after the archive operation has been started :

            09:25:01:[12442.007026] WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.
            

            so my guess is that the problem is on the ZFS/media side.

            bfaccini Bruno Faccini (Inactive) added a comment - James, In the auto-test failure logs you have reported, we can see that the copy tool's threads are stuck trying to close the Lustre file : 09:31:59:[12891.943987] lhsmtool_posix S 000000000000f908 0 6036 1 0x00000080 09:31:59:[12891.945640] ffff88006c5b3b30 0000000000000086 ffff88005e4d3f40 ffff88006c5b3fd8 09:31:59:[12891.947482] ffff88006c5b3fd8 ffff88006c5b3fd8 ffff88005e4d3f40 ffffffff81e87180 09:31:59:[12891.949277] ffff88006c5b3b60 0000000100c07565 ffffffff81e87180 000000000000f908 09:31:59:[12891.950973] Call Trace: 09:31:59:[12891.952275] [<ffffffff816a94c9>] schedule+0x29/0x70 09:31:59:[12891.953752] [<ffffffff816a6f14>] schedule_timeout+0x174/0x2c0 09:31:59:[12891.955315] [<ffffffff81098b20>] ? internal_add_timer+0x70/0x70 09:31:59:[12891.956858] [<ffffffffc09adec0>] ptlrpc_set_wait+0x4c0/0x910 [ptlrpc] 09:31:59:[12891.958477] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20 09:31:59:[12891.959988] [<ffffffffc09ae38d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc] 09:31:59:[12891.961644] [<ffffffffc0ac75cc>] mdc_close+0x1bc/0x8a0 [mdc] 09:31:59:[12891.963168] [<ffffffffc0b6f426>] lmv_close+0x216/0x540 [lmv] 09:31:59:[12891.964760] [<ffffffffc0bb4b48>] ll_close_inode_openhandle+0x2f8/0xe20 [lustre] 09:31:59:[12891.966402] [<ffffffffc0bb8270>] ll_md_real_close+0xf0/0x1e0 [lustre] 09:31:59:[12891.968043] [<ffffffffc0bb899d>] ll_file_release+0x63d/0xa70 [lustre] 09:31:59:[12891.969592] [<ffffffffc0ba424f>] ll_dir_release+0x2f/0xd0 [lustre] 09:31:59:[12891.971179] [<ffffffff81202fb9>] __fput+0xe9/0x260 09:31:59:[12891.972581] [<ffffffff8120321e>] ____fput+0xe/0x10 09:31:59:[12891.974049] [<ffffffff810ad247>] task_work_run+0xa7/0xf0 09:31:59:[12891.975494] [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0 09:31:59:[12891.977059] [<ffffffff816b527d>] int_signal+0x12/0x17 09:31:59:[12891.978503] lhsmtool_posix S 000000000000f908 0 6038 1 0x00000080 09:31:59:[12891.980173] ffff88005d3b3b50 0000000000000086 ffff880063d1bf40 ffff88005d3b3fd8 09:31:59:[12891.981897] ffff88005d3b3fd8 ffff88005d3b3fd8 ffff880063d1bf40 ffff88007c100000 09:31:59:[12891.983599] ffff88005d3b3b80 0000000100c04b03 ffff88007c100000 000000000000f908 09:31:59:[12891.985316] Call Trace: 09:31:59:[12891.986518] [<ffffffff816a94c9>] schedule+0x29/0x70 09:31:59:[12891.987982] [<ffffffff816a6f14>] schedule_timeout+0x174/0x2c0 09:31:59:[12891.989450] [<ffffffff81098b20>] ? internal_add_timer+0x70/0x70 09:31:59:[12891.991030] [<ffffffffc09adec0>] ptlrpc_set_wait+0x4c0/0x910 [ptlrpc] 09:31:59:[12891.992610] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20 09:31:59:[12891.994084] [<ffffffffc09ae38d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc] 09:31:59:[12891.995683] [<ffffffffc0ac75cc>] mdc_close+0x1bc/0x8a0 [mdc] 09:31:59:[12891.997162] [<ffffffffc0b6f426>] lmv_close+0x216/0x540 [lmv] 09:31:59:[12891.998691] [<ffffffffc0bb4b48>] ll_close_inode_openhandle+0x2f8/0xe20 [lustre] 09:31:59:[12892.000295] [<ffffffffc0bb8270>] ll_md_real_close+0xf0/0x1e0 [lustre] 09:31:59:[12892.001882] [<ffffffffc0bb899d>] ll_file_release+0x63d/0xa70 [lustre] 09:31:59:[12892.003395] [<ffffffff81202fb9>] __fput+0xe9/0x260 09:31:59:[12892.004846] [<ffffffff8120321e>] ____fput+0xe/0x10 09:31:59:[12892.006206] [<ffffffff810ad247>] task_work_run+0xa7/0xf0 09:31:59:[12892.007663] [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0 09:31:59:[12892.009093] [<ffffffff816b527d>] int_signal+0x12/0x17 this explains why the Archive operation is still in Started state and not Succeed due to end of archiving could not be sent by copytool and then why they can't be killed/stopped during sub-test exit. And on the MDS/MDT side we can find the following error message after the archive operation has been started : 09:25:01:[12442.007026] WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended. so my guess is that the problem is on the ZFS/media side.

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: