
[LU-3761] sanity-hsm 55: request on 0x2000013a3:0x2:0x0 is not FAILED

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.5.0

    Description

      This issue was created by maloo for Li Wei <liwei@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/b716e4c2-0523-11e3-925a-52540035b04c.

      The sub-test test_55 failed with the following error:

      request on 0x2000013a3:0x2:0x0 is not FAILED

      Info required for matching: sanity-hsm 55

      == sanity-hsm test 55: Truncate during an archive cancels it == 08:22:27 (1376493747)
      Purging archive
      Starting copytool
      lhsmtool_posix --hsm-root /tmp/arc --daemon --bandwidth 1 /mnt/lustre
      2+0 records in
      2+0 records out
      2097152 bytes (2.1 MB) copied, 0.464772 s, 4.5 MB/s
      CMD: wtm-10vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | grep 0x2000013a3:0x2:0x0 | grep action=ARCHIVE | cut -f 13 -d ' ' | cut -f 2 -d =
      [...]
      CMD: wtm-10vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | grep 0x2000013a3:0x2:0x0 | grep action=ARCHIVE | cut -f 13 -d ' ' | cut -f 2 -d =
      Update not seen after 100s: wanted 'FAILED' got ''
      sanity-hsm test_55: @@@@@@ FAIL: request on 0x2000013a3:0x2:0x0 is not FAILED
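
      For reference, a rough sketch of the sequence this sub-test exercises; this is not the actual sanity-hsm code, and the file name is a placeholder (the FID and the get_param pipeline are taken from the log above):

          # create a small test file and archive it through the bandwidth-limited copytool
          dd if=/dev/zero of=/mnt/lustre/f55 bs=1M count=2
          lfs hsm_archive /mnt/lustre/f55

          # truncating the file while the archive is still in flight should cancel it
          truncate -s 1M /mnt/lustre/f55

          # the test then polls the coordinator's action list (the CMD lines above)
          # until the ARCHIVE request for this FID reports FAILED
          fid=0x2000013a3:0x2:0x0
          lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions |
              grep $fid | grep action=ARCHIVE | cut -f 13 -d ' ' | cut -f 2 -d =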

    Activity

jay Jinshan Xiong (Inactive) added a comment -

This issue should have been fixed by LU-3815, which addressed a similar issue. Let's close it for now.

jcl jacques-charles lafoucriere added a comment -

The POSIX CT has a --bandwidth parameter which limits the visible bandwidth on reads and writes, so for a given file size you have an idea of the archive/restore time. If the file is large enough, the race will disappear.
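
For illustration, a minimal sketch of how the bandwidth cap can be used to size the race window; the hsm-root path matches the log above, while the 20 MB file size and file name are arbitrary choices, not values from this test run:

    # throttle the copytool to 1 MB/s, as in the log above
    lhsmtool_posix --hsm-root /tmp/arc --daemon --bandwidth 1 /mnt/lustre

    # a 20 MB file then needs roughly 20 s to archive, so a truncate issued right
    # after hsm_archive is almost guaranteed to land while the copy is in flight
    dd if=/dev/zero of=/mnt/lustre/f55 bs=1M count=20
    lfs hsm_archive /mnt/lustre/f55
    truncate -s 1M /mnt/lustre/f55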

jay Jinshan Xiong (Inactive) added a comment -

I tend to think this is a timing issue: we expect the archive to fail when a truncate is in progress, but the archive operation may have already finished before the truncate is started. In that case, the test fails.

I will work out a patch.

green Oleg Drokin added a comment -

So I wonder why this is a blocker. Can somebody explain the nature of the problem in some detail, please?

jay Jinshan Xiong (Inactive) added a comment -

Is this parameter for testing purposes only? I can't think of a reason why the coordinator needs to delay deleting an entry.

jcl jacques-charles lafoucriere added a comment -

There is already a tunable, cdt_delay, which sets the number of seconds before an llog entry in a finished state (SUCCESS or FAILURE) is removed. The default is 60s; sanity-hsm changes it to 10s, so this value is too small.
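
As a sketch only: the retention time is an lctl tunable on the coordinator. The comment above calls it cdt_delay; the parameter path used below (hsm.grace_delay on the MDT) is an assumption and should be checked with lctl get_param mdt.*.hsm.* on the MDS.

    # assumed parameter name -- verify with: lctl get_param mdt.*.hsm.*
    lctl get_param mdt.lustre-MDT0000.hsm.grace_delay
    # keep finished entries around longer than the 100 s polling window the test uses
    lctl set_param mdt.lustre-MDT0000.hsm.grace_delay=120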

jay Jinshan Xiong (Inactive) added a comment -

It turns out the way wait_request_state() is implemented is unreliable. It depends on the HSM llog state to track progress, but if the operation has already finished by the time the client queries the state, the check fails because there is nothing left in agent_actions.

Probably we can change the HSM llog a little bit to delay deleting items. That should be good for testing purposes.
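
A minimal sketch of the kind of polling helper being discussed, reusing the grep/cut pipeline from the log above; this is a hypothetical re-implementation, not the wait_request_state() from sanity-hsm.sh:

    # poll agent_actions for a FID/request/state, but remember whether the entry was
    # ever seen so an already-purged request is reported differently from one that
    # never reaches the wanted state
    wait_request_state_sketch() {
        local fid=$1 request=$2 state=$3 seen=no cur i
        for i in $(seq 1 100); do
            cur=$(lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions |
                  grep "$fid" | grep "action=$request" |
                  cut -f 13 -d ' ' | cut -f 2 -d =)
            [ -n "$cur" ] && seen=yes
            [ "$cur" = "$state" ] && return 0
            sleep 1
        done
        if [ "$seen" = yes ]; then
            echo "request on $fid seen but never reached $state"
        else
            echo "request on $fid no longer in agent_actions (already removed?)"
        fi
        return 1
    }

    # example: wait_request_state_sketch 0x2000013a3:0x2:0x0 ARCHIVE FAILED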

jlevi Jodi Levi (Inactive) added a comment -

Jinshan,
Can you have a look at this one and upgrade it to blocker if needed?
Thank you!

            People

              Assignee: Jinshan Xiong (Inactive)
              Reporter: Maloo