Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14208

sanity-hsm test 55 fails with 'request on 0xM:0xN:0x0 is not FAILED on mds1'

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • Lustre 2.14.0
    • RHEL8.3 severs/clients
    • 3
    • 9223372036854775807

    Description

      sanity-hsm test_55 fails with 'request on 0x200001b79:0x4:0x0 is not FAILED on mds1' for RHEL 8.3 client/server testing for review-dne-zfs-part-2 and review-dne-part-2 (ldiskfs). Although test 55 has failed many times with this message, the failures for the REHL8.3 testing have a consistent copytool_log where the copy is stuck at 8% (“%8”?):

      lhsmtool_posix: 1607364549.271481 lhsmtool_posix[152920]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre2
      lhsmtool_posix: 1607364549.279162 lhsmtool_posix[152922]: waiting for message from kernel
      lhsmtool_posix: 1607364549.852166 lhsmtool_posix[152922]: copytool fs=lustre archive#=2 item_count=1
      lhsmtool_posix: 1607364549.852331 lhsmtool_posix[152922]: waiting for message from kernel
      lhsmtool_posix: 1607364549.852363 lhsmtool_posix[152923]: '[0x200001b79:0x4:0x0]' action ARCHIVE reclen 72, cookie=0x5fce6f8f
      lhsmtool_posix: 1607364549.853177 lhsmtool_posix[152923]: processing file 'd55.sanity-hsm/f55.sanity-hsm'
      lhsmtool_posix: 1607364549.868566 lhsmtool_posix[152923]: archiving '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' to '/tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp'
      lhsmtool_posix: 1607364549.868732 lhsmtool_posix[152923]: saving stripe info of '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' in /tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp.lov
      lhsmtool_posix: 1607364549.870442 lhsmtool_posix[152923]: start copy of 39000000 bytes from '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' to '/tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp'
      lhsmtool_posix: 1607364579.001120 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364609.001120 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364639.000122 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364669.001121 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364699.001121 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364729.001120 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364759.001121 lhsmtool_posix[152923]: %8 
      exiting: Terminated
      

      The output from the client looks as expected for these failures, from the suite_log, write the file, start the archive and wait for the expected state which, in this case, is “FAILED”

      == sanity-hsm test 55: Truncate during an archive cancels it ========================================= 18:09:06 (1607364546)
      39+0 records in
      39+0 records out
      39000000 bytes (39 MB, 37 MiB) copied, 0.146523 s, 266 MB/s
      CMD: trevis-202vm3 mkdir -p /tmp/arc1/sanity-hsm.test_55/
      Starting copytool agt1 on trevis-202vm3
      CMD: trevis-202vm3 lhsmtool_posix  --daemon --hsm-root "/tmp/arc1/sanity-hsm.test_55/" --bandwidth 1 "/mnt/lustre2" < /dev/null > "/autotest/autotest-1/2020-12-07/lustre-reviews_review-dne-zfs-part-2_78162_1_108_a473c2cb-2a7d-4640-8bee-2a175a1428d2/sanity-hsm.test_55.copytool_log.trevis-202vm3.log" 2>&1
      CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: trevis-202vm5 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm.policy=+NRA
      mdt.lustre-MDT0000.hsm.policy=+NRA
      CMD: trevis-202vm6 /usr/sbin/lctl set_param mdt.lustre-MDT0001.hsm.policy=+NRA
      mdt.lustre-MDT0001.hsm.policy=+NRA
      CMD: trevis-202vm5 /usr/sbin/lctl set_param mdt.lustre-MDT0002.hsm.policy=+NRA
      mdt.lustre-MDT0002.hsm.policy=+NRA
      CMD: trevis-202vm6 /usr/sbin/lctl set_param mdt.lustre-MDT0003.hsm.policy=+NRA
      mdt.lustre-MDT0003.hsm.policy=+NRA
      CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      Waiting 200s for 'FAILED'
      CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      …
      CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      Update not seen after 200s: want 'FAILED' got 'STARTED'
       sanity-hsm test_55: @@@@@@ FAIL: request on 0x200001b79:0x4:0x0 is not FAILED on mds1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6257:error()
        = /usr/lib64/lustre/tests/test-framework.sh:10395:wait_request_state()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:2865:test_55()
      

      Logs for these failures are at
      https://testing.whamcloud.com/test_sets/358b2b64-faf2-4877-b5af-3f8a6cb54155
      https://testing.whamcloud.com/test_sets/1f82471e-e5c3-4057-9064-928b06e616a1

      Attachments

        Issue Links

          Activity

            [LU-14208] sanity-hsm test 55 fails with 'request on 0xM:0xN:0x0 is not FAILED on mds1'
            pjones Peter Jones added a comment -

            Landed for 2.14

            pjones Peter Jones added a comment - Landed for 2.14

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40966/
            Subject: LU-14208 utils: remove copy_file_range() usage from copytool
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8294b497f27f66c0f388889ae6d8b29d916a17b4

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40966/ Subject: LU-14208 utils: remove copy_file_range() usage from copytool Project: fs/lustre-release Branch: master Current Patch Set: Commit: 8294b497f27f66c0f388889ae6d8b29d916a17b4

            John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40966
            Subject: LU-14208 utils: remove copy_file_range() usage from copytool
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b03fe0ddb1f0da1af18b16928074b80e3f2f008b

            gerrit Gerrit Updater added a comment - John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40966 Subject: LU-14208 utils: remove copy_file_range() usage from copytool Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: b03fe0ddb1f0da1af18b16928074b80e3f2f008b
            jhammond John Hammond added a comment -

            > I also have to ask is lhsm_posix used in production systems ?

            Yes. It is also intended to be a reference for implementations of other copytools.

            jhammond John Hammond added a comment - > I also have to ask is lhsm_posix used in production systems ? Yes. It is also intended to be a reference for implementations of other copytools.
            simmonsja James A Simmons added a comment - - edited

            So Lustre kernel implement has to be done first then. We have tested internally at ORNL and we saw read + write have the same performance as copy_file_range() using real hardware. Andreas did point out in VMs with fake disk it doesn't perform as well. Mind you that is a not a real world example. I also have to ask is lhsm_posix used in production systems ?

            simmonsja James A Simmons added a comment - - edited So Lustre kernel implement has to be done first then. We have tested internally at ORNL and we saw read + write have the same performance as copy_file_range() using real hardware. Andreas did point out in VMs with fake disk it doesn't perform as well. Mind you that is a not a real world example. I also have to ask is lhsm_posix used in production systems ?
            jhammond John Hammond added a comment -

            This is not an upstream bug, When copy_file_range() is used in the copytool we do not check for EOF (by testing for a return value of 0). That is a defect and is why the test fails. From https://review.whamcloud.com/#/c/38651/5/lustre/utils/lhsmtool_posix.c

            		/* Try the accelerated copy path first. Once LU-10180 lands
            		 * we should deal with holes.
            		 */
            		wsize = copy_file_range(src_fd, NULL, dst_fd, NULL,
            					chunk, 0);
            		if (wsize != -1)
            			goto fini_fastcopy;
            		rc = -errno;
            		/* Before Linux kernel 5.3 copy_file_range only supported
            		 * file copies between filesystems of the same type. In
            		 * that case copy_file_range() will return -EXDEV.
            		 */
            		if (rc != -EXDEV && rc != -ENOSYS) {
            			CT_ERROR(rc, "copy_file_range failed for '%s' to '%s'",
            				 src, dst);
            			break;
            		}
            

            But still, this should be reverted rather than simply fixed in place since copy_file_range() is currently 2x slower than normal read()+write() copy. Once that is fixed then we can integrate copy_file_range() into lhsmtool_posix.

            jhammond John Hammond added a comment - This is not an upstream bug, When copy_file_range() is used in the copytool we do not check for EOF (by testing for a return value of 0). That is a defect and is why the test fails. From https://review.whamcloud.com/#/c/38651/5/lustre/utils/lhsmtool_posix.c /* Try the accelerated copy path first. Once LU-10180 lands * we should deal with holes. */ wsize = copy_file_range(src_fd, NULL, dst_fd, NULL, chunk, 0); if (wsize != -1) goto fini_fastcopy; rc = -errno; /* Before Linux kernel 5.3 copy_file_range only supported * file copies between filesystems of the same type. In * that case copy_file_range() will return -EXDEV. */ if (rc != -EXDEV && rc != -ENOSYS) { CT_ERROR(rc, "copy_file_range failed for '%s' to '%s' " , src, dst); break ; } But still, this should be reverted rather than simply fixed in place since copy_file_range() is currently 2x slower than normal read()+write() copy. Once that is fixed then we can integrate copy_file_range() into lhsmtool_posix .

            Since this is using the default kernel path for copy_file_range() that means most file systems will have broken copy_file_range() implementations. We should report this failure to RedHat since it will impact a much larger audience. The question is does this bug only show up for a specific conditions.

            simmonsja James A Simmons added a comment - Since this is using the default kernel path for copy_file_range() that means most file systems will have broken copy_file_range() implementations. We should report this failure to RedHat since it will impact a much larger audience. The question is does this bug only show up for a specific conditions.
            jhammond John Hammond added a comment -

            > While reverting helps us to pass the test suite this shows that Lustre does not support copy_file_range() in any form on RHEL8.3 This function is starting to be adopted by external applications which is worrying.

            Then please open a new ticket to introduce standalone test code to exercise this functionality. Once it's correctness and the performance benefit is demonstrated I will not object to it being included in any of the utilities.

            jhammond John Hammond added a comment - > While reverting helps us to pass the test suite this shows that Lustre does not support copy_file_range() in any form on RHEL8.3 This function is starting to be adopted by external applications which is worrying. Then please open a new ticket to introduce standalone test code to exercise this functionality. Once it's correctness and the performance benefit is demonstrated I will not object to it being included in any of the utilities.

            While reverting helps us to pass the test suite this shows that Lustre does not support copy_file_range() in any form on RHEL8.3 This function is starting to be adopted by external applications which is worrying.

            simmonsja James A Simmons added a comment - While reverting helps us to pass the test suite this shows that Lustre does not support copy_file_range() in any form on RHEL8.3 This function is starting to be adopted by external applications which is worrying.

            John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40941
            Subject: LU-14208 utils: revert copy_file_range() usage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bce74ae6a7254dcffa7fb7d562fd9da08a611a1b

            gerrit Gerrit Updater added a comment - John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40941 Subject: LU-14208 utils: revert copy_file_range() usage Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: bce74ae6a7254dcffa7fb7d562fd9da08a611a1b

            People

              jhammond John Hammond
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: