[LU-14208] sanity-hsm test 55 fails with 'request on 0xM:0xN:0x0 is not FAILED on mds1' Created: 10/Dec/20 Updated: 17/Dec/20 Resolved: 17/Dec/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | rhel8.3 | ||
| Environment: |
RHEL8.3 severs/clients |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
sanity-hsm test_55 fails with 'request on 0x200001b79:0x4:0x0 is not FAILED on mds1' for RHEL 8.3 client/server testing for review-dne-zfs-part-2 and review-dne-part-2 (ldiskfs). Although test 55 has failed many times with this message, the failures for the REHL8.3 testing have a consistent copytool_log where the copy is stuck at 8% (“%8”?): lhsmtool_posix: 1607364549.271481 lhsmtool_posix[152920]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre2 lhsmtool_posix: 1607364549.279162 lhsmtool_posix[152922]: waiting for message from kernel lhsmtool_posix: 1607364549.852166 lhsmtool_posix[152922]: copytool fs=lustre archive#=2 item_count=1 lhsmtool_posix: 1607364549.852331 lhsmtool_posix[152922]: waiting for message from kernel lhsmtool_posix: 1607364549.852363 lhsmtool_posix[152923]: '[0x200001b79:0x4:0x0]' action ARCHIVE reclen 72, cookie=0x5fce6f8f lhsmtool_posix: 1607364549.853177 lhsmtool_posix[152923]: processing file 'd55.sanity-hsm/f55.sanity-hsm' lhsmtool_posix: 1607364549.868566 lhsmtool_posix[152923]: archiving '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' to '/tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp' lhsmtool_posix: 1607364549.868732 lhsmtool_posix[152923]: saving stripe info of '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' in /tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp.lov lhsmtool_posix: 1607364549.870442 lhsmtool_posix[152923]: start copy of 39000000 bytes from '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' to '/tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp' lhsmtool_posix: 1607364579.001120 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364609.001120 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364639.000122 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364669.001121 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364699.001121 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364729.001120 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364759.001121 lhsmtool_posix[152923]: %8 exiting: Terminated The output from the client looks as expected for these failures, from the suite_log, write the file, start the archive and wait for the expected state which, in this case, is “FAILED” == sanity-hsm test 55: Truncate during an archive cancels it ========================================= 18:09:06 (1607364546)
39+0 records in
39+0 records out
39000000 bytes (39 MB, 37 MiB) copied, 0.146523 s, 266 MB/s
CMD: trevis-202vm3 mkdir -p /tmp/arc1/sanity-hsm.test_55/
Starting copytool agt1 on trevis-202vm3
CMD: trevis-202vm3 lhsmtool_posix --daemon --hsm-root "/tmp/arc1/sanity-hsm.test_55/" --bandwidth 1 "/mnt/lustre2" < /dev/null > "/autotest/autotest-1/2020-12-07/lustre-reviews_review-dne-zfs-part-2_78162_1_108_a473c2cb-2a7d-4640-8bee-2a175a1428d2/sanity-hsm.test_55.copytool_log.trevis-202vm3.log" 2>&1
CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: trevis-202vm5 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm.policy=+NRA
mdt.lustre-MDT0000.hsm.policy=+NRA
CMD: trevis-202vm6 /usr/sbin/lctl set_param mdt.lustre-MDT0001.hsm.policy=+NRA
mdt.lustre-MDT0001.hsm.policy=+NRA
CMD: trevis-202vm5 /usr/sbin/lctl set_param mdt.lustre-MDT0002.hsm.policy=+NRA
mdt.lustre-MDT0002.hsm.policy=+NRA
CMD: trevis-202vm6 /usr/sbin/lctl set_param mdt.lustre-MDT0003.hsm.policy=+NRA
mdt.lustre-MDT0003.hsm.policy=+NRA
CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Waiting 200s for 'FAILED'
CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
…
CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Update not seen after 200s: want 'FAILED' got 'STARTED'
sanity-hsm test_55: @@@@@@ FAIL: request on 0x200001b79:0x4:0x0 is not FAILED on mds1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6257:error()
= /usr/lib64/lustre/tests/test-framework.sh:10395:wait_request_state()
= /usr/lib64/lustre/tests/sanity-hsm.sh:2865:test_55()
Logs for these failures are at |
| Comments |
| Comment by Gerrit Updater [ 11/Dec/20 ] |
|
John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40941 |
| Comment by James A Simmons [ 11/Dec/20 ] |
|
While reverting helps us to pass the test suite this shows that Lustre does not support copy_file_range() in any form on RHEL8.3 This function is starting to be adopted by external applications which is worrying. |
| Comment by John Hammond [ 11/Dec/20 ] |
|
> While reverting helps us to pass the test suite this shows that Lustre does not support copy_file_range() in any form on RHEL8.3 This function is starting to be adopted by external applications which is worrying. Then please open a new ticket to introduce standalone test code to exercise this functionality. Once it's correctness and the performance benefit is demonstrated I will not object to it being included in any of the utilities. |
| Comment by James A Simmons [ 11/Dec/20 ] |
|
Since this is using the default kernel path for copy_file_range() that means most file systems will have broken copy_file_range() implementations. We should report this failure to RedHat since it will impact a much larger audience. The question is does this bug only show up for a specific conditions. |
| Comment by John Hammond [ 11/Dec/20 ] |
|
This is not an upstream bug, When copy_file_range() is used in the copytool we do not check for EOF (by testing for a return value of 0). That is a defect and is why the test fails. From https://review.whamcloud.com/#/c/38651/5/lustre/utils/lhsmtool_posix.c /* Try the accelerated copy path first. Once LU-10180 lands * we should deal with holes. */ wsize = copy_file_range(src_fd, NULL, dst_fd, NULL, chunk, 0); if (wsize != -1) goto fini_fastcopy; rc = -errno; /* Before Linux kernel 5.3 copy_file_range only supported * file copies between filesystems of the same type. In * that case copy_file_range() will return -EXDEV. */ if (rc != -EXDEV && rc != -ENOSYS) { CT_ERROR(rc, "copy_file_range failed for '%s' to '%s'", src, dst); break; } But still, this should be reverted rather than simply fixed in place since copy_file_range() is currently 2x slower than normal read()+write() copy. Once that is fixed then we can integrate copy_file_range() into lhsmtool_posix. |
| Comment by James A Simmons [ 11/Dec/20 ] |
|
So Lustre kernel implement has to be done first then. We have tested internally at ORNL and we saw read + write have the same performance as copy_file_range() using real hardware. Andreas did point out in VMs with fake disk it doesn't perform as well. Mind you that is a not a real world example. I also have to ask is lhsm_posix used in production systems ? |
| Comment by John Hammond [ 14/Dec/20 ] |
|
> I also have to ask is lhsm_posix used in production systems ? Yes. It is also intended to be a reference for implementations of other copytools. |
| Comment by Gerrit Updater [ 14/Dec/20 ] |
|
John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40966 |
| Comment by Gerrit Updater [ 17/Dec/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40966/ |
| Comment by Peter Jones [ 17/Dec/20 ] |
|
Landed for 2.14 |