Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.14.0
-
RHEL8.3 severs/clients
-
3
-
9223372036854775807
Description
sanity-hsm test_55 fails with 'request on 0x200001b79:0x4:0x0 is not FAILED on mds1' for RHEL 8.3 client/server testing for review-dne-zfs-part-2 and review-dne-part-2 (ldiskfs). Although test 55 has failed many times with this message, the failures for the REHL8.3 testing have a consistent copytool_log where the copy is stuck at 8% (“%8”?):
lhsmtool_posix: 1607364549.271481 lhsmtool_posix[152920]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre2 lhsmtool_posix: 1607364549.279162 lhsmtool_posix[152922]: waiting for message from kernel lhsmtool_posix: 1607364549.852166 lhsmtool_posix[152922]: copytool fs=lustre archive#=2 item_count=1 lhsmtool_posix: 1607364549.852331 lhsmtool_posix[152922]: waiting for message from kernel lhsmtool_posix: 1607364549.852363 lhsmtool_posix[152923]: '[0x200001b79:0x4:0x0]' action ARCHIVE reclen 72, cookie=0x5fce6f8f lhsmtool_posix: 1607364549.853177 lhsmtool_posix[152923]: processing file 'd55.sanity-hsm/f55.sanity-hsm' lhsmtool_posix: 1607364549.868566 lhsmtool_posix[152923]: archiving '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' to '/tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp' lhsmtool_posix: 1607364549.868732 lhsmtool_posix[152923]: saving stripe info of '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' in /tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp.lov lhsmtool_posix: 1607364549.870442 lhsmtool_posix[152923]: start copy of 39000000 bytes from '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' to '/tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp' lhsmtool_posix: 1607364579.001120 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364609.001120 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364639.000122 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364669.001121 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364699.001121 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364729.001120 lhsmtool_posix[152923]: %8 lhsmtool_posix: 1607364759.001121 lhsmtool_posix[152923]: %8 exiting: Terminated
The output from the client looks as expected for these failures, from the suite_log, write the file, start the archive and wait for the expected state which, in this case, is “FAILED”
== sanity-hsm test 55: Truncate during an archive cancels it ========================================= 18:09:06 (1607364546) 39+0 records in 39+0 records out 39000000 bytes (39 MB, 37 MiB) copied, 0.146523 s, 266 MB/s CMD: trevis-202vm3 mkdir -p /tmp/arc1/sanity-hsm.test_55/ Starting copytool agt1 on trevis-202vm3 CMD: trevis-202vm3 lhsmtool_posix --daemon --hsm-root "/tmp/arc1/sanity-hsm.test_55/" --bandwidth 1 "/mnt/lustre2" < /dev/null > "/autotest/autotest-1/2020-12-07/lustre-reviews_review-dne-zfs-part-2_78162_1_108_a473c2cb-2a7d-4640-8bee-2a175a1428d2/sanity-hsm.test_55.copytool_log.trevis-202vm3.log" 2>&1 CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: trevis-202vm5 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm.policy=+NRA mdt.lustre-MDT0000.hsm.policy=+NRA CMD: trevis-202vm6 /usr/sbin/lctl set_param mdt.lustre-MDT0001.hsm.policy=+NRA mdt.lustre-MDT0001.hsm.policy=+NRA CMD: trevis-202vm5 /usr/sbin/lctl set_param mdt.lustre-MDT0002.hsm.policy=+NRA mdt.lustre-MDT0002.hsm.policy=+NRA CMD: trevis-202vm6 /usr/sbin/lctl set_param mdt.lustre-MDT0003.hsm.policy=+NRA mdt.lustre-MDT0003.hsm.policy=+NRA CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Waiting 200s for 'FAILED' CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= … CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Update not seen after 200s: want 'FAILED' got 'STARTED' sanity-hsm test_55: @@@@@@ FAIL: request on 0x200001b79:0x4:0x0 is not FAILED on mds1 Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:6257:error() = /usr/lib64/lustre/tests/test-framework.sh:10395:wait_request_state() = /usr/lib64/lustre/tests/sanity-hsm.sh:2865:test_55()
Logs for these failures are at
https://testing.whamcloud.com/test_sets/358b2b64-faf2-4877-b5af-3f8a6cb54155
https://testing.whamcloud.com/test_sets/1f82471e-e5c3-4057-9064-928b06e616a1
Attachments
Issue Links
- is related to
-
LU-11621 Add copy_file_range() API and use it for lfs migrate and mirror resync
- Open