Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.13.0, Lustre 2.12.1, Lustre 2.14.0, Lustre 2.15.0
- None
- 3
- 9223372036854775807
Description
This issue was created by maloo for paf <pfarrell@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/39847bcc-3985-11e9-8f69-52540065bddc
The reported error is a checksum error, but the mirror resync itself failed entirely. The test should probably be updated to catch the failure at the resync step rather than report a checksum error later:
lock to resync file /mnt/lustre3/f200.sanity-flr with 'mirror_io resync -e resync_start' ..failed
resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
lock to resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
lock to resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
lock to resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
lock to resync file /mnt/lustre3/f200.sanity-flr with 'mirror_io resync -e resync_start' ..failed
resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
resync file /mnt/lustre3/f200.sanity-flr with 'mirror_io resync -e delay_before_copy -d 1' ..failed
resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
resync file /mnt/lustre3/f200.sanity-flr with 'mirror_io resync -e delay_before_copy -d 1' ..failed
lock to resync file /mnt/lustre3/f200.sanity-flr with 'mirror_io resync -e resync_start' ..failed
resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
resync file /mnt/lustre3/f200.sanity-flr with 'mirror_io resync -e resync_start' ..failed
resync file /mnt/lustre3/f200.sanity-flr with '/usr/bin/lfs mirror resync' ..failed
resync file /mnt/lustre3/f200.sanity-flr with 'mirror_io resync -e resync_start' ..failed
resync file /mnt/lustre3/f200.sanity-flr with 'mirror_io resync -e delay_before_copy -d 1' ..failed
resync file /mnt/lustre3/f200.sanity-flr with 'mirror_io resync -e resync_start' ..failed
resync file /mnt/lustre3/f200.sanity-flr with 'mirror_io resync -e delay_before_copy -d 1' ..Waiting 7585 7586 7587 7589 7590
failed
10.9.4.240@tcp:/lustre /mnt/lustre2 lustre rw,flock,user_xattr,lazystatfs 0 0
CMD: trevis-20vm1.trevis.whamcloud.com grep -c /mnt/lustre2' ' /proc/mounts
Stopping client trevis-20vm1.trevis.whamcloud.com /mnt/lustre2 (opts
CMD: trevis-20vm1.trevis.whamcloud.com lsof -t /mnt/lustre2
CMD: trevis-20vm1.trevis.whamcloud.com umount /mnt/lustre2 2>&1
10.9.4.240@tcp:/lustre /mnt/lustre3 lustre rw,flock,user_xattr,lazystatfs 0 0
CMD: trevis-20vm1.trevis.whamcloud.com grep -c /mnt/lustre3' ' /proc/mounts
Stopping client trevis-20vm1.trevis.whamcloud.com /mnt/lustre3 (opts
CMD: trevis-20vm1.trevis.whamcloud.com lsof -t /mnt/lustre3
CMD: trevis-20vm1.trevis.whamcloud.com umount /mnt/lustre3 2>&1
mirror_io: 524: llapi_mirror_copy_many
/mnt/lustre/f200.sanity-flr: found 10 stale components
/mnt/lustre/f200.sanity-flr: resyncing mirror: 1, components: 65537 65538 65539 65540 65541
3
sanity-flr test_200: @@@@@@ FAIL: checksum error for mirror 3
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5838:error()
= /usr/lib64/lustre/tests/sanity-flr.sh:2189:test_200()
= /usr/lib64/lustre/tests/test-framework.sh:6119:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:6158:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:6005:run_test()
= /usr/lib64/lustre/tests/sanity-flr.sh:2194:main()
Dumping lctl log to /autotest/trevis/2019-02-26/lustre-reviews-el7_6-x86_64-review-zfs-1_17_1_62058__69de2681-ac9c-46f6-a357-cca06225620a/sanity-flr.test_200.*.1551150323.log
CMD: trevis-20vm1.trevis.whamcloud.com,trevis-20vm2,trevis-20vm3,trevis-20vm4 /usr/sbin/lctl dk > /autotest/trevis/2019-02-26/lustre-reviews-el7_6-x86_64-review-zfs-1_17_1_62058__69de2681-ac9c-46f6-a357-cca06225620a/sanity-flr.test_200.debug_log.$(hostname -s).1551150323.log;
dmesg > /autotest/trevis/2019-02-26/lustre-reviews-el7_6-x86_64-review-zfs-1_17_1_62058__69de2681-ac9c-46f6-a357-cca06225620a/sanity-flr.test_200.dmesg.$(hostname -s).1551150323.log
Resetting fail_loc on all nodes...CMD: trevis-20vm1.trevis.whamcloud.com,trevis-20vm2,trevis-20vm3,trevis-20vm4 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
done.
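The suggestion above, failing at the resync step instead of at the later checksum comparison, could be sketched roughly as follows. This is only an illustration, not the actual sanity-flr.sh code: the helper name final_resync_or_fail is hypothetical, and the resync command is passed in as arguments (an assumption made so the sketch can be exercised without a Lustre mount).

```shell
# Hypothetical helper, not part of test-framework.sh or sanity-flr.sh.
# Usage: final_resync_or_fail FILE CMD [ARGS...]
# Runs "CMD ARGS... FILE" and reports a resync-specific failure right away,
# so a failed resync is not misreported later as a checksum error.
final_resync_or_fail() {
	local file=$1
	shift

	if ! "$@" "$file"; then
		echo "FAIL: final resync of $file failed" >&2
		return 1
	fi
	echo "final resync of $file succeeded"
}

# Assumed call site in the test, after the writer threads have stopped:
#   final_resync_or_fail "$tfile" lfs mirror resync || error "resync failed"
```

With a check like this in place, the checksum comparison would only run against mirrors that the resync step claims to have brought up to date.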
Issue Links
- is related to
  - LU-11226 sanity-flr test 200 fails with 'checksum error for mirror 3' (Resolved)
  - LU-14966 sanity-flr test_200: FAIL: checksum error for mirror 2: lfs mirror: '/mnt/lustre/f200.sanity-flr' llapi_mirror_resync_many: Input/output error (Resolved)
James, I don't think the resync errors should be considered test failures. If the file changes while a resync is in progress, the resync is aborted and has to be run again. That is just how FLR is currently implemented.
However, the resync at the end of the test (after the write threads have been stopped) should properly resync the stale mirrors. It isn't clear why this test is still using "mirror_io resync" instead of "lfs mirror resync", since the latter is the tool that is used in production and is the tool we care about working properly. The mirror_io tool was a temporary FLR development tool, and its use and code should probably be removed (I am not aware of any functionality it has that is not available via "lfs mirror", but if there is, we should consider moving it over).
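The distinction drawn here, that a resync racing with writers may be aborted and simply needs to be run again, could be expressed as a small retry wrapper. This is a sketch under assumptions rather than the fix that was applied: retry_resync and its retry limit are hypothetical, and the resync command is supplied by the caller.

```shell
# Hypothetical helper, not part of the Lustre test framework.
# Usage: retry_resync MAX_TRIES FILE CMD [ARGS...]
# Re-runs "CMD ARGS... FILE" until it succeeds or MAX_TRIES is reached.
# An aborted attempt is only logged; exhausting all attempts is the failure.
retry_resync() {
	local max_tries=$1
	local file=$2
	local try=1
	shift 2

	while [ "$try" -le "$max_tries" ]; do
		if "$@" "$file"; then
			echo "resync of $file succeeded on attempt $try"
			return 0
		fi
		echo "resync of $file did not complete (attempt $try of $max_tries)" >&2
		try=$((try + 1))
	done
	return 1
}

# Assumed usage while writers may still be active:
#   retry_resync 5 "$tfile" lfs mirror resync
```

Once the write threads have stopped, a single attempt would be expected to succeed, so a failure at that point is worth reporting as a real error.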